[
  {
    "path": ".codecov.yml",
    "content": "codecov:\n    notify:\n        after_n_builds: 5\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/blank_issue.md",
    "content": "---\nname: Blank Issue\nabout: Create a blank issue\ntitle: ''\nlabels: ''\nassignees: ''\n\n---\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/bug_report.md",
    "content": "---\nname: Bug Report\nabout: Create a bug report to help us improve Featuretools\ntitle: ''\nlabels: 'bug'\nassignees: ''\n\n---\n\n[A clear and concise description of what the bug is.]\n\n#### Code Sample, a copy-pastable example to reproduce your bug.\n\n```python\n# Your code here\n\n```\n\n#### Output of ``featuretools.show_info()``\n\n<details>\n\n[paste the output of ``featuretools.show_info()`` here below this line]\n\n</details>\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/config.yml",
    "content": "blank_issues_enabled: true\ncontact_links:\n  - name: General Technical Question\n    about: \"If you have a question like *How should I create my EntitySet?* you can ask on StackOverflow using the #featuretools tag.\"\n    url: https://stackoverflow.com/questions/tagged/featuretools\n  - name: Real-time chat\n    url: https://join.slack.com/t/alteryx-oss/shared_invite/zt-182tyvuxv-NzIn6eiCEf8TBziuKp0bNA\n    about: \"If you want to meet others in the community and chat about all things Alteryx OSS, check out our Slack.\"\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/documentation_improvement.md",
    "content": "---\nname: Documentation Improvement\nabout: Suggest an idea for improving the documentation\ntitle: ''\nlabels: 'documentation'\nassignees: ''\n\n---\n\n[A description of what documentation you believe needs to be fixed/improved.]\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/feature_request.md",
    "content": "---\nname: Feature Request\nabout: Suggest an idea for this project\ntitle: ''\nlabels: 'new feature'\nassignees: ''\n\n---\n\n- As a [user/developer], I wish I could use Featuretools to ...\n\n#### Code Example\n\n```python\n# Your code here, if applicable\n\n```\n"
  },
  {
    "path": ".github/auto_assign.yml",
    "content": "# Set to author to make the PR creator the assignee\naddAssignees: author\n"
  },
  {
    "path": ".github/workflows/auto_approve_dependency_PRs.yaml",
    "content": "name: Auto Approve Dependency PRs\non:\n  schedule:\n      - cron: '*/30 * * * *'\n  workflow_dispatch:\n  workflow_run:\n    workflows: [\"Unit Tests - Latest Dependencies\", \"Unit Tests - 3.9 Minimum Dependencies\"]\n    branches:\n      - 'latest-dep-update-[a-f0-9]+'\n      - 'min-dep-update-[a-f0-9]+'\n    types:\n      - completed\njobs:\n  build:\n    if: ${{ github.repository_owner == 'alteryx' }}\n    runs-on: ubuntu-latest\n    steps:\n      - name: Find dependency PRs\n        id: find_prs\n        run: |\n          gh auth status\n          gh pr list --repo \"${{ github.repository }}\" --assignee \"machineFL\" --base main --state open --search \"status:success review:required\" --limit 1 --json number > dep_PRs_waiting_approval.json\n          dep_pull_request=$(cat dep_PRs_waiting_approval.json | grep -Eo \"[0-9]*\")\n          echo \"dep_pull_request=${dep_pull_request}\" >> $GITHUB_OUTPUT\n        env:\n          GITHUB_TOKEN: ${{ secrets.AUTO_APPROVE_TOKEN }}\n      - name: Approve dependency PRs and enable auto-merge\n        if: ${{ steps.find_prs.outputs.dep_pull_request > 1 }}\n        run: |\n          gh pr review --repo \"${{ github.repository }}\" --comment --body \"auto approve\" ${{ steps.find_prs.outputs.dep_pull_request }}\n          gh pr review --repo \"${{ github.repository }}\" --approve ${{ steps.find_prs.outputs.dep_pull_request }}\n          gh pr merge --repo \"${{ github.repository }}\" --auto --squash --delete-branch ${{ steps.find_prs.outputs.dep_pull_request }}\n        env:\n          GITHUB_TOKEN: ${{ secrets.AUTO_APPROVE_TOKEN }}\n"
  },
  {
    "path": ".github/workflows/broken_link_check.yaml",
    "content": "name: Broken link check\non:\n  workflow_dispatch:\n  schedule:\n    - cron: \"0 0 * * 1\"\n\njobs:\n  my-broken-link-checker:\n    name: Check for broken links\n    runs-on: ubuntu-latest\n    strategy:\n      fail-fast: false\n    steps:\n      - name: Check for broken links\n        uses: ruzickap/action-my-broken-link-checker@v2\n        with:\n          url: https://featuretools.alteryx.com/en/latest/\n          cmd_params: '--max-connections=10 --color=always --ignore-fragments --buffer-size=8192 --skip-tls-verification --exclude=\"(twitter|github|cloudflare|featuretools\\\\.alteryx\\\\.com\\\\/en\\\\/(stable|main|v.+).*)\"'\n      - name: Add to job output\n        run: echo \"${{steps.link-report.outputs.result}}\" >> $GITHUB_STEP_SUMMARY\n"
  },
  {
    "path": ".github/workflows/build_docs.yaml",
    "content": "name: Build Docs\non:\n  pull_request:\n    types: [opened, synchronize]\n  push:\n    branches:\n      - main\n  workflow_dispatch:\nenv:\n  PYARROW_IGNORE_TIMEZONE: 1\n  JAVA_HOME: \"/usr/lib/jvm/java-11-openjdk-amd64\"\njobs:\n  build_docs:\n    name: ${{ matrix.python_version }} build docs\n    runs-on: ubuntu-latest\n    strategy:\n      fail-fast: false\n      matrix:\n        python_version: [\"3.9\", \"3.10\", \"3.11\", \"3.12\"]\n    steps:\n      - name: Checkout repository\n        uses: actions/checkout@v3\n        with:\n          ref: ${{ github.event.pull_request.head.ref }}\n          repository: ${{ github.event.pull_request.head.repo.full_name }}\n      - name: Set up python ${{ matrix.python_version }}\n        uses: actions/setup-python@v4\n        with:\n          python-version: ${{ matrix.python_version }}\n          cache: 'pip' \n          cache-dependency-path: 'pyproject.toml'\n      - uses: actions/cache@v3\n        id: cache\n        with:\n          path: ${{ env.pythonLocation }} \n          key: ${{ matrix.python_version }}-docs-${{ env.pythonLocation }}-${{ hashFiles('**/pyproject.toml') }}-v01\n      - name: Build featuretools package\n        run: |\n          make package\n      - name: Install complete version of featuretools from sdist (not using cache)\n        if: steps.cache.outputs.cache-hit != 'true'\n        run: |\n          python -m pip install \"unpacked_sdist/[dev]\"\n      - name: Install complete version of featuretools from sdist (using cache)\n        if: steps.cache.outputs.cache-hit == 'true'\n        run: |\n          python -m pip install \"unpacked_sdist/[dev]\" --no-deps\n      - name: Install apt packages\n        run: |\n          sudo apt update\n          sudo apt install -y pandoc\n          sudo apt install -y graphviz\n          python -m pip check\n      - name: Build docs\n        run: make -C docs/ -e \"SPHINXOPTS=-W -j auto\" clean html\n"
  },
  {
    "path": ".github/workflows/create_feedstock_pr.yaml",
    "content": "on:\n  workflow_dispatch:\n    inputs:\n      version:\n        description: 'released PyPI version to use (ex - v1.11.1)'\n        required: true\n\nname: Create Feedstock PR\njobs:\n  create_feedstock_pr:\n    name: Create Feedstock PR\n    runs-on: ubuntu-latest\n    steps:\n      - name: Checkout inputted version\n        uses: actions/checkout@v3\n        with:\n          repository: ${{ github.event.pull_request.head.repo.full_name }}\n          ref: ${{ github.event.inputs.version }}\n          path: \"./featuretools\"\n      - name: Pull latest from upstream for user forked feedstock\n        run: |\n          gh auth status\n          gh repo sync alteryx/featuretools-feedstock --branch main --source conda-forge/featuretools-feedstock --force\n        env:\n          GITHUB_TOKEN: ${{ secrets.AUTO_APPROVE_TOKEN }}\n      - uses: actions/checkout@v3\n        with:\n          repository: alteryx/featuretools-feedstock\n          ref: main\n          path: \"./featuretools-feedstock\"\n          fetch-depth: '0'\n      - name: Run Create Feedstock meta YAML\n        id: create-feedstock-meta\n        uses: alteryx/create-feedstock-meta-yaml@v4\n        with:\n          project: \"featuretools\"\n          pypi_version: ${{ github.event.inputs.version }}\n          project_metadata_filepath: \"featuretools/pyproject.toml\"\n          meta_yaml_filepath: \"featuretools-feedstock/recipe/meta.yaml\"\n          add_to_test_requirements: \"graphviz !=2.47.2\"\n      - name: View updated meta yaml\n        run: cat featuretools-feedstock/recipe/meta.yaml\n      - name: Push updated yaml\n        run: |\n          cd featuretools-feedstock\n          git config --unset-all http.https://github.com/.extraheader\n          git config --global user.email \"machineOSS@alteryx.com\"\n          git config --global user.name \"machineAYX Bot\"\n          git remote set-url origin https://${{ secrets.AUTO_APPROVE_TOKEN 
}}@github.com/alteryx/featuretools-feedstock\n          git checkout -b ${{ github.event.inputs.version }}\n          git add recipe/meta.yaml\n          git commit -m \"${{ github.event.inputs.version }}\"\n          git push origin ${{ github.event.inputs.version }}\n      - name: Adding URL to job output\n        run: |\n          echo \"Conda Feedstock Pull Request: https://github.com/alteryx/featuretools-feedstock/pull/new/${{ github.event.inputs.version }}\" >> $GITHUB_STEP_SUMMARY\n"
  },
  {
    "path": ".github/workflows/install_test.yaml",
    "content": "name: Install Test\non:\n  pull_request:\n    types: [opened, synchronize]\n  push:\n    branches:\n      - main\nenv:\n  ALTERYX_OPEN_SRC_UPDATE_CHECKER: False\njobs:\n  install_ft_complete:\n    name: ${{ matrix.os }} - ${{ matrix.python_version }} install featuretools complete\n    strategy:\n      fail-fast: false\n      matrix:\n        os: [ubuntu-latest, macos-latest, windows-latest]\n        python_version: [\"3.9\", \"3.10\", \"3.11\", \"3.12\"]\n    runs-on: ${{ matrix.os }}\n    steps:\n      - name: Checkout repository\n        uses: actions/checkout@v3\n        with:\n          ref: ${{ github.event.pull_request.head.ref }}\n          repository: ${{ github.event.pull_request.head.repo.full_name }}\n      - name: Set up python ${{ matrix.python_version }}\n        uses: actions/setup-python@v4\n        with:\n          python-version: ${{ matrix.python_version }}\n          cache: 'pip' \n          cache-dependency-path: 'pyproject.toml'\n      - name: Build featuretools package\n        run: |\n          make package\n      - name: Install complete version of featuretools from sdist\n        run: |\n          python -m pip install \"unpacked_sdist/[complete]\"\n      - name: Test by importing packages\n        run: |\n          python -c \"import premium_primitives\"\n          python -c \"from nlp_primitives import PolarityScore\"\n      - name: Check package conflicts\n        run: |\n          python -m pip check\n      - name: Verify extra_requires commands\n        run: |\n          python -m pip install \"unpacked_sdist/[nlp]\"\n"
  },
  {
    "path": ".github/workflows/kickoff_evalml_unit_tests.yaml",
    "content": "name: Kickoff EvalML Unit Tests\n\non:\n  push:\n    branches:\n      - main\n  workflow_dispatch:\n\njobs:\n  kickoff:\n    name: Run EvalML unit tests\n    if: github.repository_owner == 'alteryx'\n    runs-on: ubuntu-latest\n    steps:\n      - name: Run workflow for EvalML unit tests\n        run: gh workflow run unit_tests_with_featuretools_main_branch.yaml --repo \"alteryx/evalml\"\n        env:\n          GITHUB_TOKEN: ${{ secrets.REPO_SCOPED_TOKEN }}\n"
  },
  {
    "path": ".github/workflows/latest_dependency_checker.yaml",
    "content": "# This workflow installs dependencies and, if any critical dependencies have changed, creates a pull request\n# that triggers a CI run with the new dependencies.\n\nname: Latest Dependency Checker\non:\n  schedule:\n    - cron: '0 * * * *'\n  workflow_dispatch:\njobs:\n  build:\n    if: ${{ github.repository_owner == 'alteryx' }}\n    runs-on: ubuntu-latest\n    timeout-minutes: 5\n    steps:\n      - name: Checkout repository\n        uses: actions/checkout@v3\n        with:\n          ref: ${{ github.event.pull_request.head.ref }}\n          repository: ${{ github.event.pull_request.head.repo.full_name }}\n      - uses: actions/setup-python@v4\n        with:\n          python-version: 3.9\n      - name: Update dependencies\n        run: |\n          python -m pip install --upgrade pip\n          python -m pip install -e \".[dask,test]\"\n          make checkdeps OUTPUT_PATH=featuretools/tests/requirement_files/latest_requirements.txt\n      - name: Create pull request\n        uses: peter-evans/create-pull-request@v3\n        with:\n          token: ${{ secrets.REPO_SCOPED_TOKEN }}\n          commit-message: Update latest dependencies\n          title: Automated Latest Dependency Updates\n          author: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>\n          body: \"This is an auto-generated PR with **latest** dependency updates.\n                Please do not delete the `latest-dep-update` branch because it's needed by the auto-dependency bot.\"\n          branch: latest-dep-update\n          branch-suffix: short-commit-hash\n          base: main\n          assignees: machineFL\n          reviewers: machineAYX\n"
  },
  {
    "path": ".github/workflows/lint_check.yaml",
    "content": "name: Lint Check\non:\n  pull_request:\n    types: [opened, synchronize]\n  push:\n    branches:\n      - main\njobs:\n  lint_check:\n    name: ${{ matrix.python_version }} lint check\n    runs-on: ubuntu-latest\n    strategy:\n      fail-fast: false\n      matrix:\n        python_version: [\"3.12\"]\n    steps:\n      - name: Checkout repository\n        uses: actions/checkout@v3\n        with:\n          ref: ${{ github.event.pull_request.head.ref }}\n          repository: ${{ github.event.pull_request.head.repo.full_name }}\n      - name: Set up python ${{ matrix.python_version }}\n        uses: actions/setup-python@v4\n        with:\n          python-version: ${{ matrix.python_version }}\n          cache: 'pip' \n          cache-dependency-path: 'pyproject.toml'\n      - uses: actions/cache@v3\n        id: cache\n        with:\n          path: ${{ env.pythonLocation }} \n          key: ${{ matrix.python_version }}-lint-${{ env.pythonLocation }}-${{ hashFiles('**/pyproject.toml') }}-v01\n      - name: Install featuretools with optional, dev, and test requirements (not using cache)\n        if: steps.cache.outputs.cache-hit != 'true'\n        run: |\n          python -m pip install -e .[dev]\n      - name: Install featuretools with no requirements (using cache)\n        if: steps.cache.outputs.cache-hit == 'true'\n        run: |\n          python -m pip install -e .[dev] --no-deps\n      - name: Run lint test\n        run: make lint\n"
  },
  {
    "path": ".github/workflows/minimum_dependency_checker.yaml",
    "content": "name: Minimum Dependency Checker\non:\n  workflow_dispatch:\n  push:\n    branches:\n      - main\n    paths:\n      - 'pyproject.toml'\njobs:\n  build:\n    runs-on: ubuntu-latest\n    steps:\n      - name: Checkout repository\n        uses: actions/checkout@v3\n        with:\n          ref: ${{ github.event.pull_request.head.ref }}\n          repository: ${{ github.event.pull_request.head.repo.full_name }}\n      - name: Run min dep generator - test reqs\n        id: min_dep_gen_test\n        uses: alteryx/minimum-dependency-generator@v3\n        with:\n          paths: 'pyproject.toml'\n          options: 'dependencies'\n          extras_require: 'test'\n          output_filepath: featuretools/tests/requirement_files/minimum_test_requirements.txt\n      - name: Run min dep generator - core reqs\n        id: min_dep_gen_core\n        uses: alteryx/minimum-dependency-generator@v3\n        with:\n          paths: 'pyproject.toml'\n          options: 'dependencies'\n          output_filepath: featuretools/tests/requirement_files/minimum_core_requirements.txt\n      - name: Run min dep generator - dask\n        id: min_dep_gen_dask\n        uses: alteryx/minimum-dependency-generator@v3\n        with:\n          paths: 'pyproject.toml'\n          options: 'dependencies'\n          extras_require: 'dask'\n          output_filepath: featuretools/tests/requirement_files/minimum_dask_requirements.txt\n      - name: Create Pull Request\n        uses: peter-evans/create-pull-request@v3\n        with:\n          token: ${{ secrets.REPO_SCOPED_TOKEN }}\n          commit-message: Update minimum dependencies\n          title: Automated Minimum Dependency Updates\n          author: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>\n          body: \"This is an auto-generated PR with **minimum** dependency updates.\n                 Please do not delete the `min-dep-update` branch because it's needed by the auto-dependency bot.\"\n        
  branch: min-dep-update\n          branch-suffix: short-commit-hash\n          base: main\n          assignees: machineFL\n          reviewers: machineAYX\n"
  },
  {
    "path": ".github/workflows/performance-check.yaml",
    "content": "name: performance-check\non:\n  push:\n    branches:\n      - main\n  workflow_dispatch:\njobs:\n  run-performance-analysis:\n    runs-on: ubuntu-latest\n    steps:\n      - name: Configure AWS Credentials\n        uses: aws-actions/configure-aws-credentials@v1\n        with:\n          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}\n          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}\n          aws-region: ${{ secrets.AWS_REGION }}\n      - name: Run Lambda\n        env:\n          lambda_function: ${{ secrets.LAMBDA_FUNC }}\n        run: |\n          echo \"{\\\"TestCommit\\\": \\\"$GITHUB_SHA\\\", \\\"Flags\\\": \\\"--upload-slack\\\"}\" | base64 > payload.b64\n          aws lambda invoke --function-name $lambda_function --payload file://payload.b64 --invocation-type Event /dev/stdout 1>/dev/null\n"
  },
  {
    "path": ".github/workflows/pull_request_check.yaml",
    "content": "name: Pull Request Check\non:\n  pull_request:\n    types: [opened, edited, reopened, synchronize]\njobs:\n  pull_request_check:\n    name: pull request check\n    runs-on: ubuntu-latest\n    steps:\n      - uses: nearform-actions/github-action-check-linked-issues@v1.4.5\n        id: check-linked-issues\n        with:\n          exclude-branches: \"release_v**, backport_v**, main, latest-dep-update-**, min-dep-update-**, dependabot/**\"          \n          github-token: ${{ secrets.REPO_SCOPED_TOKEN }}\n"
  },
  {
    "path": ".github/workflows/release.yaml",
    "content": "on:\n  release:\n    types: [published]\n\nname: Release\njobs:\n  pypi-publish:\n    name: PyPI Release\n    runs-on: ubuntu-latest\n    permissions:\n      id-token: write\n    steps:\n    - uses: actions/checkout@v4\n    - uses: actions/setup-python@v5\n    - name: Install deps\n      run: |\n        python -m pip install --quiet --upgrade pip\n        python -m pip install --quiet --upgrade build\n        python -m pip install --quiet --upgrade setuptools\n    - name: Remove build artifacts and docs\n      run: |\n        rm -rf .eggs/ dist/ build/ docs/\n    - name: Build distribution\n      run: python -m build\n\n    - name: Publish package distributions to PyPI\n      uses: pypa/gh-action-pypi-publish@release/v1\n    - name: Run workflow to create feedstock pull request\n      run: |\n        gh workflow run create_feedstock_pr.yaml --repo \"alteryx/featuretools\" -f version=${{ github.event.release.tag_name }}\n      env:\n        GITHUB_TOKEN: ${{ secrets.REPO_SCOPED_TOKEN }}\n"
  },
  {
    "path": ".github/workflows/release_notes_updated.yaml",
    "content": "name: Release Notes Updated\non:\n  pull_request:\n    types: [opened, synchronize]\njobs:\n  release_notes_updated:\n    name: release notes updated\n    runs-on: ubuntu-latest\n    steps:\n      - name: Check for development branch\n        id: branch\n        shell: python\n        env:\n          REF: ${{ github.event.pull_request.head.ref }}\n        run: |\n          from re import compile\n          import os\n          main = '^main$'\n          release = '^release_v\d+\.\d+\.\d+$'\n          backport = '^backport_v\d+\.\d+\.\d+$'\n          dep_update = '^latest-dep-update-[a-f0-9]{7}$'\n          min_dep_update = '^min-dep-update-[a-f0-9]{7}$'\n          regex = main, release, backport, dep_update, min_dep_update\n          patterns = list(map(compile, regex))\n          ref = os.environ[\"REF\"]\n          is_dev = not any(pattern.match(ref) for pattern in patterns)\n          # Write the step output to the GITHUB_OUTPUT file (replaces the deprecated set-output command)\n          with open(os.environ['GITHUB_OUTPUT'], 'a') as fh:\n              print('is_dev=' + str(is_dev), file=fh)\n      - if: ${{ steps.branch.outputs.is_dev == 'true' }}\n        name: Checkout repository\n        uses: actions/checkout@v3\n        with:\n          ref: ${{ github.event.pull_request.head.ref }}\n          repository: ${{ github.event.pull_request.head.repo.full_name }}\n      - if: ${{ steps.branch.outputs.is_dev == 'true' }}\n        name: Check if release notes were updated\n        run: cat docs/source/release_notes.rst | grep \":pr:\`${{ github.event.number }}\`\"\n        \n"
  },
  {
    "path": ".github/workflows/test_without_test_dependencies.yaml",
    "content": "name: Test without Test Dependencies\non:\n  pull_request:\n    types: [opened, synchronize]\n  push:\n    branches:\n      - main\n  workflow_dispatch:\njobs:\n  use_featuretools_without_test_dependencies:\n    name: Test featuretools without Test Dependencies\n    runs-on: ubuntu-latest\n    strategy:\n      fail-fast: false\n    steps:\n      - name: Set up python 3.10\n        uses: actions/setup-python@v4\n        with:\n          python-version: \"3.10\"\n      - name: Checkout repository\n        uses: actions/checkout@v3\n        with:\n          ref: ${{ github.event.pull_request.head.ref }}\n          repository: ${{ github.event.pull_request.head.repo.full_name }}\n      - name: Build featuretools and install\n        run: |\n          make package\n          python -m pip install unpacked_sdist/\n      - name: Run simple featuretools usage\n        run: |\n          import featuretools as ft\n          es = ft.demo.load_mock_customer(return_entityset=True)\n          ft.dfs(\n              entityset=es,\n              target_dataframe_name=\"customers\",\n              agg_primitives=[\"count\"],\n              trans_primitives=[\"month\"],\n              max_depth=1,\n          )\n          from featuretools.primitives import IsFreeEmailDomain\n          is_free_email_domain = IsFreeEmailDomain()\n          is_free_email_domain(['name@gmail.com', 'name@featuretools.com']).tolist()\n        shell: python\n"
  },
  {
    "path": ".github/workflows/tests_with_latest_deps.yaml",
    "content": "name: Tests\non:\n  pull_request:\n    types: [opened, synchronize]\n  push:\n    branches:\n      - main\n  workflow_dispatch:\njobs:\n  tests:\n    name: ${{ matrix.python_version }} unit tests\n    runs-on: ubuntu-latest\n    strategy:\n      fail-fast: false\n      matrix:\n        python_version: [\"3.9\", \"3.10\", \"3.11\", \"3.12\"]\n\n    steps:\n      - uses: actions/setup-python@v4\n        with:\n          python-version: ${{ matrix.python_version }}\n      - name: Checkout repository\n        uses: actions/checkout@v3\n        with:\n          ref: ${{ github.event.pull_request.head.ref }}\n          repository: ${{ github.event.pull_request.head.repo.full_name }}\n      - name: Build featuretools package\n        run: make package\n      - name: Set up pip and graphviz\n        run: |\n          pip config --site set global.progress_bar off\n          python -m pip install --upgrade pip\n          sudo apt update && sudo apt install -y graphviz\n      - name: Install featuretools with test requirements\n        run: |\n          python -m pip install -e unpacked_sdist/\n          python -m pip install -e unpacked_sdist/[test,dask]\n      - if: ${{ matrix.python_version == 3.9 }}\n        name: Generate coverage args\n        run: echo \"coverage_args=--cov=featuretools --cov-config=../pyproject.toml --cov-report=xml:../coverage.xml\" >> $GITHUB_ENV\n      - if: ${{ env.coverage_args }}\n        name: Erase coverage files\n        run: |\n          cd unpacked_sdist\n          coverage erase\n      - name: Run unit tests\n        run: |\n          cd unpacked_sdist\n          pytest featuretools/ -n auto ${{ env.coverage_args }}\n      - if: ${{ env.coverage_args }}\n        name: Upload coverage to Codecov\n        uses: codecov/codecov-action@v3\n        with:\n          token: ${{ secrets.CODECOV_TOKEN }}\n          fail_ci_if_error: true\n          files: ${{ github.workspace }}/coverage.xml\n          verbose: true\n\n\n  
win_unit_tests:\n    name: ${{ matrix.python_version }} windows unit tests\n    runs-on: windows-latest\n    strategy:\n      fail-fast: false\n      matrix:\n        python_version: [\"3.9\", \"3.10\", \"3.11\", \"3.12\"]\n    steps:\n      - name: Download miniconda\n        shell: pwsh\n        run: |\n          $File = \"Miniconda3-latest-Windows-x86_64.exe\"\n          $Uri = \"https://repo.anaconda.com/miniconda/$File\"\n          $ProgressPreference = \"silentlyContinue\"\n          Invoke-WebRequest -Uri $Uri -Outfile \"$env:USERPROFILE/$File\"\n          $hashFromFile = Get-FileHash \"$env:USERPROFILE/$File\" -Algorithm SHA256\n          $hashFromUrl = \"f4d6147b40ea6822255c2dcec8bb0d357c09e230976213f70d7b8c4a10d86bb0\"\n          if ($hashFromFile.Hash -ne \"$hashFromUrl\") {\n            Throw \"$File hashes do not match\"\n          }\n      - name: Install miniconda\n        shell: cmd\n        run: start /wait \"\" %UserProfile%\\Miniconda3-latest-Windows-x86_64.exe /InstallationType=JustMe /RegisterPython=0 /S /D=%UserProfile%\\Miniconda3\n      - name: Create python ${{ matrix.python_version }} environment\n        shell: pwsh\n        run: |\n          . $env:USERPROFILE\\Miniconda3\\shell\\condabin\\conda-hook.ps1\n          conda create -n featuretools python=${{ matrix.python_version }}\n      - name: Checkout repository\n        uses: actions/checkout@v3\n        with:\n          ref: ${{ github.event.pull_request.head.ref }}\n          repository: ${{ github.event.pull_request.head.repo.full_name }}\n      - name: Install featuretools with test requirements\n        shell: pwsh\n        run: |\n          . 
$env:USERPROFILE\\Miniconda3\\shell\\condabin\\conda-hook.ps1\n          conda activate featuretools\n          conda config --add channels conda-forge\n          conda install -q -y -c conda-forge python-graphviz graphviz\n          python -m pip install --upgrade pip\n          python -m pip install .[test,dask]\n      - name: Run unit tests\n        run: |\n          . $env:USERPROFILE\\Miniconda3\\shell\\condabin\\conda-hook.ps1\n          conda activate featuretools\n          pytest featuretools\\ -n auto\n"
  },
  {
    "path": ".github/workflows/tests_with_minimum_deps.yaml",
    "content": "name: Tests - Minimum Dependencies\non:\n  pull_request:\n    types: [opened, synchronize]\n  push:\n    branches:\n      - main\n  workflow_dispatch:\njobs:\n  py39_tests_minimum_dependencies:\n    name: Tests - 3.9 Minimum Dependencies\n    runs-on: ubuntu-latest\n    strategy:\n      fail-fast: false\n      matrix:\n        python_version: [\"3.9\"]\n    steps:\n      - name: Checkout repository\n        uses: actions/checkout@v3\n        with:\n          ref: ${{ github.event.pull_request.head.ref }}\n          repository: ${{ github.event.pull_request.head.repo.full_name }}\n      - uses: actions/setup-python@v4\n        with:\n          python-version: 3.9\n      - name: Config pip, upgrade pip, and install graphviz\n        run: |\n          sudo apt update\n          sudo apt install -y graphviz\n          pip config --site set global.progress_bar off\n          python -m pip install --upgrade pip\n          python -m pip install wheel\n      - name: Install featuretools with no dependencies\n        run: |\n          python -m pip install -e . --no-dependencies\n      - name: Install featuretools - minimum tests dependencies\n        run: |\n          python -m pip install -r featuretools/tests/requirement_files/minimum_test_requirements.txt\n      - name: Install featuretools - minimum core dependencies\n        run: |\n          python -m pip install -r featuretools/tests/requirement_files/minimum_core_requirements.txt\n      - name: Install featuretools - minimum Dask dependencies\n        run: |\n          python -m pip install -r featuretools/tests/requirement_files/minimum_dask_requirements.txt\n      - name: Run unit tests without code coverage\n        run: python -m pytest -x -n auto featuretools/tests/"
  },
  {
    "path": ".github/workflows/tests_with_woodwork_main_branch.yaml",
    "content": "name: Tests - Featuretools with Woodwork main branch\non:\n  workflow_dispatch:\njobs:\n  tests_woodwork_main:\n    if: ${{ github.repository_owner == 'alteryx' }}\n    name: ${{ matrix.python_version }} tests ${{ matrix.libraries }}\n    runs-on: ubuntu-latest\n    strategy:\n      fail-fast: true\n      matrix:\n        python_version: [\"3.9\", \"3.10\", \"3.11\", \"3.12\"]\n\n    steps:\n      - uses: actions/setup-python@v4\n        with:\n          python-version: ${{ matrix.python_version }}\n      - name: Checkout repository\n        uses: actions/checkout@v3\n      - name: Build featuretools package\n        run: make package\n      - name: Set up pip and graphviz\n        run: |\n          pip config --site set global.progress_bar off\n          python -m pip install -U pip\n          sudo apt update && sudo apt install -y graphviz\n      - name: Install Woodwork & Featuretools - test requirements\n        run: |\n          python -m pip install -e unpacked_sdist/[test,dask]\n          python -m pip uninstall -y woodwork\n          python -m pip install https://github.com/alteryx/woodwork/archive/main.zip\n      - name: Log test run info\n        run: |\n          echo \"Run unit tests without code coverage for ${{ matrix.python_version }}\"\n          echo \"Testing with woodwork version:\" `python -c \"import woodwork; print(woodwork.__version__)\"`\n      - name: Run unit tests without code coverage\n        run: pytest featuretools/ -n auto\n\n  slack_alert_failure:\n    name: Send Slack alert if failure\n    needs: tests_woodwork_main\n    runs-on: ubuntu-latest\n    if: ${{ always() }}\n    steps:\n      - name: Send Slack alert if failure\n        if: ${{ needs.tests_woodwork_main.result != 'success' }}\n        id: slack\n        uses: slackapi/slack-github-action@v1\n        with:\n          payload: |\n            {\n              \"url\": \"${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}\"\n   
         }\n        env:\n          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}\n"
  },
  {
    "path": ".gitignore",
    "content": "#\ndocs/source/generated/\ndocs/source/getting_started/graphs\nvenv/\ndata/\ninstalled/\noutput.csv\nhtmlcov/\n.idea/\nfeaturetools/tests/integration_data/*.csv\nfeaturetools/tests/integration_data/*.gzip\nfeaturetools/tests/integration_data/customers.gzip\nfeaturetools/tests/integration_data/log-0.gzip\nfeaturetools/tests/integration_data/log-1.gzip\nfeaturetools/tests/integration_data/log.gzip\nfeaturetools/tests/integration_data/products.gzip\nfeaturetools/tests/integration_data/regions.gzip\nfeaturetools/tests/integration_data/sessions.gzip\nfeaturetools/tests/integration_data/stores.gzip\n**/dask-worker-space/*\n*.dirlock\n*.~lock*\nunpacked_sdist/\n\n# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n**/.DS_Store\n.DS_Store\n\n# C extensions\n*.so\n\n# Distribution / packaging\n.Python\nenv/\nbuild/\ndevelop-eggs/\ndist/\ndownloads/\neggs/\n.eggs/\nlib/\nlib64/\nparts/\nsdist/\nvar/\nwheels/\n*.egg-info/\n.installed.cfg\n*.egg\n\n# PyInstaller\n#  Usually these files are written by a python script from a template\n#  before PyInstaller builds the exe, so as to inject date/other infos into it.\n*.manifest\n*.spec\n\n# Installer logs\npip-log.txt\npip-delete-this-directory.txt\n\n# Unit test / coverage reports\nhtmlcov/\n.tox/\n.coverage\n.coverage.*\n.cache\nnosetests.xml\ncoverage.xml\n*.cover\n.hypothesis/\n\n# Translations\n*.mo\n*.pot\n\n# Django stuff:\n*.log\nlocal_settings.py\n\n# Flask stuff:\ninstance/\n.webassets-cache\n\n# Scrapy stuff:\n.scrapy\n\n# Sphinx documentation\ndocs/_build/\n\n# PyBuilder\ntarget/\n\n# Jupyter Notebook\n.ipynb_checkpoints\n\n# pyenv\n.python-version\n\n# celery beat schedule file\ncelerybeat-schedule\n\n# SageMath parsed files\n*.sage.py\n\n# dotenv\n.env\n\n# virtualenv\n.venv\nvenv/\nENV/\n\n# Spyder project settings\n.spyderproject\n.spyproject\n\n# Rope project settings\n.ropeproject\n\n# mkdocs documentation\n/site\n\n# mypy\n.mypy_cache/\n\n# pickle 
files\n*.p\n*.pickle\n\n.pytest_cache\n\n#IDE\n.vscode\n.devcontainer\n\n*.stats\nDockerfile.arm\n.dockerignore\n"
  },
  {
    "path": ".pre-commit-config.yaml",
    "content": "exclude: |\n  (?x)\n  .html$|.csv$|.svg$|.md$|.txt$|.json$|.xml$|.pickle$|^.github/|\n  (LICENSE.*|README.*)\nrepos:\n  - repo: https://github.com/kynan/nbstripout\n    rev: 0.5.0\n    hooks:\n      - id: nbstripout\n        entry: nbstripout\n        language: python\n        types: [jupyter]\n  - repo: https://github.com/pre-commit/pre-commit-hooks\n    rev: v4.3.0\n    hooks:\n      - id: end-of-file-fixer\n      - id: trailing-whitespace\n  - repo: https://github.com/MarcoGorelli/absolufy-imports\n    rev: v0.3.1\n    hooks:\n      - id: absolufy-imports\n        files: ^featuretools/\n  - repo: https://github.com/asottile/add-trailing-comma\n    rev: v2.2.3\n    hooks:\n      - id: add-trailing-comma\n        name: Add trailing comma\n  - repo: https://github.com/charliermarsh/ruff-pre-commit\n    rev: 'v0.3.3'\n    hooks:\n      - id: ruff\n        types_or: [ python, pyi, jupyter ]\n        args:\n          - --fix\n          - --config=./pyproject.toml\n      - id: ruff-format\n        types_or: [ python, pyi, jupyter ]\n        args:\n          - --config=./pyproject.toml\n"
  },
  {
    "path": ".readthedocs.yaml",
    "content": "# .readthedocs.yaml\n# Read the Docs configuration file\n# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details\n\n# Required\nversion: 2\n\n# Build documentation in the docs/ directory with Sphinx\nsphinx:\n  configuration: docs/source/conf.py\n\n# Optionally build your docs in additional formats such as PDF and ePub\nformats: []\n\nbuild:\n  os: \"ubuntu-22.04\"\n  tools:\n    python: \"3.9\"\n  apt_packages:\n    - graphviz\n    - openjdk-11-jre-headless\n  jobs:\n    post_build:\n      - export JAVA_HOME=\"/usr/lib/jvm/java-11-openjdk-amd64\"\n\npython:\n  install:\n    - method: pip\n      path: .\n      extra_requirements:\n        - docs\n"
  },
  {
    "path": "LICENSE",
    "content": "BSD 3-Clause License\n\nCopyright (c) 2017, Feature Labs, Inc.\nAll rights reserved.\n\nRedistribution and use in source and binary forms, with or without\nmodification, are permitted provided that the following conditions are met:\n\n* Redistributions of source code must retain the above copyright notice, this\n  list of conditions and the following disclaimer.\n\n* Redistributions in binary form must reproduce the above copyright notice,\n  this list of conditions and the following disclaimer in the documentation\n  and/or other materials provided with the distribution.\n\n* Neither the name of the copyright holder nor the names of its\n  contributors may be used to endorse or promote products derived from\n  this software without specific prior written permission.\n\nTHIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS \"AS IS\"\nAND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE\nIMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE\nDISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE\nFOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL\nDAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR\nSERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER\nCAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,\nOR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE\nOF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.\n"
  },
  {
    "path": "Makefile",
    "content": ".PHONY: clean\nclean:\n\tfind . -name '*.pyo' -delete\n\tfind . -name '*.pyc' -delete\n\tfind . -name __pycache__ -delete\n\tfind . -name '*~' -delete\n\tfind . -name '.coverage.*' -delete\n\n.PHONY: lint\nlint:\n\tpython docs/notebook_version_standardizer.py check-execution\n\truff check . --config=./pyproject.toml\n\truff format . --check --config=./pyproject.toml\n\n.PHONY: lint-fix\nlint-fix:\n\tpython docs/notebook_version_standardizer.py standardize\n\truff check . --fix --config=./pyproject.toml\n\truff format . --config=./pyproject.toml\n\n.PHONY: test\ntest:\n\tpython -m pytest featuretools/ -n auto\n\n.PHONY: testcoverage\ntestcoverage:\n\tpython -m pytest featuretools/ --cov=featuretools -n auto\n\n.PHONY: installdeps\ninstalldeps: upgradepip\n\tpip install -e .\n\n.PHONY: installdeps-dev\ninstalldeps-dev: upgradepip\n\tpip install -e \".[dev]\"\n\tpre-commit install\n\n.PHONY: installdeps-test\ninstalldeps-test: upgradepip\n\tpip install -e \".[test]\"\n\n.PHONY: checkdeps\ncheckdeps:\n\t$(eval allow_list='holidays|scipy|numpy|pandas|tqdm|cloudpickle|distributed|dask|psutil|woodwork')\n\tpip freeze | grep -v \"alteryx/featuretools.git\" | grep -E $(allow_list) > $(OUTPUT_PATH)\n\n.PHONY: upgradepip\nupgradepip:\n\tpython -m pip install --upgrade pip\n\n.PHONY: upgradebuild\nupgradebuild:\n\tpython -m pip install --upgrade build\n\n.PHONY: upgradesetuptools\nupgradesetuptools:\n\tpython -m pip install --upgrade setuptools\n\n.PHONY: package\npackage: upgradepip upgradebuild upgradesetuptools\n\tpython -m build\n\t$(eval PACKAGE=$(shell python -c 'import setuptools; setuptools.setup()' --version))\n\ttar -zxvf \"dist/featuretools-${PACKAGE}.tar.gz\"\n\tmv \"featuretools-${PACKAGE}\" unpacked_sdist\n"
  },
  {
    "path": "README.md",
    "content": "<p align=\"center\">\n<img width=50% src=\"https://www.featuretools.com/wp-content/uploads/2017/12/FeatureLabs-Logo-Tangerine-800.png\" alt=\"Featuretools\" />\n</p>\n<p align=\"center\">\n<i>\"One of the holy grails of machine learning is to automate more and more of the feature engineering process.\"</i> ― Pedro Domingos, <a href=\"https://bit.ly/things_to_know_ml\">A Few Useful Things to Know about Machine Learning</a>\n</p>\n\n<p align=\"center\">\n    <a href=\"https://github.com/alteryx/featuretools/actions/workflows/tests_with_latest_deps.yaml\" alt=\"Tests\" target=\"_blank\">\n        <img src=\"https://github.com/alteryx/featuretools/actions/workflows/tests_with_latest_deps.yaml/badge.svg?branch=main\" alt=\"Tests\" />\n    </a>\n    <a href=\"https://codecov.io/gh/alteryx/featuretools\">\n        <img src=\"https://codecov.io/gh/alteryx/featuretools/branch/main/graph/badge.svg\"/>\n    </a>\n    <a href='https://featuretools.alteryx.com/en/stable/?badge=stable'>\n        <img src='https://readthedocs.com/projects/feature-labs-inc-featuretools/badge/?version=stable' alt='Documentation Status' />\n    </a>\n    <a href=\"https://badge.fury.io/py/featuretools\" target=\"_blank\">\n        <img src=\"https://badge.fury.io/py/featuretools.svg?maxAge=2592000\" alt=\"PyPI Version\" />\n    </a>\n    <a href=\"https://anaconda.org/conda-forge/featuretools\" target=\"_blank\">\n        <img src=\"https://anaconda.org/conda-forge/featuretools/badges/version.svg\" alt=\"Anaconda Version\" />\n    </a>\n    <a href=\"https://stackoverflow.com/questions/tagged/featuretools\" target=\"_blank\">\n        <img src=\"http://img.shields.io/badge/questions-on_stackoverflow-blue.svg\" alt=\"StackOverflow\" />\n    </a>\n    <a href=\"https://pepy.tech/project/featuretools\" target=\"_blank\">\n        <img src=\"https://static.pepy.tech/badge/featuretools/month\" alt=\"PyPI Downloads\" />\n    </a>\n</p>\n<hr>\n\n[Featuretools](https://www.featuretools.com) 
is a python library for automated feature engineering. See the [documentation](https://docs.featuretools.com) for more information.\n\n## Installation\nInstall with pip\n\n```\npython -m pip install featuretools\n```\n\nor from the Conda-forge channel on [conda](https://anaconda.org/conda-forge/featuretools):\n\n```\nconda install -c conda-forge featuretools\n```\n\n### Add-ons\n\nYou can install add-ons individually or all at once by running:\n\n```\npython -m pip install \"featuretools[complete]\"\n```\n\n**Premium Primitives** - Use Premium Primitives from the premium-primitives repo\n\n```\npython -m pip install \"featuretools[premium]\"\n```\n\n**NLP Primitives** - Use Natural Language Primitives from the nlp-primitives repo\n\n```\npython -m pip install \"featuretools[nlp]\"\n```\n\n**Dask Support** - Use Dask to run DFS with njobs > 1\n\n```\npython -m pip install \"featuretools[dask]\"\n```\n\n## Example\nBelow is an example of using Deep Feature Synthesis (DFS) to perform automated feature engineering. In this example, we apply DFS to a multi-table dataset consisting of timestamped customer transactions.\n\n```python\n>> import featuretools as ft\n>> es = ft.demo.load_mock_customer(return_entityset=True)\n>> es.plot()\n```\n\n<img src=\"https://github.com/alteryx/featuretools/blob/main/docs/source/_static/images/entity_set.png?raw=true\" width=\"350\">\n\nFeaturetools can automatically create a single table of features for any \"target dataframe\"\n```python\n>> feature_matrix, features_defs = ft.dfs(entityset=es, target_dataframe_name=\"customers\")\n>> feature_matrix.head(5)\n```\n\n```\n            zip_code  COUNT(transactions)  COUNT(sessions)  SUM(transactions.amount) MODE(sessions.device)  MIN(transactions.amount)  MAX(transactions.amount)  YEAR(join_date)  SKEW(transactions.amount)  DAY(join_date)                   ...                     
SUM(sessions.MIN(transactions.amount))  MAX(sessions.SKEW(transactions.amount))  MAX(sessions.MIN(transactions.amount))  SUM(sessions.MEAN(transactions.amount))  STD(sessions.SUM(transactions.amount))  STD(sessions.MEAN(transactions.amount))  SKEW(sessions.MEAN(transactions.amount))  STD(sessions.MAX(transactions.amount))  NUM_UNIQUE(sessions.DAY(session_start))  MIN(sessions.SKEW(transactions.amount))\ncustomer_id                                                                                                                                                                                                                                  ...\n1              60091                  131               10                  10236.77               desktop                      5.60                    149.95             2008                   0.070041               1                   ...                                                     169.77                                 0.610052                                   41.95                               791.976505                              175.939423                                 9.299023                                 -0.377150                                5.857976                                        1                                -0.395358\n2              02139                  122                8                   9118.81                mobile                      5.81                    149.15             2008                   0.028647              20                   ...                                                     
114.85                                 0.492531                                   42.96                               596.243506                              230.333502                                10.925037                                  0.962350                                7.420480                                        1                                -0.470007\n3              02139                   78                5                   5758.24               desktop                      6.78                    147.73             2008                   0.070814              10                   ...                                                      64.98                                 0.645728                                   21.77                               369.770121                              471.048551                                 9.819148                                 -0.244976                               12.537259                                        1                                -0.630425\n4              60091                  111                8                   8205.28               desktop                      5.73                    149.56             2008                   0.087986              30                   ...                                                      83.53                                 0.516262                                   17.27                               584.673126                              322.883448                                13.065436                                 -0.548969                               12.738488                                        1                                -0.497169\n5              02139                   58                4                   4571.37                tablet                      5.91                    148.17             2008                   0.085883              19                   ...                                                 
     73.09                                 0.830112                                   27.46                               313.448942                              198.522508                                 8.950528                                  0.098885                                5.599228                                        1                                -0.396571\n\n[5 rows x 69 columns]\n```\nWe now have a feature vector for each customer that can be used for machine learning. See the [documentation on Deep Feature Synthesis](https://featuretools.alteryx.com/en/stable/getting_started/afe.html) for more examples.\n\nFeaturetools contains many different types of built-in primitives for creating features. If the primitive you need is not included, Featuretools also allows you to [define your own custom primitives](https://featuretools.alteryx.com/en/stable/getting_started/primitives.html#defining-custom-primitives).\n\n## Demos\n**Predict Next Purchase**\n\n[Repository](https://github.com/alteryx/open_source_demos/blob/main/predict-next-purchase/) | [Notebook](https://github.com/alteryx/open_source_demos/blob/main/predict-next-purchase/Tutorial.ipynb)\n\nIn this demonstration, we use a multi-table dataset of 3 million online grocery orders from Instacart to predict what a customer will buy next. We show how to generate features with automated feature engineering and build an accurate machine learning pipeline using Featuretools, which can be reused for multiple prediction problems. For more advanced users, we show how to scale that pipeline to a large dataset using Dask.\n\nFor more examples of how to use Featuretools, check out our [demos](https://www.featuretools.com/demos) page.\n\n## Testing & Development\n\nThe Featuretools community welcomes pull requests. 
Instructions for testing and development are available [here](https://featuretools.alteryx.com/en/stable/install.html#development).\n\n## Support\nThe Featuretools community is happy to provide support to users of Featuretools. Project support can be found in four places depending on the type of question:\n\n1. For usage questions, use [Stack Overflow](https://stackoverflow.com/questions/tagged/featuretools) with the `featuretools` tag.\n2. For bugs, issues, or feature requests, start a [GitHub issue](https://github.com/alteryx/featuretools/issues).\n3. For discussion regarding development on the core library, use [Slack](https://join.slack.com/t/alteryx-oss/shared_invite/zt-182tyvuxv-NzIn6eiCEf8TBziuKp0bNA).\n4. For everything else, the core developers can be reached by email at open_source_support@alteryx.com\n\n## Citing Featuretools\n\nIf you use Featuretools, please consider citing the following paper:\n\nJames Max Kanter, Kalyan Veeramachaneni. [Deep feature synthesis: Towards automating data science endeavors.](https://dai.lids.mit.edu/wp-content/uploads/2017/10/DSAA_DSM_2015.pdf) *IEEE DSAA 2015*.\n\nBibTeX entry:\n\n```bibtex\n@inproceedings{kanter2015deep,\n  author    = {James Max Kanter and Kalyan Veeramachaneni},\n  title     = {Deep feature synthesis: Towards automating data science endeavors},\n  booktitle = {2015 {IEEE} International Conference on Data Science and Advanced Analytics, DSAA 2015, Paris, France, October 19-21, 2015},\n  pages     = {1--10},\n  year      = {2015},\n  organization={IEEE}\n}\n```\n\n## Built at Alteryx\n\n**Featuretools** is an open source project maintained by [Alteryx](https://www.alteryx.com). To see the other open source projects we’re working on, visit [Alteryx Open Source](https://www.alteryx.com/open-source). 
If building impactful data science pipelines is important to you or your business, please get in touch.\n\n<p align=\"center\">\n  <a href=\"https://www.alteryx.com/open-source\">\n    <img src=\"https://alteryx-oss-web-images.s3.amazonaws.com/OpenSource_Logo-01.png\" alt=\"Alteryx Open Source\" width=\"800\"/>\n  </a>\n</p>\n"
  },
  {
    "path": "contributing.md",
    "content": "# Contributing to Featuretools\n\n:+1::tada: First off, thank you for taking the time to contribute! :tada::+1:\n\nWhether you are a novice or experienced software developer, all contributions and suggestions are welcome!\n\nThere are many ways to contribute to Featuretools, with the most common ones being contribution of code or documentation to the project.\n\n**To contribute, you can:**\n1. Help users on our [Slack channel](https://join.slack.com/t/alteryx-oss/shared_invite/zt-182tyvuxv-NzIn6eiCEf8TBziuKp0bNA). Answer questions under the featuretools tag on [Stack Overflow](https://stackoverflow.com/questions/tagged/featuretools)\n\n2. Submit a pull request for one of [Good First Issues](https://github.com/alteryx/featuretools/issues?q=is%3Aopen+is%3Aissue+label%3A%22Good+First+Issue%22)\n\n3. Make changes to the codebase, see [Contributing to the codebase](#Contributing-to-the-Codebase).\n\n4. Improve our documentation, which can be found under the [docs](docs/) directory or at https://docs.featuretools.com\n\n5. [Report issues](#Report-issues) you're facing, and give a \"thumbs up\" on issues that others reported and that are relevant to you. Issues should be used for bugs, and feature requests only.\n\n6. Spread the word: reference Featuretools from your blog and articles, link to it from your website, or simply star it in GitHub to say \"I use it\".\n    * If you would like to be featured on [ecosystem page](https://featuretools.alteryx.com/en/stable/resources/ecosystem.html), you can submit a [pull request](https://github.com/alteryx/featuretools).\n\n## Contributing to the Codebase\n\nBefore starting major work, you should touch base with the maintainers of Featuretools by filing an issue on GitHub or posting a message in the [#development channel on Slack](https://join.slack.com/t/alteryx-oss/shared_invite/zt-182tyvuxv-NzIn6eiCEf8TBziuKp0bNA). This will increase the likelihood your pull request will eventually get merged in.\n\n#### 1. 
Fork and clone repo\n* The code is hosted on GitHub, so you will need to use Git to fork the project and make changes to the codebase. To start, go to the [Featuretools GitHub page](https://github.com/alteryx/featuretools) and click the `Fork` button.\n* After you have created the fork, you will want to clone the fork to your machine and connect your version of the project to the upstream Featuretools repo.\n  ```bash\n  git clone https://github.com/your-user-name/featuretools.git\n  cd featuretools\n  git remote add upstream https://github.com/alteryx/featuretools\n  ```\n* Once you have obtained a copy of the code, you should create a development environment that is separate from your existing Python environment so that you can make and test changes without compromising your own work environment. You can run the following steps to create a separate virtual environment and install Featuretools in editable mode.\n  ```bash\n  python -m venv venv\n  source venv/bin/activate\n  make installdeps\n  git checkout -b issue####-branch_name\n  ```\n\n* You will need to install Graphviz and Pandoc to run all unit tests and build the docs:\n\n  > Pandoc is only needed to build the documentation locally.\n\n     **macOS (Intel)** (use [Homebrew](https://brew.sh/)):\n     ```console\n     brew install graphviz pandoc\n     ```\n\n     **macOS (M1)** (use [Homebrew](https://brew.sh/)):\n     ```console\n     brew install graphviz pandoc\n     ```\n\n     **Ubuntu**:\n     ```console\n     sudo apt install graphviz pandoc -y\n     ```\n\n#### 2. Implement your Pull Request\n\n* Implement your pull request. 
If needed, add new tests or update the documentation.\n* Before submitting to GitHub, verify that the tests pass and the code lints properly\n  ```bash\n  # runs linting\n  make lint\n\n  # will fix some common linting issues automatically\n  make lint-fix\n\n  # runs tests\n  make test\n  ```\n* If you made changes to the documentation, build the documentation locally.\n  ```bash\n  # go to docs and build\n  cd docs\n  make html\n\n  # view docs locally\n  open build/html/index.html\n  ```\n* When you commit, a few lint-fixing hooks will run. You can also manually run these.\n  ```bash\n  # run linting hooks only on staged files\n  pre-commit run\n\n  # run linting hooks on all files\n  pre-commit run --all-files\n  ```\n\n#### 3. Submit your Pull Request\n\n* Once your changes are ready to be submitted, make sure to push your changes to GitHub before creating a pull request.\n* If you need to update your code with the latest changes from the main Featuretools repo, you can do that by running the commands below, which will merge the latest changes from the Featuretools `main` branch into your current local branch. You may need to resolve merge conflicts if there are conflicts between your changes and the upstream changes. After the merge, you will need to push the updates to your forked repo.\n  ```bash\n  git fetch upstream\n  git merge upstream/main\n  ```\n* Create a pull request to merge the changes from your forked repo branch into the Featuretools `main` branch. Creating the pull request will automatically run our continuous integration.\n* If this is your first contribution, you will need to sign the Contributor License Agreement as directed.\n* Update the \"Future Release\" section of the release notes (`docs/source/release_notes.rst`) to include your pull request and add your GitHub username to the list of contributors.  
Add a description of your PR to the subsection that most closely matches your contribution:\n    * Enhancements: new features or additions to Featuretools.\n    * Fixes: things like bugfixes or adding more descriptive error messages.\n    * Changes: modifications to an existing part of Featuretools.\n    * Documentation Changes\n    * Testing Changes\n\n   Documentation or testing changes rarely warrant an individual release notes entry; the PR number can be added to their respective \"Miscellaneous changes\" entries.\n* We will review your changes, and you will most likely be asked to make additional changes before your pull request is ready to merge. However, once it has been reviewed by a maintainer of Featuretools and passes continuous integration, we will merge it, and you will have successfully contributed to Featuretools!\n\n## Report issues\nWhen reporting issues, please include as much detail as possible about your operating system, Featuretools version and Python version. Whenever possible, please also include a brief, self-contained code example that demonstrates the problem.\n"
  },
  {
    "path": "docs/Makefile",
    "content": "# Makefile for Sphinx documentation\n#\n\n# You can set these variables from the command line.\nSPHINXOPTS    =\nSPHINXBUILD   = sphinx-build\nPAPER         =\nBUILDDIR      = build\nGENDIR        = source/generated\n\n# User-friendly check for sphinx-build\nifeq ($(shell which $(SPHINXBUILD) >/dev/null 2>&1; echo $$?), 1)\n\t$(error The '$(SPHINXBUILD)' command was not found. Make sure you have Sphinx installed, then set the SPHINXBUILD environment variable to point to the full path of the '$(SPHINXBUILD)' executable. Alternatively you can add the directory with the executable to your PATH. If you don\\'t have Sphinx installed, grab it from http://sphinx-doc.org/)\nendif\n\n# Internal variables.\nPAPEROPT_a4     = -D latex_paper_size=a4\nPAPEROPT_letter = -D latex_paper_size=letter\nALLSPHINXOPTS   = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source\n# the i18n builder cannot share the environment and doctrees with the others\nI18NSPHINXOPTS  = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source\n\n.PHONY: help\nhelp:\n\t@echo \"Please use \\`make <target>' where <target> is one of\"\n\t@echo \"  html       to make standalone HTML files\"\n\t@echo \"  dirhtml    to make HTML files named index.html in directories\"\n\t@echo \"  singlehtml to make a single large HTML file\"\n\t@echo \"  pickle     to make pickle files\"\n\t@echo \"  json       to make JSON files\"\n\t@echo \"  htmlhelp   to make HTML files and a HTML help project\"\n\t@echo \"  qthelp     to make HTML files and a qthelp project\"\n\t@echo \"  applehelp  to make an Apple Help Book\"\n\t@echo \"  devhelp    to make HTML files and a Devhelp project\"\n\t@echo \"  epub       to make an epub\"\n\t@echo \"  epub3      to make an epub3\"\n\t@echo \"  latex      to make LaTeX files, you can set PAPER=a4 or PAPER=letter\"\n\t@echo \"  latexpdf   to make LaTeX files and run them through pdflatex\"\n\t@echo \"  latexpdfja to make LaTeX files and run them through platex/dvipdfmx\"\n\t@echo 
\"  text       to make text files\"\n\t@echo \"  man        to make manual pages\"\n\t@echo \"  texinfo    to make Texinfo files\"\n\t@echo \"  info       to make Texinfo files and run them through makeinfo\"\n\t@echo \"  gettext    to make PO message catalogs\"\n\t@echo \"  changes    to make an overview of all changed/added/deprecated items\"\n\t@echo \"  xml        to make Docutils-native XML files\"\n\t@echo \"  pseudoxml  to make pseudoxml-XML files for display purposes\"\n\t@echo \"  linkcheck  to check all external links for integrity\"\n\t@echo \"  doctest    to run all doctests embedded in the documentation (if enabled)\"\n\t@echo \"  coverage   to run coverage check of the documentation (if enabled)\"\n\t@echo \"  dummy      to check syntax errors of document sources\"\n\n.PHONY: clean\nclean:\n\trm -rf $(BUILDDIR)/*\n\trm -rf $(GENDIR)/*\n\n.PHONY: html\nhtml:\n\t$(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html $(SPHINXOPTS)\n\t@echo\n\t@echo \"Build finished. The HTML pages are in $(BUILDDIR)/html.\"\n\n.PHONY: dirhtml\ndirhtml:\n\t$(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml\n\t@echo\n\t@echo \"Build finished. The HTML pages are in $(BUILDDIR)/dirhtml.\"\n\n.PHONY: singlehtml\nsinglehtml:\n\t$(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml\n\t@echo\n\t@echo \"Build finished. 
The HTML page is in $(BUILDDIR)/singlehtml.\"\n\n.PHONY: pickle\npickle:\n\t$(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle\n\t@echo\n\t@echo \"Build finished; now you can process the pickle files.\"\n\n.PHONY: json\njson:\n\t$(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json\n\t@echo\n\t@echo \"Build finished; now you can process the JSON files.\"\n\n.PHONY: htmlhelp\nhtmlhelp:\n\t$(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp\n\t@echo\n\t@echo \"Build finished; now you can run HTML Help Workshop with the\" \\\n\t      \".hhp project file in $(BUILDDIR)/htmlhelp.\"\n\n.PHONY: qthelp\nqthelp:\n\t$(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp\n\t@echo\n\t@echo \"Build finished; now you can run \"qcollectiongenerator\" with the\" \\\n\t      \".qhcp project file in $(BUILDDIR)/qthelp, like this:\"\n\t@echo \"# qcollectiongenerator $(BUILDDIR)/qthelp/featuretools.qhcp\"\n\t@echo \"To view the help file:\"\n\t@echo \"# assistant -collectionFile $(BUILDDIR)/qthelp/featuretools.qhc\"\n\n.PHONY: applehelp\napplehelp:\n\t$(SPHINXBUILD) -b applehelp $(ALLSPHINXOPTS) $(BUILDDIR)/applehelp\n\t@echo\n\t@echo \"Build finished. The help book is in $(BUILDDIR)/applehelp.\"\n\t@echo \"N.B. You won't be able to view it unless you put it in\" \\\n\t      \"~/Library/Documentation/Help or install it in your application\" \\\n\t      \"bundle.\"\n\n.PHONY: devhelp\ndevhelp:\n\t$(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp\n\t@echo\n\t@echo \"Build finished.\"\n\t@echo \"To view the help file:\"\n\t@echo \"# mkdir -p $$HOME/.local/share/devhelp/featuretools\"\n\t@echo \"# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/featuretools\"\n\t@echo \"# devhelp\"\n\n.PHONY: epub\nepub:\n\t$(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub\n\t@echo\n\t@echo \"Build finished. 
The epub file is in $(BUILDDIR)/epub.\"\n\n.PHONY: epub3\nepub3:\n\t$(SPHINXBUILD) -b epub3 $(ALLSPHINXOPTS) $(BUILDDIR)/epub3\n\t@echo\n\t@echo \"Build finished. The epub3 file is in $(BUILDDIR)/epub3.\"\n\n.PHONY: latex\nlatex:\n\t$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex\n\t@echo\n\t@echo \"Build finished; the LaTeX files are in $(BUILDDIR)/latex.\"\n\t@echo \"Run \\`make' in that directory to run these through (pdf)latex\" \\\n\t      \"(use \\`make latexpdf' here to do that automatically).\"\n\n.PHONY: latexpdf\nlatexpdf:\n\t$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex\n\t@echo \"Running LaTeX files through pdflatex...\"\n\t$(MAKE) -C $(BUILDDIR)/latex all-pdf\n\t@echo \"pdflatex finished; the PDF files are in $(BUILDDIR)/latex.\"\n\n.PHONY: latexpdfja\nlatexpdfja:\n\t$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex\n\t@echo \"Running LaTeX files through platex and dvipdfmx...\"\n\t$(MAKE) -C $(BUILDDIR)/latex all-pdf-ja\n\t@echo \"pdflatex finished; the PDF files are in $(BUILDDIR)/latex.\"\n\n.PHONY: text\ntext:\n\t$(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text\n\t@echo\n\t@echo \"Build finished. The text files are in $(BUILDDIR)/text.\"\n\n.PHONY: man\nman:\n\t$(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man\n\t@echo\n\t@echo \"Build finished. The manual pages are in $(BUILDDIR)/man.\"\n\n.PHONY: texinfo\ntexinfo:\n\t$(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo\n\t@echo\n\t@echo \"Build finished. 
The Texinfo files are in $(BUILDDIR)/texinfo.\"\n\t@echo \"Run \\`make' in that directory to run these through makeinfo\" \\\n\t      \"(use \\`make info' here to do that automatically).\"\n\n.PHONY: info\ninfo:\n\t$(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo\n\t@echo \"Running Texinfo files through makeinfo...\"\n\tmake -C $(BUILDDIR)/texinfo info\n\t@echo \"makeinfo finished; the Info files are in $(BUILDDIR)/texinfo.\"\n\n.PHONY: gettext\ngettext:\n\t$(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale\n\t@echo\n\t@echo \"Build finished. The message catalogs are in $(BUILDDIR)/locale.\"\n\n.PHONY: changes\nchanges:\n\t$(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes\n\t@echo\n\t@echo \"The overview file is in $(BUILDDIR)/changes.\"\n\n.PHONY: linkcheck\nlinkcheck:\n\t$(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck\n\t@echo\n\t@echo \"Link check complete; look for any errors in the above output \" \\\n\t      \"or in $(BUILDDIR)/linkcheck/output.txt.\"\n\n.PHONY: doctest\ndoctest:\n\t$(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest\n\t@echo \"Testing of doctests in the sources finished, look at the \" \\\n\t      \"results in $(BUILDDIR)/doctest/output.txt.\"\n\n.PHONY: coverage\ncoverage:\n\t$(SPHINXBUILD) -b coverage $(ALLSPHINXOPTS) $(BUILDDIR)/coverage\n\t@echo \"Testing of coverage in the sources finished, look at the \" \\\n\t      \"results in $(BUILDDIR)/coverage/python.txt.\"\n\n.PHONY: xml\nxml:\n\t$(SPHINXBUILD) -b xml $(ALLSPHINXOPTS) $(BUILDDIR)/xml\n\t@echo\n\t@echo \"Build finished. The XML files are in $(BUILDDIR)/xml.\"\n\n.PHONY: pseudoxml\npseudoxml:\n\t$(SPHINXBUILD) -b pseudoxml $(ALLSPHINXOPTS) $(BUILDDIR)/pseudoxml\n\t@echo\n\t@echo \"Build finished. The pseudo-XML files are in $(BUILDDIR)/pseudoxml.\"\n\n.PHONY: dummy\ndummy:\n\t$(SPHINXBUILD) -b dummy $(ALLSPHINXOPTS) $(BUILDDIR)/dummy\n\t@echo\n\t@echo \"Build finished. Dummy builder generates no files.\"\n"
  },
  {
    "path": "docs/backport_release.md",
    "content": "# Backport Release Process\n\nIn situations where we need to backport commits to earlier versions of our software, we'll need to perform the release process slightly differently than a normal release.\n\n<p align=\"center\">\n<img width=60% src=\"source/_static/images/backport_release.png\" alt=\"Backport Release\" />\n</p>\n\nThis document outlines the differences between a normal release and a backport release. It uses the same outline as the [Release Guide](../release.md).\n\n## 0. Pre-Release Checklist\n\nBefore starting the backport release process, verify the following:\n\n- Get agreement on the latest commit to use for targeting the release. A backport release will be targeted on some commit other than the latest on main. Many times the new target will be an old release, which will have a tag that can be referenced--for example `v0.11.1`.\n- Get agreement on the commits to port over for the backport release.\n- Get agreement on the version number to use for the backport release.\n\n#### Version Numbering for Backport Releases\n\nFeaturetools uses [semantic versioning](https://semver.org/). Every release has a major, minor and patch version number, and are displayed like so: `<majorVersion>.<minorVersion>.<patchVersion>`. **A backport release will increment the patch version.**\n\nThis may be an intermediate number between two preexisting releases--for example a new `0.11.2` to be added between existing `0.11.1` and `0.12.0` releases. It can also be a new latest release--so `0.12.1` in the same situation--using only some of the commits that are present in the Future Release section of the release notes.\n\n## 0.5. Create target branch for backport release\n\n#### Checkout intended target commit\n\n1. Checkout the agreed upon latest commit for targeting the release. If this is a previous release, you may checkout its tag with `git checkout v0.11.1`.\n\n#### Create backport branch\n\n1. Branch off of the target commit. 
For the branch name, please use the most recent major and minor versions as of this commit (in this example `0` and `11` respectively), leaving the patch number as an `x`. This means that we would create `0.11.x` in the working example. This is necessary so that if any further backport releases are needed, we can continue to use this branch as the target. This branch is to be treated as `main` is treated in a normal release. It will be the target for our release.\n\nThis branch will be automatically protected (unless the version exceeds 9.Y.x or X.99.x, in which case contact the repo team about expanding the protection rules) to prevent unintended commits from making their way into the release undetected.\n\n#### Port over desired commits\n\n1. Create a feature branch off the backport branch. For the branch name, please use \"backport_vX.Y.Z\" as the naming scheme (e.g. \"backport_v0.11.2\"). Doing so will bypass our release notes check-in test, which requires all other PRs to add a release note entry.\n2. Cherry-pick the desired commits onto `backport_v0.11.2`.\n3. Create a pull request with the backport `0.11.x` branch as its target, get confirmation that the desired changes were added, and confirm that the CI checks pass.\n4. Under the \"Future Release\" section in the release notes, include the ported-over commits' release notes (don't remove them from their original location back on `main`), indicating that they are a backport of the original PR.\n\n   ```\n   Future Release\n   ==============\n       * Enhancements\n       * Fixes\n           * Fix bug (backport of :pr:`1110`)\n       * Changes\n       * Documentation Changes\n       * Testing Changes\n\n   Thanks to the following people for contributing to this release:\n   ```\n\n5. Merge the PR into the `0.11.x` backport branch.\n\n## 1. Create Featuretools Backport release on GitHub\n\nWith our backport branch `0.11.x` as our target, we now proceed with the release of `0.11.2`.\n\n#### Create release branch\n\n1. 
**Branch off of the backport branch `0.11.x`.** For the branch name, please use \"release_vX.Y.Z\" as the naming scheme (e.g. \"release_v0.11.2\"). Doing so will bypass our release notes check-in test, which requires all other PRs to add a release note entry.\n\n#### Bump version number\n\n1. Bump `__version__` in `setup.py`, `featuretools/version.py`, and `featuretools/tests/test_version.py`.\n\n#### Update Release Notes\n\n1. Replace **\"Future Release\"** in `docs/source/release_notes.rst` with the version number and the current date\n\n   ```\n   v0.11.2 Sep 28, 2020\n   ====================\n   ```\n\n2. Remove any unused Release Notes sections for this release (e.g. Fixes, Testing Changes)\n3. Add yourself to the list of contributors to this release and **put the contributors in alphabetical order**\n4. The release PR does not need to be mentioned in the list of changes\n5. Add a commented-out \"Future Release\" section with all of the Release Notes sections above the current section\n\n   ```\n   .. Future Release\n     ==============\n       * Enhancements\n       * Fixes\n       * Changes\n       * Documentation Changes\n       * Testing Changes\n\n   .. Thanks to the following people for contributing to this release:\n   ```\n\n#### Create Release PR\n\nA [release PR](https://github.com/alteryx/featuretools/pull/1915) should have the version number as the title and the release notes for that release as the PR body text. The contributors list is not necessary. The special Sphinx docs syntax (:pr:\\`547\\`) needs to be changed to GitHub link syntax (#547).\n\nChecklist before merging:\n\n- All tests are currently green on check-in and on `0.11.x`.\n- The Read the Docs build for the release PR branch has passed, and the resulting docs contain the expected release notes.\n- PR has been reviewed and approved.\n- Confirm with the team that `0.11.x` will be frozen until step 2 (GitHub Release) is complete.\n\n## 2. 
Create GitHub Release\n\nAfter the release pull request has been merged into the `0.11.x` branch, it is time to draft the GitHub release. [Example release](https://github.com/alteryx/featuretools/releases/tag/v1.6.0)\n\n- **The target should be the `0.11.x` backport branch**\n- The tag should be the version number with a v prefix (e.g. v0.11.2)\n- Release title is the same as the tag\n- Release description should be the full Release Notes updates for the release, including the line thanking contributors. Contributors should also have their links changed from the docs syntax (:user:\\`gsheni\\`) to GitHub syntax (@gsheni)\n- This is not a pre-release\n- Publishing the release will automatically upload the package to PyPI\n\nNote that this backported release will show up on the repository's front page as the latest release even if there is technically a later `0.12.0` release.\n\n## Release on conda-forge\n\nIf a later release exists, conda-forge will not automatically create a new PR in [conda-forge/featuretools-feedstock](https://github.com/conda-forge/featuretools-feedstock/pulls). Instead, a PR will need to be manually created. You can do either of the following:\n\n- Branch off of the 0.11.1 meta.yaml update commit for the 0.11.2 meta.yaml changes. This is \"cleaner\" and sometimes easier, but if migration files (like py310) have been added between 0.11.1 and 0.12.0, you will have to add them in and re-render yourself.\n- Tack the 0.11.2 changes on after the 0.12.0 update commit in the feedstock repo. This means that if any of the boilerplate has changed, you do not have to manually re-add it yourself. An example of this can be seen from a Woodwork backport release [here](https://github.com/conda-forge/woodwork-feedstock/pull/32).\n\nOnce the PR is created:\n\n1. Update requirements changes in `recipe/meta.yaml` - you may need to handle the version, source links, and SHA256 if you had to open the PR yourself. You will also need to update the requirements.\n2. 
After tests pass, a maintainer will merge the PR.\n"
  },
  {
    "path": "docs/make.bat",
    "content": "@ECHO OFF\r\n\r\nREM Command file for Sphinx documentation\r\n\r\nif \"%SPHINXBUILD%\" == \"\" (\r\n\tset SPHINXBUILD=sphinx-build\r\n)\r\nset BUILDDIR=build\r\nset ALLSPHINXOPTS=-d %BUILDDIR%/doctrees %SPHINXOPTS% source\r\nset I18NSPHINXOPTS=%SPHINXOPTS% source\r\nif NOT \"%PAPER%\" == \"\" (\r\n\tset ALLSPHINXOPTS=-D latex_paper_size=%PAPER% %ALLSPHINXOPTS%\r\n\tset I18NSPHINXOPTS=-D latex_paper_size=%PAPER% %I18NSPHINXOPTS%\r\n)\r\n\r\nif \"%1\" == \"\" goto help\r\n\r\nif \"%1\" == \"help\" (\r\n\t:help\r\n\techo.Please use `make ^<target^>` where ^<target^> is one of\r\n\techo.  html       to make standalone HTML files\r\n\techo.  dirhtml    to make HTML files named index.html in directories\r\n\techo.  singlehtml to make a single large HTML file\r\n\techo.  pickle     to make pickle files\r\n\techo.  json       to make JSON files\r\n\techo.  htmlhelp   to make HTML files and a HTML help project\r\n\techo.  qthelp     to make HTML files and a qthelp project\r\n\techo.  devhelp    to make HTML files and a Devhelp project\r\n\techo.  epub       to make an epub\r\n\techo.  epub3      to make an epub3\r\n\techo.  latex      to make LaTeX files, you can set PAPER=a4 or PAPER=letter\r\n\techo.  text       to make text files\r\n\techo.  man        to make manual pages\r\n\techo.  texinfo    to make Texinfo files\r\n\techo.  gettext    to make PO message catalogs\r\n\techo.  changes    to make an overview over all changed/added/deprecated items\r\n\techo.  xml        to make Docutils-native XML files\r\n\techo.  pseudoxml  to make pseudoxml-XML files for display purposes\r\n\techo.  linkcheck  to check all external links for integrity\r\n\techo.  doctest    to run all doctests embedded in the documentation if enabled\r\n\techo.  coverage   to run coverage check of the documentation if enabled\r\n\techo.  
dummy      to check syntax errors of document sources\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"clean\" (\r\n\tfor /d %%i in (%BUILDDIR%\\*) do rmdir /q /s %%i\r\n\tdel /q /s %BUILDDIR%\\*\r\n\tgoto end\r\n)\r\n\r\n\r\nREM Check if sphinx-build is available and fallback to Python version if any\r\n%SPHINXBUILD% 1>NUL 2>NUL\r\nif errorlevel 9009 goto sphinx_python\r\ngoto sphinx_ok\r\n\r\n:sphinx_python\r\n\r\nset SPHINXBUILD=python -m sphinx.__init__\r\n%SPHINXBUILD% 2> nul\r\nif errorlevel 9009 (\r\n\techo.\r\n\techo.The 'sphinx-build' command was not found. Make sure you have Sphinx\r\n\techo.installed, then set the SPHINXBUILD environment variable to point\r\n\techo.to the full path of the 'sphinx-build' executable. Alternatively you\r\n\techo.may add the Sphinx directory to PATH.\r\n\techo.\r\n\techo.If you don't have Sphinx installed, grab it from\r\n\techo.http://sphinx-doc.org/\r\n\texit /b 1\r\n)\r\n\r\n:sphinx_ok\r\n\r\n\r\nif \"%1\" == \"html\" (\r\n\t%SPHINXBUILD% -b html %ALLSPHINXOPTS% %BUILDDIR%/html\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished. The HTML pages are in %BUILDDIR%/html.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"dirhtml\" (\r\n\t%SPHINXBUILD% -b dirhtml %ALLSPHINXOPTS% %BUILDDIR%/dirhtml\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished. The HTML pages are in %BUILDDIR%/dirhtml.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"singlehtml\" (\r\n\t%SPHINXBUILD% -b singlehtml %ALLSPHINXOPTS% %BUILDDIR%/singlehtml\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished. 
The HTML pages are in %BUILDDIR%/singlehtml.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"pickle\" (\r\n\t%SPHINXBUILD% -b pickle %ALLSPHINXOPTS% %BUILDDIR%/pickle\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished; now you can process the pickle files.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"json\" (\r\n\t%SPHINXBUILD% -b json %ALLSPHINXOPTS% %BUILDDIR%/json\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished; now you can process the JSON files.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"htmlhelp\" (\r\n\t%SPHINXBUILD% -b htmlhelp %ALLSPHINXOPTS% %BUILDDIR%/htmlhelp\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished; now you can run HTML Help Workshop with the ^\r\n.hhp project file in %BUILDDIR%/htmlhelp.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"qthelp\" (\r\n\t%SPHINXBUILD% -b qthelp %ALLSPHINXOPTS% %BUILDDIR%/qthelp\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished; now you can run \"qcollectiongenerator\" with the ^\r\n.qhcp project file in %BUILDDIR%/qthelp, like this:\r\n\techo.^> qcollectiongenerator %BUILDDIR%\\qthelp\\featuretools.qhcp\r\n\techo.To view the help file:\r\n\techo.^> assistant -collectionFile %BUILDDIR%\\qthelp\\featuretools.qhc\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"devhelp\" (\r\n\t%SPHINXBUILD% -b devhelp %ALLSPHINXOPTS% %BUILDDIR%/devhelp\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"epub\" (\r\n\t%SPHINXBUILD% -b epub %ALLSPHINXOPTS% %BUILDDIR%/epub\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished. The epub file is in %BUILDDIR%/epub.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"epub3\" (\r\n\t%SPHINXBUILD% -b epub3 %ALLSPHINXOPTS% %BUILDDIR%/epub3\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished. 
The epub3 file is in %BUILDDIR%/epub3.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"latex\" (\r\n\t%SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished; the LaTeX files are in %BUILDDIR%/latex.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"latexpdf\" (\r\n\t%SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex\r\n\tcd %BUILDDIR%/latex\r\n\tmake all-pdf\r\n\tcd %~dp0\r\n\techo.\r\n\techo.Build finished; the PDF files are in %BUILDDIR%/latex.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"latexpdfja\" (\r\n\t%SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex\r\n\tcd %BUILDDIR%/latex\r\n\tmake all-pdf-ja\r\n\tcd %~dp0\r\n\techo.\r\n\techo.Build finished; the PDF files are in %BUILDDIR%/latex.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"text\" (\r\n\t%SPHINXBUILD% -b text %ALLSPHINXOPTS% %BUILDDIR%/text\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished. The text files are in %BUILDDIR%/text.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"man\" (\r\n\t%SPHINXBUILD% -b man %ALLSPHINXOPTS% %BUILDDIR%/man\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished. The manual pages are in %BUILDDIR%/man.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"texinfo\" (\r\n\t%SPHINXBUILD% -b texinfo %ALLSPHINXOPTS% %BUILDDIR%/texinfo\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished. The Texinfo files are in %BUILDDIR%/texinfo.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"gettext\" (\r\n\t%SPHINXBUILD% -b gettext %I18NSPHINXOPTS% %BUILDDIR%/locale\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished. 
The message catalogs are in %BUILDDIR%/locale.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"changes\" (\r\n\t%SPHINXBUILD% -b changes %ALLSPHINXOPTS% %BUILDDIR%/changes\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.The overview file is in %BUILDDIR%/changes.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"linkcheck\" (\r\n\t%SPHINXBUILD% -b linkcheck %ALLSPHINXOPTS% %BUILDDIR%/linkcheck\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Link check complete; look for any errors in the above output ^\r\nor in %BUILDDIR%/linkcheck/output.txt.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"doctest\" (\r\n\t%SPHINXBUILD% -b doctest %ALLSPHINXOPTS% %BUILDDIR%/doctest\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Testing of doctests in the sources finished, look at the ^\r\nresults in %BUILDDIR%/doctest/output.txt.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"coverage\" (\r\n\t%SPHINXBUILD% -b coverage %ALLSPHINXOPTS% %BUILDDIR%/coverage\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Testing of coverage in the sources finished, look at the ^\r\nresults in %BUILDDIR%/coverage/python.txt.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"xml\" (\r\n\t%SPHINXBUILD% -b xml %ALLSPHINXOPTS% %BUILDDIR%/xml\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished. The XML files are in %BUILDDIR%/xml.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"pseudoxml\" (\r\n\t%SPHINXBUILD% -b pseudoxml %ALLSPHINXOPTS% %BUILDDIR%/pseudoxml\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished. The pseudo-XML files are in %BUILDDIR%/pseudoxml.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"dummy\" (\r\n\t%SPHINXBUILD% -b dummy %ALLSPHINXOPTS% %BUILDDIR%/dummy\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished. Dummy builder generates no files.\r\n\tgoto end\r\n)\r\n\r\n:end\r\n"
  },
  {
    "path": "docs/notebook_version_standardizer.py",
    "content": "import json\nimport os\n\nimport click\n\nDOCS_PATH = os.path.join(os.path.dirname(os.path.abspath(__file__)), \"source\")\n\n\ndef _get_ipython_notebooks(docs_source):\n    directories_to_skip = [\"_templates\", \"generated\", \".ipynb_checkpoints\"]\n    notebooks = []\n    for root, _, filenames in os.walk(docs_source):\n        if any(dir_ in root for dir_ in directories_to_skip):\n            continue\n        for filename in filenames:\n            if filename.endswith(\".ipynb\"):\n                notebooks.append(os.path.join(root, filename))\n    return notebooks\n\n\ndef _check_delete_empty_cell(notebook, delete=True):\n    with open(notebook, \"r\") as f:\n        source = json.load(f)\n    cell = source[\"cells\"][-1]\n    if cell[\"cell_type\"] == \"code\" and cell[\"source\"] == []:\n        # this is an empty cell, which we should delete\n        if delete:\n            source[\"cells\"] = source[\"cells\"][:-1]\n        else:\n            return False\n    if delete:\n        with open(notebook, \"w\") as f:\n            json.dump(source, f, ensure_ascii=False, indent=1)\n    else:\n        return True\n\n\ndef _check_execution_and_output(notebook):\n    with open(notebook, \"r\") as f:\n        source = json.load(f)\n    for cells in source[\"cells\"]:\n        if cells[\"cell_type\"] == \"code\" and (\n            cells[\"execution_count\"] is not None or cells[\"outputs\"] != []\n        ):\n            return False\n    return True\n\n\ndef _check_python_version(notebook, default_version):\n    with open(notebook, \"r\") as f:\n        source = json.load(f)\n    if source[\"metadata\"][\"language_info\"][\"version\"] != default_version:\n        return False\n    return True\n\n\ndef _fix_python_version(notebook, default_version):\n    with open(notebook, \"r\") as f:\n        source = json.load(f)\n    source[\"metadata\"][\"language_info\"][\"version\"] = default_version\n    with open(notebook, \"w\") as f:\n        
json.dump(source, f, ensure_ascii=False, indent=1)\n\n\ndef _fix_execution_and_output(notebook):\n    with open(notebook, \"r\") as f:\n        source = json.load(f)\n    for cells in source[\"cells\"]:\n        if cells[\"cell_type\"] == \"code\" and cells[\"execution_count\"] is not None:\n            cells[\"execution_count\"] = None\n            cells[\"outputs\"] = []\n    source[\"metadata\"][\"kernelspec\"][\"display_name\"] = \"Python 3\"\n    source[\"metadata\"][\"kernelspec\"][\"name\"] = \"python3\"\n    with open(notebook, \"w\") as f:\n        json.dump(source, f, ensure_ascii=False, indent=1)\n\n\ndef _get_notebooks_with_executions_and_empty(notebooks, default_version=\"3.9.2\"):\n    executed = []\n    empty_last_cell = []\n    versions = []\n    for notebook in notebooks:\n        if not _check_execution_and_output(notebook):\n            executed.append(notebook)\n        if not _check_delete_empty_cell(notebook, delete=False):\n            empty_last_cell.append(notebook)\n        if not _check_python_version(notebook, default_version):\n            versions.append(notebook)\n    return (executed, empty_last_cell, versions)\n\n\ndef _fix_versions(notebooks, default_version=\"3.9.2\"):\n    for notebook in notebooks:\n        _fix_python_version(notebook, default_version)\n\n\ndef _remove_notebook_empty_last_cell(notebooks):\n    for notebook in notebooks:\n        _check_delete_empty_cell(notebook, delete=True)\n\n\ndef _standardize_outputs(notebooks):\n    for notebook in notebooks:\n        _fix_execution_and_output(notebook)\n\n\n@click.group()\ndef cli():\n    \"\"\"no-op\"\"\"\n\n\n@cli.command()\ndef standardize():\n    notebooks = _get_ipython_notebooks(DOCS_PATH)\n    (\n        executed_notebooks,\n        empty_cells,\n        versions,\n    ) = _get_notebooks_with_executions_and_empty(notebooks)\n    if executed_notebooks:\n        _standardize_outputs(executed_notebooks)\n        executed_notebooks = [\"\\t\" + notebook for notebook 
in executed_notebooks]\n        executed_notebooks = \"\\n\".join(executed_notebooks)\n        click.echo(f\"Removed the outputs for:\\n {executed_notebooks}\")\n    if empty_cells:\n        _remove_notebook_empty_last_cell(empty_cells)\n        empty_cells = [\"\\t\" + notebook for notebook in empty_cells]\n        empty_cells = \"\\n\".join(empty_cells)\n        click.echo(f\"Removed the empty cells for:\\n {empty_cells}\")\n    if versions:\n        _fix_versions(versions)\n        versions = [\"\\t\" + notebook for notebook in versions]\n        versions = \"\\n\".join(versions)\n        click.echo(f\"Fixed python versions for:\\n {versions}\")\n\n\n@cli.command()\ndef check_execution():\n    notebooks = _get_ipython_notebooks(DOCS_PATH)\n    (\n        executed_notebooks,\n        empty_cells,\n        versions,\n    ) = _get_notebooks_with_executions_and_empty(notebooks)\n    if executed_notebooks:\n        executed_notebooks = [\"\\t\" + notebook for notebook in executed_notebooks]\n        executed_notebooks = \"\\n\".join(executed_notebooks)\n        raise SystemExit(\n            f\"The following notebooks have executed outputs:\\n {executed_notebooks}\\n\"\n            \"Please run make lint-fix to fix this.\",\n        )\n    if empty_cells:\n        empty_cells = [\"\\t\" + notebook for notebook in empty_cells]\n        empty_cells = \"\\n\".join(empty_cells)\n        raise SystemExit(\n            f\"The following notebooks have empty cells at the end:\\n {empty_cells}\\n\"\n            \"Please run make lint-fix to fix this.\",\n        )\n    if versions:\n        versions = [\"\\t\" + notebook for notebook in versions]\n        versions = \"\\n\".join(versions)\n        raise SystemExit(\n            f\"The following notebooks have the wrong Python version: \\n {versions}\\n\"\n            \"Please run make lint-fix to fix this.\",\n        )\n\n\nif __name__ == \"__main__\":\n    cli()\n"
  },
  {
    "path": "docs/pull_request_template.md",
    "content": "### Pull Request Description\n(replace this text with your description)\n\n-----\n*After creating the pull request: in order to pass the **release_notes_updated** check you will need to update the \"Future Release\" section of* `docs/source/release_notes.rst` *to include this pull request.*\n"
  },
  {
    "path": "docs/source/_static/style.css",
    "content": ".footer {\n    background-color: #0D2345;\n    padding-bottom: 40px;\n    padding-top: 40px;\n    width: 100%;\n}\n\n.footer-cell-1 {\n    grid-row: 1;\n    grid-column: 1 / 3;\n}\n\n.footer-cell-2 {\n    grid-row: 1;\n    grid-column: 4;\n    margin-bottom: 15px;\n    text-align: right;\n}\n\n.footer-cell-3 {\n    grid-row: 2;\n    grid-column: 1 / 5;\n}\n\n.footer-cell-4 {\n    grid-row: 3;\n    grid-column: 1 / 3;\n}\n\n.footer-container {\n    display: grid;\n    margin-left: 10%;\n    margin-right: 10%;\n}\n\n.footer-image-alteryx {\n    padding-top: 22px;\n    width: 270px;\n}\n\n.footer-image-copyright {\n    width: 180px;\n}\n\n.footer-image-github {\n    width: 50px;\n}\n\n.footer-image-twitter {\n    width: 60px;\n}\n\n.footer-line {\n    border-top: 2px solid white;\n    margin-left: 7px;\n    margin-right: 15px;\n}\n"
  },
  {
    "path": "docs/source/api_reference.rst",
    "content": ".. _api_ref:\n\nAPI Reference\n=============\n\n.. currentmodule:: featuretools\n\nDemo Datasets\n~~~~~~~~~~~~~\n.. currentmodule:: featuretools.demo\n\n\n.. autosummary::\n    :toctree: generated/\n\n    load_retail\n    load_mock_customer\n    load_flight\n    load_weather\n\nDeep Feature Synthesis\n~~~~~~~~~~~~~~~~~~~~~~\n.. currentmodule:: featuretools\n\n.. autosummary::\n    :toctree: generated/\n\n    dfs\n    get_valid_primitives\n\nTimedelta\n~~~~~~~~~\n.. currentmodule:: featuretools\n\n.. autosummary::\n    :toctree: generated/\n\n    Timedelta\n\nTime utils\n~~~~~~~~~~\n.. currentmodule:: featuretools\n\n.. autosummary::\n    :toctree: generated/\n\n    make_temporal_cutoffs\n\n\nFeature Primitives\n~~~~~~~~~~~~~~~~~~\n\nPrimitive Types\n---------------\n.. currentmodule:: featuretools.primitives\n\n.. autosummary::\n    :toctree: generated/\n\n    TransformPrimitive\n    AggregationPrimitive\n\n\n.. _api_ref.aggregation_features:\n\nAggregation Primitives\n----------------------\n.. 
autosummary::\n    :toctree: generated/\n\n    All\n    Any\n    AverageCountPerUnique\n    AvgTimeBetween\n    Count\n    CountAboveMean\n    CountBelowMean\n    CountGreaterThan\n    CountInsideNthSTD\n    CountInsideRange\n    CountLessThan\n    CountOutsideNthSTD\n    CountOutsideRange\n    DateFirstEvent\n    Entropy\n    First\n    FirstLastTimeDelta\n    HasNoDuplicates\n    IsMonotonicallyDecreasing\n    IsMonotonicallyIncreasing\n    IsUnique\n    Kurtosis\n    Last\n    Max\n    MaxConsecutiveFalse\n    MaxConsecutiveNegatives\n    MaxConsecutivePositives\n    MaxConsecutiveTrue\n    MaxConsecutiveZeros\n    MaxCount\n    MaxMinDelta\n    Mean\n    Median\n    MedianCount\n    Min\n    MinCount\n    Mode\n    NMostCommon\n    NMostCommonFrequency\n    NUniqueDays\n    NUniqueDaysOfCalendarYear\n    NUniqueMonths\n    NUniqueWeeks\n    NumConsecutiveGreaterMean\n    NumConsecutiveLessMean\n    NumFalseSinceLastTrue\n    NumPeaks\n    NumTrue\n    NumTrueSinceLastFalse\n    NumUnique\n    NumZeroCrossings\n    PercentTrue\n    PercentUnique\n    Skew\n    Std\n    Sum\n    TimeSinceFirst\n    TimeSinceLast\n    TimeSinceLastFalse\n    TimeSinceLastMax\n    TimeSinceLastMin\n    TimeSinceLastTrue\n    Trend\n    Variance\n\nTransform Primitives\n--------------------\nBinary Transform Primitives\n***************************\n.. 
autosummary::\n    :toctree: generated/\n\n    AddNumeric\n    AddNumericScalar\n    DivideByFeature\n    DivideNumeric\n    DivideNumericScalar\n    Equal\n    EqualScalar\n    GreaterThan\n    GreaterThanEqualTo\n    GreaterThanEqualToScalar\n    GreaterThanScalar\n    LessThan\n    LessThanEqualTo\n    LessThanEqualToScalar\n    LessThanScalar\n    ModuloByFeature\n    ModuloNumeric\n    ModuloNumericScalar\n    MultiplyBoolean\n    MultiplyNumeric\n    MultiplyNumericBoolean\n    MultiplyNumericScalar\n    NotEqual\n    NotEqualScalar\n    ScalarSubtractNumericFeature\n    SubtractNumeric\n    SubtractNumericScalar\n\n\nCombine features\n****************\n.. autosummary::\n    :toctree: generated/\n\n    IsIn\n    And\n    Or\n    Not\n\n\n.. _api_ref.cumulative_features:\n\nCumulative Transform Primitives\n*******************************\n.. autosummary::\n    :toctree: generated/\n\n    Diff\n    DiffDatetime\n    TimeSincePrevious\n    CumCount\n    CumSum\n    CumMean\n    CumMin\n    CumMax\n    CumulativeTimeSinceLastFalse\n    CumulativeTimeSinceLastTrue\n\n\nDatetime Transform Primitives\n*****************************\n.. autosummary::\n    :toctree: generated/\n\n    Age\n    DateToHoliday\n    DateToTimeZone\n    Day\n    DayOfYear\n    DaysInMonth\n    DistanceToHoliday\n    Hour\n    IsFederalHoliday\n    IsFirstWeekOfMonth\n    IsLeapYear\n    IsLunchTime\n    IsMonthEnd\n    IsMonthStart\n    IsQuarterEnd\n    IsQuarterStart\n    IsWeekend\n    IsWorkingHours\n    IsYearEnd\n    IsYearStart\n    Minute\n    Month\n    NthWeekOfMonth\n    PartOfDay\n    Quarter\n    Season\n    Second\n    TimeSince\n    Week\n    Weekday\n    Year\n\n\nEmail, URL and File Transform Primitives\n****************************************\n.. 
autosummary::\n    :toctree: generated/\n\n    EmailAddressToDomain\n    FileExtension\n    IsFreeEmailDomain\n    URLToDomain\n    URLToProtocol\n    URLToTLD\n\n\nExponential Transform Primitives\n********************************\n.. autosummary::\n    :toctree: generated/\n\n    ExponentialWeightedAverage\n    ExponentialWeightedSTD\n    ExponentialWeightedVariance\n\n\nGeneral Transform Primitives\n****************************\n.. autosummary::\n    :toctree: generated/\n\n    AbsoluteDiff\n    Absolute\n    Cosine\n    IsNull\n    NaturalLogarithm\n    Negate\n    Percentile\n    PercentChange\n    RateOfChange\n    SameAsPrevious\n    SavgolFilter\n    Sine\n    SquareRoot\n    Tangent\n    Variance\n\nLocation Transform Primitives\n*****************************\n.. autosummary::\n   :toctree: generated/\n\n    CityblockDistance\n    GeoMidpoint\n    Haversine\n    IsInGeoBox\n    Latitude\n    Longitude\n\nName Transform Primitives\n*************************\n.. autosummary::\n   :toctree: generated/\n\n    FullNameToFirstName\n    FullNameToLastName\n    FullNameToTitle\n\nNaturalLanguage Transform Primitives\n************************************\n.. autosummary::\n   :toctree: generated/\n\n   CountString\n   MeanCharactersPerWord\n   MedianWordLength\n   NumCharacters\n   NumUniqueSeparators\n   NumWords\n   NumberOfCommonWords\n   NumberOfHashtags\n   NumberOfMentions\n   NumberOfUniqueWords\n   NumberOfWordsInQuotes\n   PunctuationCount\n   TitleWordCount\n   TotalWordLength\n   UpperCaseCount\n   UpperCaseWordCount\n   WhitespaceCount\n\nPostal Code Primitives\n**********************\n.. autosummary::\n    :toctree: generated/\n\n    OneDigitPostalCode\n    TwoDigitPostalCode\n\nTime Series Transform Primitives\n********************************\n.. 
autosummary::\n    :toctree: generated/\n\n    ExpandingCount\n    ExpandingMax\n    ExpandingMean\n    ExpandingMin\n    ExpandingSTD\n    ExpandingTrend\n    Lag\n    RollingCount\n    RollingMax\n    RollingMean\n    RollingMin\n    RollingOutlierCount\n    RollingSTD\n    RollingTrend\n\n\nFeature methods\n---------------\n.. currentmodule:: featuretools.feature_base\n.. autosummary::\n    :toctree: generated/\n\n    FeatureBase.rename\n    FeatureBase.get_depth\n\n\nFeature calculation\n~~~~~~~~~~~~~~~~~~~~\n.. currentmodule:: featuretools\n.. autosummary::\n    :toctree: generated/\n\n    calculate_feature_matrix\n    .. approximate_features\n\nFeature descriptions\n~~~~~~~~~~~~~~~~~~~~~\n.. currentmodule:: featuretools\n.. autosummary::\n    :toctree: generated/\n\n    describe_feature\n\nFeature visualization\n~~~~~~~~~~~~~~~~~~~~~~\n.. currentmodule:: featuretools\n.. autosummary::\n    :toctree: generated/\n\n    graph_feature\n\nFeature encoding\n~~~~~~~~~~~~~~~~~\n.. currentmodule:: featuretools\n.. autosummary::\n    :toctree: generated/\n\n    encode_features\n\nFeature Selection\n~~~~~~~~~~~~~~~~~\n.. currentmodule:: featuretools.selection\n.. autosummary::\n    :toctree: generated/\n\n    remove_low_information_features\n    remove_highly_correlated_features\n    remove_highly_null_features\n    remove_single_value_features\n\nFeature Matrix utils\n~~~~~~~~~~~~~~~~~~~~\n.. currentmodule:: featuretools.computational_backends\n.. autosummary::\n    :toctree: generated/\n\n    replace_inf_values\n\n\nSaving and Loading Features\n~~~~~~~~~~~~~~~~~~~~~~~~~~~\n.. currentmodule:: featuretools\n.. autosummary::\n    :toctree: generated/\n\n    save_features\n    load_features\n\n.. _api_ref.dataset:\n\nEntitySet, Relationship\n~~~~~~~~~~~~~~~~~~~~~~~\n\nConstructors\n------------\n.. currentmodule:: featuretools\n.. 
autosummary::\n    :toctree: generated/\n\n    EntitySet\n    Relationship\n\nEntitySet load and prepare data\n-------------------------------\n.. autosummary::\n    :toctree: generated/\n\n    EntitySet.add_dataframe\n    EntitySet.add_interesting_values\n    EntitySet.add_last_time_indexes\n    EntitySet.add_relationship\n    EntitySet.add_relationships\n    EntitySet.concat\n    EntitySet.normalize_dataframe\n    EntitySet.set_secondary_time_index\n    EntitySet.replace_dataframe\n\nEntitySet serialization\n-------------------------------\n.. currentmodule:: featuretools\n.. autosummary::\n    :toctree: generated/\n\n    read_entityset\n\n.. currentmodule:: featuretools.entityset\n.. autosummary::\n    :toctree: generated/\n\n    EntitySet.to_csv\n    EntitySet.to_pickle\n    EntitySet.to_parquet\n\nEntitySet query methods\n-----------------------\n.. autosummary::\n    :toctree: generated/\n\n    EntitySet.__getitem__\n    EntitySet.find_backward_paths\n    EntitySet.find_forward_paths\n    EntitySet.get_forward_dataframes\n    EntitySet.get_backward_dataframes\n    EntitySet.query_by_values\n\nEntitySet visualization\n-----------------------\n.. autosummary::\n    :toctree: generated/\n\n    EntitySet.plot\n\nRelationship attributes\n-----------------------\n.. autosummary::\n    :toctree: generated/\n\n    Relationship.parent_column\n    Relationship.child_column\n    Relationship.parent_dataframe\n    Relationship.child_dataframe\n\nData Type Util Methods\n----------------------\n.. currentmodule:: featuretools\n.. autosummary::\n    :toctree: generated/\n\n    list_logical_types\n    list_semantic_tags\n\nPrimitive Util Methods\n----------------------\n.. currentmodule:: featuretools\n.. autosummary::\n    :toctree: generated/\n\n    get_recommended_primitives\n    list_primitives\n    summarize_primitives\n"
  },
  {
    "path": "docs/source/conf.py",
    "content": "# -*- coding: utf-8 -*-\n#\n# featuretools documentation build configuration file, created by\n# sphinx-quickstart on Thu May 19 20:40:30 2016.\n#\n# This file is execfile()d with the current directory set to its\n# containing dir.\n#\n# Note that not all possible configuration values are present in this\n# autogenerated file.\n#\n# All configuration values have a default; values that are commented out\n# serve to show the default.\n\nimport os\nimport shutil\nimport subprocess\nimport sys\nfrom pathlib import Path\n\nimport featuretools\n\n# run setup script\npath = os.path.join(os.path.dirname(os.path.abspath(__file__)), \"setup.py\")\nsubprocess.check_call([sys.executable, path])\n\n# If extensions (or modules to document with autodoc) are in another directory,\n# add these directories to sys.path here. If the directory is relative to the\n# documentation root, use os.path.abspath to make it absolute, like shown here.\nsys.path.insert(0, os.path.abspath(\"../featuretools\"))\n\n# -- General configuration ------------------------------------------------\n\n# If your documentation needs a minimal Sphinx version, state it here.\n# needs_sphinx = '1.0'\n\n# Add any Sphinx extension module names here, as strings. 
They can be\n# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom\n# ones.\nextensions = [\n    \"sphinx.ext.autodoc\",\n    \"sphinx.ext.autosummary\",\n    \"sphinx.ext.napoleon\",\n    \"sphinx.ext.ifconfig\",\n    \"sphinx.ext.githubpages\",\n    \"nbsphinx\",\n    \"IPython.sphinxext.ipython_console_highlighting\",\n    \"IPython.sphinxext.ipython_directive\",\n    \"sphinx.ext.extlinks\",\n    \"sphinx.ext.viewcode\",\n    \"sphinx.ext.graphviz\",\n    \"sphinx_inline_tabs\",\n    \"sphinx_copybutton\",\n    \"myst_parser\",\n]\n\n\n# ipython_mplbackend = None\n\nipython_execlines = [\"import pandas as pd\", \"pd.set_option('display.width', 1000000)\"]\n\n# autosummary_generate=True\nautosummary_generate = [\"api_reference.rst\"]\n\n\n# Add any paths that contain templates here, relative to this directory.\ntemplates_path = [\"templates\"]\n\n# The suffix(es) of source filenames.\n# You can specify multiple suffix as a list of string:\n# source_suffix = ['.rst', '.md']\n\n# The encoding of source files.\n# source_encoding = 'utf-8-sig'\n\n# The master toctree document.\nmaster_doc = \"index\"\n\n# General information about the project.\nproject = \"Featuretools\"\ncopyright = \"2019, Feature Labs. BSD License\"\nauthor = \"Feature Labs, Inc.\"\nlatex_documents = [\n    (master_doc, \"featuretools.tex\", \"test Documentation\", \"test\", \"manual\"),\n]\nlatex_elements = {\n    \"preamble\": r\"\"\"\n\\usepackage[utf8]{inputenc}\n\"\"\",\n}\n\n# The version info for the project you're documenting, acts as replacement for\n# |version| and |release|, also used in various other places throughout the\n# built documents.\n#\n# The short X.Y version.\nversion = featuretools.__version__\n# The full version, including alpha/beta/rc tags.\nrelease = featuretools.__version__\n\n# The language for content autogenerated by Sphinx. 
Refer to documentation\n# for a list of supported languages.\n#\n# This is also used if you do content translation via gettext catalogs.\n# Usually you set \"language\" from the command line for these cases.\nlanguage = \"en\"\n\n# There are two options for replacing |today|: either, you set today to some\n# non-false value, then it is used:\n# today = ''\n# Else, today_fmt is used as the format for a strftime call.\n# today_fmt = '%B %d, %Y'\n\n# List of patterns, relative to source directory, that match files and\n# directories to ignore when looking for source files.\n# This patterns also effect to html_static_path and html_extra_path\nexclude_patterns = [\"**.ipynb_checkpoints\"]\n\n# The reST default role (used for this markup: `text`) to use for all\n# documents.\n# default_role = None\n\n# If true, '()' will be appended to :func: etc. cross-reference text.\n# add_function_parentheses = True\n\n# If true, the current module name will be prepended to all description\n# unit titles (such as .. function::).\n# add_module_names = True\n\n# If true, sectionauthor and moduleauthor directives will be shown in the\n# output. They are ignored by default.\n# show_authors = False\n\n# The name of the Pygments (syntax highlighting) style to use.\npygments_style = \"sphinx\"\n\n# A list of ignored prefixes for module index sorting.\n# modindex_common_prefix = []\n\n# If true, keep warnings as \"system message\" paragraphs in the built documents.\n# keep_warnings = False\n\n# If true, `todo` and `todoList` produce output, else they produce nothing.\ntodo_include_todos = False\n\n\n# -- Options for HTML output ----------------------------------------------\n\n# The theme to use for HTML and HTML Help pages.  See the documentation for\n# a list of builtin themes.\nhtml_theme = \"pydata_sphinx_theme\"\n\n# Theme options are theme-specific and customize the look and feel of a theme\n# further.  
For a list of options available for each theme, see the\n# documentation.\nhtml_theme_options = {\n    \"pygment_light_style\": \"tango\",\n    \"pygment_dark_style\": \"native\",\n    \"icon_links\": [\n        {\n            \"name\": \"GitHub\",\n            \"url\": \"https://github.com/alteryx/featuretools\",\n            \"icon\": \"fab fa-github-square\",\n            \"type\": \"fontawesome\",\n        },\n        {\n            \"name\": \"Twitter\",\n            \"url\": \"https://twitter.com/AlteryxOSS\",\n            \"icon\": \"fab fa-twitter-square\",\n            \"type\": \"fontawesome\",\n        },\n        {\n            \"name\": \"Slack\",\n            \"url\": \"https://join.slack.com/t/alteryx-oss/shared_invite/zt-182tyvuxv-NzIn6eiCEf8TBziuKp0bNA\",\n            \"icon\": \"fab fa-slack\",\n            \"type\": \"fontawesome\",\n        },\n        {\n            \"name\": \"StackOverflow\",\n            \"url\": \"https://stackoverflow.com/questions/tagged/featuretools\",\n            \"icon\": \"fab fa-stack-overflow\",\n            \"type\": \"fontawesome\",\n        },\n    ],\n    \"collapse_navigation\": False,\n    \"navigation_depth\": 2,\n}\n\n# Add any paths that contain custom themes here, relative to this directory.\n# html_theme_path = []\n\n# The name for this set of Sphinx documents.\n# \"<project> v<release> documentation\" by default.\n# html_title = u'featuretools v0.1'\n\n# A shorter title for the navigation bar.  Default is the same as html_title.\n# html_short_title = None\n\n# The name of an image file (relative to this directory) to place at the top\n# of the sidebar.\nhtml_logo = \"_static/images/featuretools_nav2.svg\"\n\n# The name of an image file (relative to this directory) to use as a favicon of\n# the docs.  
This file should be a Windows icon file (.ico) being 16x16 or 32x32\n# pixels large.\nhtml_favicon = \"_static/images/favicon.ico\"\n\n# Add any paths that contain custom static files (such as style sheets) here,\n# relative to this directory. They are copied after the builtin static files,\n# so a file named \"default.css\" will overwrite the builtin \"default.css\".\nhtml_static_path = [\"_static\"]\n\n# Add any extra paths that contain custom files (such as robots.txt or\n# .htaccess) here, relative to this directory. These files are copied\n# directly to the root of the documentation.\n# html_extra_path = []\n\n# If not None, a 'Last updated on:' timestamp is inserted at every page\n# bottom, using the given strftime format.\n# The empty string is equivalent to '%b %d, %Y'.\n# html_last_updated_fmt = None\n\n# If true, SmartyPants will be used to convert quotes and dashes to\n# typographically correct entities.\n# html_use_smartypants = True\n\n# Custom sidebar templates, maps document names to template names.\nhtml_sidebars = {\n    \"**\": [\"globaltoc.html\", \"relations.html\", \"sourcelink.html\", \"searchbox.html\"],\n}\n\n\n# Additional templates that should be rendered to pages, maps page names to\n# template names.\n# html_additional_pages = {}\n\n# If false, no module index is generated.\n# html_domain_indices = True\n\n# If false, no index is generated.\n# html_use_index = True\n\n# If true, the index is split into individual pages for each letter.\n# html_split_index = False\n\n# If true, links to the reST sources are added to the pages.\n# html_show_sourcelink = True\n\n# If true, \"Created using Sphinx\" is shown in the HTML footer. Default is True.\nhtml_show_sphinx = False\n\n# If true, \"(C) Copyright ...\" is shown in the HTML footer. Default is True.\n# html_show_copyright = True\n\n# If true, an OpenSearch description file will be output, and all pages will\n# contain a <link> tag referring to it.  
The value of this option must be the\n# base URL from which the finished HTML is served.\n# html_use_opensearch = ''\n\n# This is the file name suffix for HTML files (e.g. \".xhtml\").\n# html_file_suffix = None\n\n# Language to be used for generating the HTML full-text search index.\n# Sphinx supports the following languages:\n#   'da', 'de', 'en', 'es', 'fi', 'fr', 'hu', 'it', 'ja'\n#   'nl', 'no', 'pt', 'ro', 'ru', 'sv', 'tr', 'zh'\n# html_search_language = 'en'\n\n# A dictionary with options for the search language support, empty by default.\n# 'ja' uses this config value.\n# 'zh' user can custom change `jieba` dictionary path.\n# html_search_options = {'type': 'default'}\n\n# The name of a javascript file (relative to the configuration directory) that\n# implements a search results scorer. If empty, the default will be used.\n# html_search_scorer = 'scorer.js'\n\n# Output file base name for HTML help builder.\nhtmlhelp_basename = \"featuretoolsdoc\"\n\n# -- Options for Markdown files ----------------------------------------------\n\nmyst_admonition_enable = True\nmyst_deflist_enable = True\nmyst_heading_anchors = 3\n\n# -- Options for Sphinx Copy Button ------------------------------------------\n\ncopybutton_prompt_text = r\">>> |\\.\\.\\. |\\$ |In \\[\\d*\\]: | {2,5}\\.\\.\\.: | {5,8}: \"\ncopybutton_prompt_is_regexp = True\n\n# -- Options for LaTeX output ---------------------------------------------\n\nlatex_elements = {\n    # The paper size ('letterpaper' or 'a4paper').\n    #'papersize': 'letterpaper',\n    # The font size ('10pt', '11pt' or '12pt').\n    #'pointsize': '10pt',\n    # Additional stuff for the LaTeX preamble.\n    #'preamble': '',\n    # Latex figure (float) alignment\n    #'figure_align': 'htbp',\n}\n\n# Grouping the document tree into LaTeX files. 
List of tuples\n# (source start file, target name, title,\n#  author, documentclass [howto, manual, or own class]).\nlatex_documents = [\n    (\n        master_doc,\n        \"featuretools.tex\",\n        \"Featuretools Documentation\",\n        \"Feature Labs, Inc.\",\n        \"manual\",\n    ),\n]\n\n# The name of an image file (relative to this directory) to place at the top of\n# the title page.\n# latex_logo = None\n\n# For \"manual\" documents, if this is true, then toplevel headings are parts,\n# not chapters.\n# latex_use_parts = False\n\n# If true, show page references after internal links.\n# latex_show_pagerefs = False\n\n# If true, show URL addresses after external links.\n# latex_show_urls = False\n\n# Documents to append as an appendix to all manuals.\n# latex_appendices = []\n\n# If false, no module index is generated.\n# latex_domain_indices = True\n\n\n# -- Options for manual page output ---------------------------------------\n\n# One entry per manual page. List of tuples\n# (source start file, name, description, authors, manual section).\nman_pages = [(master_doc, \"featuretools\", \"featuretools Documentation\", [author], 1)]\n\n# If true, show URL addresses after external links.\n# man_show_urls = False\n\n\n# -- Options for Texinfo output -------------------------------------------\n\n# Grouping the document tree into Texinfo files. 
List of tuples\n# (source start file, target name, title, author,\n#  dir menu entry, description, category)\ntexinfo_documents = [\n    (\n        master_doc,\n        \"featuretools\",\n        \"featuretools Documentation\",\n        author,\n        \"featuretools\",\n        \"One line description of project.\",\n        \"Miscellaneous\",\n    ),\n]\n\n# Documents to append as an appendix to all manuals.\n# texinfo_appendices = []\n\n# If false, no module index is generated.\n# texinfo_domain_indices = True\n\n# How to display URL addresses: 'footnote', 'no', or 'inline'.\n# texinfo_show_urls = 'footnote'\n\n# If true, do not generate a @detailmenu in the \"Top\" node's menu.\n# texinfo_no_detailmenu = False\n\nnbsphinx_execute = \"auto\"\n\nextlinks = {\n    \"issue\": (\"https://github.com/alteryx/featuretools/issues/%s\", \"GH#%s\"),\n    \"pr\": (\"https://github.com/alteryx/featuretools/pull/%s\", \"GH#%s\"),\n    \"user\": (\"https://github.com/%s\", \"@%s\"),\n}\n\n# Napoleon settings\nnapoleon_google_docstring = True\nnapoleon_numpy_docstring = True\nnapoleon_include_init_with_doc = False\nnapoleon_include_private_with_doc = False\nnapoleon_include_special_with_doc = True\nnapoleon_use_admonition_for_examples = False\nnapoleon_use_admonition_for_notes = False\nnapoleon_use_admonition_for_references = False\nnapoleon_use_ivar = False\nnapoleon_use_param = True\nnapoleon_use_rtype = True\n\n\ndef setup(app):\n    home_dir = os.environ.get(\"HOME\", \"/\")\n    ipython_p = Path(home_dir + \"/.ipython/profile_default/startup\")\n    ipython_p.mkdir(parents=True, exist_ok=True)\n    file_p = os.path.abspath(os.path.dirname(__file__))\n    shutil.copy(\n        file_p + \"/set-headers.py\",\n        home_dir + \"/.ipython/profile_default/startup\",\n    )\n    app.add_css_file(\"style.css\")\n"
  },
  {
    "path": "docs/source/getting_started/afe.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Deep Feature Synthesis\\n\",\n    \"\\n\",\n    \"Deep Feature Synthesis (DFS) is an automated method for performing feature engineering on relational and temporal data.\\n\",\n    \"\\n\",\n    \"## Input Data\\n\",\n    \"\\n\",\n    \"Deep Feature Synthesis requires structured datasets in order to perform feature engineering. To demonstrate the capabilities of DFS, we will use a mock customer transactions dataset.\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"raw\",\n   \"metadata\": {\n    \"raw_mimetype\": \"text/restructuredtext\"\n   },\n   \"source\": [\n    \".. note ::\\n\",\n    \"\\n\",\n    \"  Before using DFS, it is recommended that you prepare your data as an :class:`EntitySet`.  See :doc:`using_entitysets` to learn how.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import featuretools as ft\\n\",\n    \"\\n\",\n    \"es = ft.demo.load_mock_customer(return_entityset=True)\\n\",\n    \"es\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Once data is prepared as an `.EntitySet`, we are ready to automatically generate features for a target dataframe - e.g. `customers`.\\n\",\n    \"\\n\",\n    \"## Running DFS\\n\",\n    \"\\n\",\n    \"Typically, without automated feature engineering, a data scientist would write code to aggregate data for a customer, and apply different statistical functions resulting in features quantifying the customer's behavior. 
In this example, an expert might be interested in features such as: *total number of sessions* or *month the customer signed up*.\\n\",\n    \"\\n\",\n    \"These features can be generated by DFS when we specify `customers` as the target dataframe and `\\\"count\\\"` and `\\\"month\\\"` as primitives.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_matrix, feature_defs = ft.dfs(\\n\",\n    \"    entityset=es,\\n\",\n    \"    target_dataframe_name=\\\"customers\\\",\\n\",\n    \"    agg_primitives=[\\\"count\\\"],\\n\",\n    \"    trans_primitives=[\\\"month\\\"],\\n\",\n    \"    max_depth=1,\\n\",\n    \")\\n\",\n    \"feature_matrix\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"In the example above, `\\\"count\\\"` is an **aggregation primitive** because it computes a single value based on many sessions related to one customer. `\\\"month\\\"` is called a **transform primitive** because it takes one value for a customer and transforms it into another.\"\n   ]\n  },\n  {\n   \"cell_type\": \"raw\",\n   \"metadata\": {\n    \"raw_mimetype\": \"text/restructuredtext\"\n   },\n   \"source\": [\n    \".. note ::\\n\",\n    \"\\n\",\n    \"  Feature primitives are a fundamental component of Featuretools. To learn more, read :doc:`primitives`.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Creating \\\"Deep Features\\\"\\n\",\n    \"\\n\",\n    \"The name Deep Feature Synthesis comes from the algorithm's ability to stack primitives to generate more complex features. Each time we stack a primitive we increase the \\\"depth\\\" of a feature. The `max_depth` parameter controls the maximum depth of the features returned by DFS. 
Let us try running DFS with `max_depth=2`\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_matrix, feature_defs = ft.dfs(\\n\",\n    \"    entityset=es,\\n\",\n    \"    target_dataframe_name=\\\"customers\\\",\\n\",\n    \"    agg_primitives=[\\\"mean\\\", \\\"sum\\\", \\\"mode\\\"],\\n\",\n    \"    trans_primitives=[\\\"month\\\", \\\"hour\\\"],\\n\",\n    \"    max_depth=2,\\n\",\n    \")\\n\",\n    \"feature_matrix\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"raw_mimetype\": \"text/markdown\"\n   },\n   \"source\": [\n    \"With a depth of 2, a number of features are generated using the supplied primitives. The algorithm to synthesize these definitions is described in this [paper](https://www.jmaxkanter.com/papers/DSAA_DSM_2015.pdf). In the returned feature matrix, let us understand one of the depth 2 features\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_matrix[[\\\"MEAN(sessions.SUM(transactions.amount))\\\"]]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"raw_mimetype\": \"text/restructuredtext\"\n   },\n   \"source\": [\n    \"For each customer this feature\\n\",\n    \"\\n\",\n    \"1. calculates the ``sum`` of all transaction amounts per session to get total amount per session,\\n\",\n    \"2. 
then applies the ``mean`` to the total amounts across multiple sessions to identify the *average amount spent per session*\\n\",\n    \"\\n\",\n    \"We call this feature a \\\"deep feature\\\" with a depth of 2.\\n\",\n    \"\\n\",\n    \"Let's look at another depth 2 feature that calculates for every customer *the most common hour of the day when they start a session*\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_matrix[[\\\"MODE(sessions.HOUR(session_start))\\\"]]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"For each customer this feature\\n\",\n    \"\\n\",\n    \"1. calculates the `hour` of the day each of his or her sessions started,\\n\",\n    \"2. then uses the statistical function `mode` to identify the most common hour he or she started a session\\n\",\n    \"\\n\",\n    \"Stacking results in features that are more expressive than individual primitives themselves. This enables the automatic creation of complex patterns for machine learning.\"\n   ]\n  },\n  {\n   \"cell_type\": \"raw\",\n   \"metadata\": {\n    \"raw_mimetype\": \"text/restructuredtext\"\n   },\n   \"source\": [\n    \".. note ::\\n\",\n    \"    You can graphically visualize the lineage of a feature by calling :func:`featuretools.graph_feature` on it. You can also generate an English description of the feature with :func:`featuretools.describe_feature`. See :doc:`/guides/feature_descriptions` for more details.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Changing Target DataFrame\\n\",\n    \"\\n\",\n    \"DFS is powerful because we can create a feature matrix for any dataframe in our dataset. If we switch our target dataframe to \\\"sessions\\\", we can synthesize features for each session instead of each customer. 
Now, we can use these features to predict the outcome of a session.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_matrix, feature_defs = ft.dfs(\\n\",\n    \"    entityset=es,\\n\",\n    \"    target_dataframe_name=\\\"sessions\\\",\\n\",\n    \"    agg_primitives=[\\\"mean\\\", \\\"sum\\\", \\\"mode\\\"],\\n\",\n    \"    trans_primitives=[\\\"month\\\", \\\"hour\\\"],\\n\",\n    \"    max_depth=2,\\n\",\n    \")\\n\",\n    \"feature_matrix.head(5)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"raw_mimetype\": \"text/restructuredtext\"\n   },\n   \"source\": [\n    \"As we can see, DFS will also build deep features based on a parent dataframe, in this case the customer of a particular session. For example, the feature below calculates the mean transaction amount of the customer of the session.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_matrix[[\\\"customers.MEAN(transactions.amount)\\\"]].head(5)\"\n   ]\n  },\n  {\n   \"cell_type\": \"raw\",\n   \"metadata\": {\n    \"raw_mimetype\": \"text/restructuredtext\"\n   },\n   \"source\": [\n    \"Improve feature output\\n\",\n    \"~~~~~~~~~~~~~~~~~~~~~~\\n\",\n    \"\\n\",\n    \"To learn about the parameters to change in DFS, read :doc:`/guides/tuning_dfs`.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \".. here it may be nice to have a table that shows the number of features generated for AirBnB and other KAGGLE datasets once we have them. 
We can also give the user access to it.\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"celltoolbar\": \"Raw Cell Format\",\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.9.2\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 4\n}\n"
  },
  {
    "path": "docs/source/getting_started/getting_started_index.rst",
    "content": "Getting Started\n---------------\n\nFor a quick introduction to Featuretools, check out our :ref:`5 minute quick start guide <quick-start>`.\n\nHow to start working with Featuretools; the main concepts:\n\n.. toctree::\n   :maxdepth: 1\n\n   using_entitysets\n   afe\n   primitives\n   woodwork_types\n   handling_time\n"
  },
  {
    "path": "docs/source/getting_started/handling_time.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"a8104f18\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Handling Time\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"When performing feature engineering with temporal data, carefully selecting the data that is used for any calculation is paramount. By annotating dataframes with a Woodwork **time index** column and providing a **cutoff time** during feature calculation, Featuretools will automatically filter out any data after the cutoff time before running any calculations.\"\n   ]\n  },\n  {\n   \"cell_type\": \"raw\",\n   \"id\": \"9cd9cb82\",\n   \"metadata\": {\n    \"raw_mimetype\": \"text/restructuredtext\"\n   },\n   \"source\": [\n    \".. note::\\n\",\n    \"        This guide focuses on performing feature engineering on temporal data, but it is not specific to feature engineering for time series problems, which are their own class of machine learning problems. A guide on **using Featuretools for time series feature engineering** can be found `here <../guides/time_series.ipynb>`_.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"32c2ae4d\",\n   \"metadata\": {},\n   \"source\": [\n    \"## What is the Time Index?\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"The time index is the column in the data that specifies when the data in each row became known. 
For example, let's examine a table of customer transactions:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"ebbcb40b\",\n   \"metadata\": {\n    \"nbsphinx\": \"hidden\"\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"import pandas as pd\\n\",\n    \"\\n\",\n    \"pd.options.display.max_columns = 200\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"8202f11a\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import featuretools as ft\\n\",\n    \"\\n\",\n    \"es = ft.demo.load_mock_customer(return_entityset=True, random_seed=0)\\n\",\n    \"es[\\\"transactions\\\"].head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"cd26087b\",\n   \"metadata\": {},\n   \"source\": [\n    \"In this table, there is one row for every transaction and a ``transaction_time`` column that specifies when the transaction took place. This means that ``transaction_time`` is the time index because it indicates when the information in each row became known and available for feature calculations. For now, ignore the ``_ft_last_time`` column. That is a featuretools-generated column that will be discussed later on.\\n\",\n    \"\\n\",\n    \"However, not every datetime column is a time index. Consider the ``customers`` dataframe:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"87dd0a0d\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es[\\\"customers\\\"]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"c89d548d\",\n   \"metadata\": {},\n   \"source\": [\n    \"Here, we have two time columns, ``join_date`` and ``birthday``. 
While either column might be useful for making features, the ``join_date`` should be used as the time index because it indicates when that customer first became available in the dataset.\"\n   ]\n  },\n  {\n   \"cell_type\": \"raw\",\n   \"id\": \"85b51512\",\n   \"metadata\": {\n    \"raw_mimetype\": \"text/restructuredtext\"\n   },\n   \"source\": [\n    \".. important::\\n\",\n    \"\\n\",\n    \"    The **time index** is defined as the first time that any information from a row can be used. If a cutoff time is specified when calculating features, rows that have a later value for the time index are automatically ignored.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"00e3c365\",\n   \"metadata\": {},\n   \"source\": [\n    \"## What is the Cutoff Time?\\n\",\n    \"The **cutoff_time** specifies the last point in time that a row’s data can be used for a feature calculation. Any data after this point in time will be filtered out before calculating features.\\n\",\n    \"\\n\",\n    \"For example, let's consider a dataset of timestamped customer transactions, where we want to predict whether customers ``1``, ``2`` and ``3`` will spend $500 between ``04:00`` on January 1 and the end of the day. 
When building features for this prediction problem, we need to ensure that no data after ``04:00`` is used in our calculations.\\n\",\n    \"\\n\",\n    \"<img src=\\\"../_static/images/retail_ct.png\\\" width=\\\"400\\\" align=\\\"center\\\" alt=\\\"retail cutoff time diagram\\\">\"\n   ]\n  },\n  {\n   \"cell_type\": \"raw\",\n   \"id\": \"19855e77\",\n   \"metadata\": {\n    \"raw_mimetype\": \"text/restructuredtext\"\n   },\n   \"source\": [\n    \"We pass the cutoff time to :func:`featuretools.dfs` or :func:`featuretools.calculate_feature_matrix` using the ``cutoff_time`` argument like this:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"a0717f7d\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"fm, features = ft.dfs(\\n\",\n    \"    entityset=es,\\n\",\n    \"    target_dataframe_name=\\\"customers\\\",\\n\",\n    \"    cutoff_time=pd.Timestamp(\\\"2014-1-1 04:00\\\"),\\n\",\n    \"    instance_ids=[1, 2, 3],\\n\",\n    \"    cutoff_time_in_index=True,\\n\",\n    \")\\n\",\n    \"fm\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"feafa08d\",\n   \"metadata\": {},\n   \"source\": [\n    \"Even though the entityset contains the complete transaction history for each customer, only data with a time index up to and including the cutoff time was used to calculate the features above.\\n\",\n    \"\\n\",\n    \"## Using a Cutoff Time DataFrame\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Oftentimes, the training examples for machine learning will come from different points in time. To specify a unique cutoff time for each row of the resulting feature matrix, we can pass a dataframe which includes one column for the instance id and another column for the corresponding cutoff time. These columns can be in any order, but they must be named properly. The column with the instance ids must either be named ``instance_id`` or have the same name as the target dataframe ``index``. 
The column with the cutoff time values must either be named ``time`` or have the same name as the target dataframe ``time_index``.\\n\",\n    \"\\n\",\n    \"The column names for the instance ids and the cutoff time values should be unambiguous. Passing a dataframe that contains both a column with the same name as the target dataframe ``index`` and a column named ``instance_id`` will result in an error. Similarly, if the cutoff time dataframe contains both a column with the same name as the target dataframe ``time_index`` and a column named ``time`` an error will be raised.\"\n   ]\n  },\n  {\n   \"cell_type\": \"raw\",\n   \"id\": \"6ffaffd0\",\n   \"metadata\": {\n    \"raw_mimetype\": \"text/restructuredtext\"\n   },\n   \"source\": [\n    \".. note::\\n\",\n    \"\\n\",\n    \"    Only the columns corresponding to the instance ids and the cutoff times are used to calculate features. Any additional columns passed through are appended to the resulting feature matrix. This is typically used to pass through machine learning labels to ensure that they stay aligned with the feature matrix.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"fa5cc115\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"cutoff_times = pd.DataFrame()\\n\",\n    \"cutoff_times[\\\"customer_id\\\"] = [1, 2, 3, 1]\\n\",\n    \"cutoff_times[\\\"time\\\"] = pd.to_datetime(\\n\",\n    \"    [\\\"2014-1-1 04:00\\\", \\\"2014-1-1 05:00\\\", \\\"2014-1-1 06:00\\\", \\\"2014-1-1 08:00\\\"]\\n\",\n    \")\\n\",\n    \"cutoff_times[\\\"label\\\"] = [True, True, False, True]\\n\",\n    \"cutoff_times\\n\",\n    \"fm, features = ft.dfs(\\n\",\n    \"    entityset=es,\\n\",\n    \"    target_dataframe_name=\\\"customers\\\",\\n\",\n    \"    cutoff_time=cutoff_times,\\n\",\n    \"    cutoff_time_in_index=True,\\n\",\n    \")\\n\",\n    \"fm\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"6185bb0d\",\n   \"metadata\": {},\n   
\"source\": [\n    \"We can now see that every row of the feature matrix is calculated at the corresponding time in the cutoff time dataframe. Because we calculate each row at a different time, it is possible to have a repeat customer. In this case, we calculated the feature vector for customer 1 at both ``04:00`` and ``08:00``.\\n\",\n    \"\\n\",\n    \"Training Window\\n\",\n    \"---------------\\n\",\n    \"\\n\",\n    \"By default, all data up to and including the cutoff time is used. We can restrict the amount of historical data that is selected for calculations using a \\\"training window.\\\"\\n\",\n    \"\\n\",\n    \"Here's an example of using a two-hour training window:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"e321d463\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"window_fm, window_features = ft.dfs(\\n\",\n    \"    entityset=es,\\n\",\n    \"    target_dataframe_name=\\\"customers\\\",\\n\",\n    \"    cutoff_time=cutoff_times,\\n\",\n    \"    cutoff_time_in_index=True,\\n\",\n    \"    training_window=\\\"2 hour\\\",\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"window_fm\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"4ee67c4d\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can see that the counts for the same feature are lower after we shorten the training window:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"93d6b9ae\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"fm[[\\\"COUNT(transactions)\\\"]]\\n\",\n    \"\\n\",\n    \"window_fm[[\\\"COUNT(transactions)\\\"]]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"ad7c73c4\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Setting a Last Time Index\\n\",\n    \"\\n\",\n    \"The training window in Featuretools limits the amount of past data that can be used while calculating a particular feature vector. 
A row in the dataframe is filtered out if the value of its time index is either before or after the training window. This works for dataframes where a row occurs at a single point in time. However, a row can sometimes exist for a duration.\\n\",\n    \"\\n\",\n    \"For example, a customer's session has multiple transactions which can happen at different points in time. If we are trying to count the number of sessions a user has in a given time period, we often want to count all the sessions that had *any* transaction during the training window. To accomplish this, we need to know not only when a session starts, but also when it ends. The last time that an instance appears in the data is stored in the `_ft_last_time` column on the dataframe. We can compare the time index and the last time index of the ``sessions`` dataframe above:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"493c8193\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"last_time_index_col = es[\\\"sessions\\\"].ww.metadata.get(\\\"last_time_index\\\")\\n\",\n    \"es[\\\"sessions\\\"][[\\\"session_start\\\", last_time_index_col]].head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"b7f1c5cb\",\n   \"metadata\": {},\n   \"source\": [\n    \"Featuretools can automatically add last time indexes to every DataFrame in an ``EntitySet`` by running ``EntitySet.add_last_time_indexes()``. When using a training window, if a `last_time_index` has been set, Featuretools will check to see if the `last_time_index` is after the start of the training window. 
That, combined with the cutoff time, allows DFS to discover which data is relevant for a given training window.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"## Excluding data at cutoff times\"\n   ]\n  },\n  {\n   \"cell_type\": \"raw\",\n   \"id\": \"b44bee57\",\n   \"metadata\": {\n    \"raw_mimetype\": \"text/restructuredtext\"\n   },\n   \"source\": [\n    \"The ``cutoff_time`` is the last point in time where data can be used for feature\\n\",\n    \"calculation. If you don't want to use the data at the cutoff time in feature\\n\",\n    \"calculation, you can exclude that data by setting ``include_cutoff_time`` to\\n\",\n    \"``False`` in :func:`featuretools.dfs` or :func:`featuretools.calculate_feature_matrix`.\\n\",\n    \"If you set it to ``True`` (the default behavior), data from the cutoff time point\\n\",\n    \"will be used.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"2e92d895\",\n   \"metadata\": {},\n   \"source\": [\n    \"Setting ``include_cutoff_time`` to ``False`` also impacts how data at the edges\\n\",\n    \"of training windows is included or excluded.  Take this slice of data as an example:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"76f9676f\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df = es[\\\"transactions\\\"]\\n\",\n    \"df[df[\\\"session_id\\\"] == 1].head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"ce77f6fd\",\n   \"metadata\": {},\n   \"source\": [\n    \"Looking at the data, transactions occur every 65 seconds.  To check how ``include_cutoff_time``\\n\",\n    \"affects training windows, we can calculate features at the time of a transaction\\n\",\n    \"while using a 65 second training window.  This creates a training window with a\\n\",\n    \"transaction at both endpoints of the window.  
For this example, we'll find the sum\\n\",\n    \"of all transactions for session id 1 that are in the training window.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"1841d78b\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from featuretools.primitives import Sum\\n\",\n    \"\\n\",\n    \"sum_log = ft.Feature(\\n\",\n    \"    es[\\\"transactions\\\"].ww[\\\"amount\\\"],\\n\",\n    \"    parent_dataframe_name=\\\"sessions\\\",\\n\",\n    \"    primitive=Sum,\\n\",\n    \")\\n\",\n    \"cutoff_time = pd.DataFrame(\\n\",\n    \"    {\\n\",\n    \"        \\\"session_id\\\": [1],\\n\",\n    \"        \\\"time\\\": [\\\"2014-01-01 00:04:20\\\"],\\n\",\n    \"    }\\n\",\n    \").astype({\\\"time\\\": \\\"datetime64[ns]\\\"})\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"3c15be10\",\n   \"metadata\": {},\n   \"source\": [\n    \"With ``include_cutoff_time=True``, the oldest point in the training window\\n\",\n    \"(``2014-01-01 00:03:15``) is excluded and the cutoff time point is included. This\\n\",\n    \"means only transaction 371 is in the training window, so the sum of all transaction\\n\",\n    \"amounts is 31.54\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"f782683a\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Case1. 
include_cutoff_time = True\\n\",\n    \"actual = ft.calculate_feature_matrix(\\n\",\n    \"    features=[sum_log],\\n\",\n    \"    entityset=es,\\n\",\n    \"    cutoff_time=cutoff_time,\\n\",\n    \"    cutoff_time_in_index=True,\\n\",\n    \"    training_window=\\\"65 seconds\\\",\\n\",\n    \"    include_cutoff_time=True,\\n\",\n    \")\\n\",\n    \"actual\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"324246db\",\n   \"metadata\": {},\n   \"source\": [\n    \"Whereas with ``include_cutoff_time=False``, the oldest point in the window is\\n\",\n    \"included and the cutoff time point is excluded.  So in this case transaction 116\\n\",\n    \"is included and transaction 371 is excluded, and the sum is 78.92\\n\",\n    \"\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"9b63bc68\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Case2. include_cutoff_time = False\\n\",\n    \"actual = ft.calculate_feature_matrix(\\n\",\n    \"    features=[sum_log],\\n\",\n    \"    entityset=es,\\n\",\n    \"    cutoff_time=cutoff_time,\\n\",\n    \"    cutoff_time_in_index=True,\\n\",\n    \"    training_window=\\\"65 seconds\\\",\\n\",\n    \"    include_cutoff_time=False,\\n\",\n    \")\\n\",\n    \"actual\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"4329314f\",\n   \"metadata\": {},\n   \"source\": [\n    \"Approximating Features by Rounding Cutoff Times\\n\",\n    \"-----------------------------------------------\\n\",\n    \"\\n\",\n    \"For each unique cutoff time, Featuretools must perform operations to select the data that’s valid for computations. If there are a large number of unique cutoff times relative to the number of instances for which we are calculating features, the time spent filtering data can add up. 
By reducing the number of unique cutoff times, we minimize the overhead from searching for and extracting data for feature calculations.\\n\",\n    \"\\n\",\n    \"One way to decrease the number of unique cutoff times is to round cutoff times to an earlier point in time. An earlier cutoff time is always valid for predictive modeling — it just means we’re not using some of the data we could potentially use while calculating that feature. So, we gain computational speed by losing a small amount of information.\\n\",\n    \"\\n\",\n    \"To understand when an approximation is useful, consider calculating features for a model to predict fraudulent credit card transactions. In this case, an important feature might be, \\\"the average transaction amount for this card in the past\\\". While this value can change every time there is a new transaction, updating it less frequently might not impact accuracy.\"\n   ]\n  },\n  {\n   \"cell_type\": \"raw\",\n   \"id\": \"3628cc1c\",\n   \"metadata\": {\n    \"raw_mimetype\": \"text/restructuredtext\"\n   },\n   \"source\": [\n    \".. note::\\n\",\n    \"\\n\",\n    \"    The bank BBVA used approximation when building a predictive model for credit card fraud using Featuretools. For more details, see the \\\"Real-time deployment considerations\\\" section of the `white paper <https://arxiv.org/abs/1710.07709>`_ describing the work involved.\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"raw\",\n   \"id\": \"4bf10090\",\n   \"metadata\": {\n    \"raw_mimetype\": \"text/restructuredtext\"\n   },\n   \"source\": [\n    \"The frequency of approximation is controlled using the ``approximate`` parameter to :func:`featuretools.dfs` or :func:`featuretools.calculate_feature_matrix`. 
For example, the following code would approximate aggregation features at 1-day intervals::\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"641981d0\",\n   \"metadata\": {},\n   \"source\": [\n    \"    fm = ft.calculate_feature_matrix(features=features,\\n\",\n    \"                                     entityset=es_transactions,\\n\",\n    \"                                     cutoff_time=ct_transactions,\\n\",\n    \"                                     approximate=\\\"1 day\\\")\\n\",\n    \"\\n\",\n    \"In this computation, features that can be approximated will be calculated at 1-day intervals, while features that cannot be approximated (e.g., \\\"where did this transaction occur?\\\") will be calculated at the exact cutoff time.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"## Secondary Time Index\\n\",\n    \"\\n\",\n    \"It is sometimes the case that information in a dataset is updated or added after a row has been created. This means that certain columns may actually become known after the time index for a row. Rather than drop those columns to avoid leaking information, we can create a secondary time index to indicate when those columns become known.\"\n   ]\n  },\n  {\n   \"cell_type\": \"raw\",\n   \"id\": \"6f8197f9\",\n   \"metadata\": {\n    \"raw_mimetype\": \"text/restructuredtext\"\n   },\n   \"source\": [\n    \"The :func:`Flights <featuretools.demo.load_flight>` entityset is a good example of a dataset where column values in a row become known at different times. 
Each trip is recorded in the ``trip_logs`` dataframe, and has many times associated with it.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"d6043477\",\n   \"metadata\": {\n    \"nbsphinx\": \"hidden\"\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"import urllib.request as urllib2\\n\",\n    \"\\n\",\n    \"opener = urllib2.build_opener()\\n\",\n    \"opener.addheaders = [(\\\"Testing\\\", \\\"True\\\")]\\n\",\n    \"urllib2.install_opener(opener)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"abf92463\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es_flight = ft.demo.load_flight(nrows=100)\\n\",\n    \"es_flight\\n\",\n    \"es_flight[\\\"trip_logs\\\"].head(3)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"36827ff9\",\n   \"metadata\": {},\n   \"source\": [\n    \"For every trip log, the time index is ``date_scheduled``, which is when the airline decided on the scheduled departure and arrival times, as well as what route will be flown. We don't know the rest of the information about the actual departure/arrival times and the details of any delay at this time. 
However, it is possible to know everything about how a trip went after it has arrived, so we can use that information at any time after the flight lands.\\n\",\n    \"\\n\",\n    \"Using a secondary time index, we can indicate to Featuretools which columns in our flight logs are known at the time the flight is scheduled, plus which are known at the time the flight lands.\\n\",\n    \"\\n\",\n    \"<img src=\\\"../_static/images/flight_ti_2.png\\\" width=\\\"400\\\" align=\\\"center\\\" alt=\\\"flight secondary time index diagram\\\">\\n\",\n    \"\\n\",\n    \"In Featuretools, when adding the dataframe to the ``EntitySet``, we set the secondary time index to be the arrival time like this:\\n\",\n    \"\\n\",\n    \"    es = ft.EntitySet('Flight Data')\\n\",\n    \"    arr_time_columns = ['arr_delay', 'dep_delay', 'carrier_delay', 'weather_delay',\\n\",\n    \"                        'national_airspace_delay', 'security_delay',\\n\",\n    \"                        'late_aircraft_delay', 'canceled', 'diverted',\\n\",\n    \"                        'taxi_in', 'taxi_out', 'air_time', 'dep_time']\\n\",\n    \"\\n\",\n    \"    es.add_dataframe(\\n\",\n    \"        dataframe_name='trip_logs',\\n\",\n    \"        dataframe=data,\\n\",\n    \"        index='trip_log_id',\\n\",\n    \"        make_index=True,\\n\",\n    \"        time_index='date_scheduled',\\n\",\n    \"        secondary_time_index={'arr_time': arr_time_columns})\\n\",\n    \"\\n\",\n    \"By setting a secondary time index, we can still use the delay information from a row, but only when it becomes known.\"\n   ]\n  },\n  {\n   \"cell_type\": \"raw\",\n   \"id\": \"eaef7ec8\",\n   \"metadata\": {\n    \"raw_mimetype\": \"text/restructuredtext\"\n   },\n   \"source\": [\n    \".. hint::\\n\",\n    \"\\n\",\n    \"    It's often a good idea to use a secondary time index if your entityset has inline labels. 
If you know when the label would be valid for use, it's possible to automatically create very predictive features using historical labels.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"03448def\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Flight Predictions\\n\",\n    \"\\n\",\n    \"Let's make some features at varying times using the flight example described above. Trip ``14`` is a flight from CLT to PHX on January 31, 2017 and trip ``92`` is a flight from PIT to DFW on January 1. We can set any cutoff time before the flight is scheduled to depart, emulating how we would make the prediction at that point in time.\\n\",\n    \"\\n\",\n    \"We set two cutoff times for trip ``14`` at two different times: one which is more than a month before the flight and another which is only 5 days before. For trip ``92``, we'll only set one cutoff time, three days before it is scheduled to leave.\\n\",\n    \"\\n\",\n    \"<img src=\\\"../_static/images/flight_ct.png\\\" width=\\\"500\\\" align=\\\"center\\\" alt=\\\"flight cutoff time diagram\\\">\\n\",\n    \"\\n\",\n    \"Our cutoff time dataframe looks like this:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"c338105b\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"ct_flight = pd.DataFrame()\\n\",\n    \"ct_flight[\\\"trip_log_id\\\"] = [14, 14, 92]\\n\",\n    \"ct_flight[\\\"time\\\"] = pd.to_datetime([\\\"2016-12-28\\\", \\\"2017-1-25\\\", \\\"2016-12-28\\\"])\\n\",\n    \"ct_flight[\\\"label\\\"] = [True, True, False]\\n\",\n    \"ct_flight\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"f26db5dd\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, let's calculate the feature matrix:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"bd56c24e\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"fm, features = ft.dfs(\\n\",\n    \"    
entityset=es_flight,\\n\",\n    \"    target_dataframe_name=\\\"trip_logs\\\",\\n\",\n    \"    cutoff_time=ct_flight,\\n\",\n    \"    cutoff_time_in_index=True,\\n\",\n    \"    agg_primitives=[\\\"max\\\"],\\n\",\n    \"    trans_primitives=[\\\"month\\\"],\\n\",\n    \")\\n\",\n    \"fm[\\n\",\n    \"    [\\n\",\n    \"        \\\"flights.origin\\\",\\n\",\n    \"        \\\"flights.dest\\\",\\n\",\n    \"        \\\"label\\\",\\n\",\n    \"        \\\"flights.MAX(trip_logs.arr_delay)\\\",\\n\",\n    \"        \\\"MONTH(scheduled_dep_time)\\\",\\n\",\n    \"    ]\\n\",\n    \"]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"f367279c\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's understand the output:\\n\",\n    \"\\n\",\n    \"1. A row was made for every id-time pair in ``ct_flight``, which is returned as the index of the feature matrix.\\n\",\n    \"\\n\",\n    \"2. The output was sorted by cutoff time. Because of the sorting, it's often helpful to pass in a label with the cutoff time dataframe so that it will remain sorted in the same fashion as the feature matrix. Any additional columns beyond ``id`` and ``cutoff_time`` will not be used for making features.\\n\",\n    \"\\n\",\n    \"3. The column ``flights.MAX(trip_logs.arr_delay)`` is not always defined. It can only have any real values when there are historical flights to aggregate. Notice that, for trip ``14``, there wasn't any historical data when we made the feature a month in advance, but there **were** flights to aggregate when we shortened it to 5 days. 
These are powerful features that are often excluded in manual processes because of how hard they are to make.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Creating and Flattening a Feature Tensor\\n\",\n    \"----------------------------------------\"\n   ]\n  },\n  {\n   \"cell_type\": \"raw\",\n   \"id\": \"3d5f23cc\",\n   \"metadata\": {\n    \"raw_mimetype\": \"text/restructuredtext\"\n   },\n   \"source\": [\n    \"The :func:`featuretools.make_temporal_cutoffs` function generates a series of equally spaced cutoff times from a given set of cutoff times and instance ids.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"a7b677e7\",\n   \"metadata\": {},\n   \"source\": [\n    \"This function can be paired with DFS to create and flatten a feature tensor rather than making multiple feature matrices at different delays.\\n\",\n    \"\\n\",\n    \"The function\\n\",\n    \"takes in the following parameters:\\n\",\n    \"\\n\",\n    \" * ``instance_ids (list, pd.Series, or np.ndarray)``: A list of instances.\\n\",\n    \" * ``cutoffs (list, pd.Series, or np.ndarray)``: An associated list of cutoff times.\\n\",\n    \" * ``window_size (str or pandas.DateOffset)``: The amount of time between each cutoff time in the created time series.\\n\",\n    \" * ``start (datetime.datetime or pd.Timestamp)``: The first cutoff time in the created time series.\\n\",\n    \" * ``num_windows (int)``: The number of cutoff times to create in the created time series.\\n\",\n    \"\\n\",\n    \"Only two of the three options ``window_size``, ``start``, and ``num_windows`` need to be specified to uniquely determine an equally-spaced set of cutoff times at which to compute each instance.\\n\",\n    \"\\n\",\n    \"If your cutoff times are the ones used above:\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"e7648a9d\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"cutoff_times\"\n   ]\n  },\n  {\n   
\"cell_type\": \"markdown\",\n   \"id\": \"9bda6ff4\",\n   \"metadata\": {},\n   \"source\": [\n    \"Then passing in ``window_size='1h'`` and ``num_windows=2`` makes one row an hour over the last two hours to produce the following new dataframe. The result can be directly passed into DFS to make features at the different time points.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"b4204f47\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"temporal_cutoffs = ft.make_temporal_cutoffs(\\n\",\n    \"    cutoff_times[\\\"customer_id\\\"], cutoff_times[\\\"time\\\"], window_size=\\\"1h\\\", num_windows=2\\n\",\n    \")\\n\",\n    \"temporal_cutoffs\\n\",\n    \"fm, features = ft.dfs(\\n\",\n    \"    entityset=es,\\n\",\n    \"    target_dataframe_name=\\\"customers\\\",\\n\",\n    \"    cutoff_time=temporal_cutoffs,\\n\",\n    \"    cutoff_time_in_index=True,\\n\",\n    \")\\n\",\n    \"fm\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"celltoolbar\": \"Raw Cell Format\",\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.9.2\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 5\n}\n"
  },
  {
    "path": "docs/source/getting_started/primitives.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"raw\",\n   \"metadata\": {\n    \"raw_mimetype\": \"text/restructuredtext\"\n   },\n   \"source\": [\n    \".. _primitives:\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Feature primitives\\n\",\n    \"Feature primitives are the building blocks of Featuretools. They define individual computations that can be applied to raw datasets to create new features. Because a primitive only constrains the input and output data types, they can be applied across datasets and can stack to create new calculations.\\n\",\n    \"\\n\",\n    \"## Why primitives?\\n\",\n    \"The space of potential functions that humans use to create a feature is expansive. By breaking common feature engineering calculations down into primitive components, we are able to capture the underlying structure of the features humans create today.\\n\",\n    \"\\n\",\n    \"A primitive only constrains the input and output data types. This means they can be used to transfer calculations known in one domain to another. Consider a feature which is often calculated by data scientists for transactional or event logs data: *average time between events*. 
This feature is incredibly valuable in predicting fraudulent behavior or future customer engagement.\\n\",\n    \"\\n\",\n    \"DFS achieves the same feature by stacking two primitives, `\\\"time_since_previous\\\"` and `\\\"mean\\\"`:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import featuretools as ft\\n\",\n    \"\\n\",\n    \"es = ft.demo.load_mock_customer(return_entityset=True)\\n\",\n    \"\\n\",\n    \"feature_defs = ft.dfs(\\n\",\n    \"    entityset=es,\\n\",\n    \"    target_dataframe_name=\\\"customers\\\",\\n\",\n    \"    agg_primitives=[\\\"mean\\\"],\\n\",\n    \"    trans_primitives=[\\\"time_since_previous\\\"],\\n\",\n    \"    features_only=True,\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"feature_defs\"\n   ]\n  },\n  {\n   \"cell_type\": \"raw\",\n   \"metadata\": {\n    \"raw_mimetype\": \"text/restructuredtext\"\n   },\n   \"source\": [\n    \".. note:: \\n\",\n    \"\\n\",\n    \"    The primitive arguments to DFS (e.g., ``agg_primitives`` and ``trans_primitives`` in the example above) accept ``snake_case``, ``camelCase``, or ``TitleCase`` strings of included Featuretools primitives (i.e., ``time_since_previous``,  ``timeSincePrevious``, and  ``TimeSincePrevious`` are all acceptable inputs).\\n\",\n    \"\\n\",\n    \".. note::\\n\",\n    \"\\n\",\n    \"    When ``dfs`` is called with ``features_only=True``, only feature definitions are returned as output. By default this parameter is set to ``False``. This parameter is used to quickly inspect the feature definitions before spending time calculating the feature matrix.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"A second advantage of primitives is that they can be used to quickly enumerate many interesting features in a parameterized way. 
This is used by Deep Feature Synthesis to get several different ways of summarizing the time since the previous event.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_matrix, feature_defs = ft.dfs(\\n\",\n    \"    entityset=es,\\n\",\n    \"    target_dataframe_name=\\\"customers\\\",\\n\",\n    \"    agg_primitives=[\\\"mean\\\", \\\"max\\\", \\\"min\\\", \\\"std\\\", \\\"skew\\\"],\\n\",\n    \"    trans_primitives=[\\\"time_since_previous\\\"],\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"feature_matrix[\\n\",\n    \"    [\\n\",\n    \"        \\\"MEAN(sessions.TIME_SINCE_PREVIOUS(session_start))\\\",\\n\",\n    \"        \\\"MAX(sessions.TIME_SINCE_PREVIOUS(session_start))\\\",\\n\",\n    \"        \\\"MIN(sessions.TIME_SINCE_PREVIOUS(session_start))\\\",\\n\",\n    \"        \\\"STD(sessions.TIME_SINCE_PREVIOUS(session_start))\\\",\\n\",\n    \"        \\\"SKEW(sessions.TIME_SINCE_PREVIOUS(session_start))\\\",\\n\",\n    \"    ]\\n\",\n    \"]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Aggregation vs Transform Primitive\\n\",\n    \"\\n\",\n    \"In the example above, we use two types of primitives.\\n\",\n    \"\\n\",\n    \"**Aggregation primitives:** These primitives take related instances as an input and output a single value. They are applied across a parent-child relationship in an EntitySet. E.g: `\\\"count\\\"`, `\\\"sum\\\"`, `\\\"avg_time_between\\\"`.\"\n   ]\n  },\n  {\n   \"cell_type\": \"raw\",\n   \"metadata\": {\n    \"raw_mimetype\": \"text/restructuredtext\"\n   },\n   \"source\": [\n    \".. graphviz:: graphs/agg_feat.dot\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"**Transform primitives:** These primitives take one or more columns from a dataframe as an input and output a new column for that dataframe. 
They are applied to a single dataframe. E.g: `\\\"hour\\\"`, `\\\"time_since_previous\\\"`, `\\\"absolute\\\"`.\"\n   ]\n  },\n  {\n   \"cell_type\": \"raw\",\n   \"metadata\": {\n    \"raw_mimetype\": \"text/restructuredtext\"\n   },\n   \"source\": [\n    \".. graphviz:: graphs/trans_feat.dot\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"The above graphs were generated using the :func:`graph_feature <featuretools.graph_feature>` function. These feature lineage graphs help to visually show how primitives were stacked to generate a feature.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"For a DataFrame that lists and describes each built-in primitive in Featuretools, call `ft.list_primitives()`.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"ft.list_primitives().head(5)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"For a DataFrame of metrics that summarizes various properties and capabilities of all of the built-in primitives in Featuretools, call `ft.summarize_primitives()`.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"ft.summarize_primitives()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining Custom Primitives\\n\",\n    \"\\n\",\n    \"The library of primitives in Featuretools is constantly expanding.  Users can define their own primitive using the APIs below.  
To define a primitive, a user will:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"  * Specify the type of primitive: `Aggregation` or `Transform`\\n\",\n    \"  * Define the input and output data types\\n\",\n    \"  * Write a function in Python to do the calculation\\n\",\n    \"  * Annotate with attributes to constrain how it is applied\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Once a primitive is defined, it can stack with existing primitives to generate complex patterns. This enables primitives known to be important for one domain to automatically be transferred to another.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import pandas as pd\\n\",\n    \"from woodwork.column_schema import ColumnSchema\\n\",\n    \"from woodwork.logical_types import Datetime, NaturalLanguage\\n\",\n    \"\\n\",\n    \"from featuretools.primitives import AggregationPrimitive, TransformPrimitive\\n\",\n    \"from featuretools.tests.testing_utils import make_ecommerce_entityset\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"raw_mimetype\": \"text/markdown\"\n   },\n   \"source\": [\n    \"### Simple Custom Primitives\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"class Absolute(TransformPrimitive):\\n\",\n    \"    name = \\\"absolute\\\"\\n\",\n    \"    input_types = [ColumnSchema(semantic_tags={\\\"numeric\\\"})]\\n\",\n    \"    return_type = ColumnSchema(semantic_tags={\\\"numeric\\\"})\\n\",\n    \"\\n\",\n    \"    def get_function(self):\\n\",\n    \"        def absolute(column):\\n\",\n    \"            return abs(column)\\n\",\n    \"\\n\",\n    \"        return absolute\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"raw_mimetype\": \"text/markdown\"\n   },\n   \"source\": [\n    \"Above, we created a new transform 
primitive that can be used with Deep Feature Synthesis by deriving a new primitive class using `TransformPrimitive` as a base and overriding `get_function` to return a function that calculates the feature. Additionally, we set the input data types that the primitive applies to and the return data type. Input and return data types are defined using a Woodwork ColumnSchema. A full guide on Woodwork logical types and semantic tags can be found in the Woodwork [Understanding Logical Types and Semantic Tags](https://woodwork.alteryx.com/en/stable/guides/logical_types_and_semantic_tags.html) guide.\\n\",\n    \"\\n\",\n    \"Similarly, we can make a new aggregation primitive using `AggregationPrimitive`.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"class Maximum(AggregationPrimitive):\\n\",\n    \"    name = \\\"maximum\\\"\\n\",\n    \"    input_types = [ColumnSchema(semantic_tags={\\\"numeric\\\"})]\\n\",\n    \"    return_type = ColumnSchema(semantic_tags={\\\"numeric\\\"})\\n\",\n    \"\\n\",\n    \"    def get_function(self):\\n\",\n    \"        def maximum(column):\\n\",\n    \"            return max(column)\\n\",\n    \"\\n\",\n    \"        return maximum\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"raw_mimetype\": \"text/markdown\"\n   },\n   \"source\": [\n    \"Because we defined an aggregation primitive, the function takes in a list of values but only returns one.\\n\",\n    \"\\n\",\n    \"Now that we've defined two primitives, we can use them with the dfs function as if they were built-in primitives.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_matrix, feature_defs = ft.dfs(\\n\",\n    \"    entityset=es,\\n\",\n    \"    target_dataframe_name=\\\"sessions\\\",\\n\",\n    \"    agg_primitives=[Maximum],\\n\",\n    
\"    trans_primitives=[Absolute],\\n\",\n    \"    max_depth=2,\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"feature_matrix.head(5)[\\n\",\n    \"    [\\n\",\n    \"        \\\"customers.MAXIMUM(transactions.amount)\\\",\\n\",\n    \"        \\\"MAXIMUM(transactions.ABSOLUTE(amount))\\\",\\n\",\n    \"    ]\\n\",\n    \"]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"raw_mimetype\": \"text/restructuredtext\"\n   },\n   \"source\": [\n    \"### Word Count Example\\n\",\n    \"\\n\",\n    \"Here we define a transform primitive, `WordCount`, which counts the number of words in each row of an input and returns a list of the counts.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"class WordCount(TransformPrimitive):\\n\",\n    \"    \\\"\\\"\\\"\\n\",\n    \"    Counts the number of words in each row of the column. Returns a list\\n\",\n    \"    of the counts for each row.\\n\",\n    \"    \\\"\\\"\\\"\\n\",\n    \"\\n\",\n    \"    name = \\\"word_count\\\"\\n\",\n    \"    input_types = [ColumnSchema(logical_type=NaturalLanguage)]\\n\",\n    \"    return_type = ColumnSchema(semantic_tags={\\\"numeric\\\"})\\n\",\n    \"\\n\",\n    \"    def get_function(self):\\n\",\n    \"        def word_count(column):\\n\",\n    \"            word_counts = []\\n\",\n    \"            for value in column:\\n\",\n    \"                words = value.split(None)\\n\",\n    \"                word_counts.append(len(words))\\n\",\n    \"            return word_counts\\n\",\n    \"\\n\",\n    \"        return word_count\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es = make_ecommerce_entityset()\\n\",\n    \"\\n\",\n    \"feature_matrix, features = ft.dfs(\\n\",\n    \"    entityset=es,\\n\",\n    \"    target_dataframe_name=\\\"sessions\\\",\\n\",\n    \"    
agg_primitives=[\\\"sum\\\", \\\"mean\\\", \\\"std\\\"],\\n\",\n    \"    trans_primitives=[WordCount],\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"feature_matrix[\\n\",\n    \"    [\\n\",\n    \"        \\\"customers.WORD_COUNT(favorite_quote)\\\",\\n\",\n    \"        \\\"STD(log.WORD_COUNT(comments))\\\",\\n\",\n    \"        \\\"SUM(log.WORD_COUNT(comments))\\\",\\n\",\n    \"        \\\"MEAN(log.WORD_COUNT(comments))\\\",\\n\",\n    \"    ]\\n\",\n    \"]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"raw_mimetype\": \"text/markdown\"\n   },\n   \"source\": [\n    \"By adding some aggregation primitives as well, Deep Feature Synthesis was able to make four new features from one new primitive.\\n\",\n    \"\\n\",\n    \"### Multiple Input Types\\n\",\n    \"\\n\",\n    \"If a primitive requires multiple features as input, `input_types` has multiple elements. For example, `[ColumnSchema(semantic_tags={'numeric'}), ColumnSchema(semantic_tags={'numeric'})]` means the primitive requires two columns with the semantic tag `numeric` as input. 
Below is an example of a primitive that has multiple input features.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"class MeanSunday(AggregationPrimitive):\\n\",\n    \"    \\\"\\\"\\\"\\n\",\n    \"    Finds the mean of non-null values of a feature that occurred on Sundays\\n\",\n    \"    \\\"\\\"\\\"\\n\",\n    \"\\n\",\n    \"    name = \\\"mean_sunday\\\"\\n\",\n    \"    input_types = [\\n\",\n    \"        ColumnSchema(semantic_tags={\\\"numeric\\\"}),\\n\",\n    \"        ColumnSchema(logical_type=Datetime),\\n\",\n    \"    ]\\n\",\n    \"    return_type = ColumnSchema(semantic_tags={\\\"numeric\\\"})\\n\",\n    \"\\n\",\n    \"    def get_function(self):\\n\",\n    \"        def mean_sunday(numeric, datetime):\\n\",\n    \"            days = pd.DatetimeIndex(datetime).weekday.values\\n\",\n    \"            df = pd.DataFrame({\\\"numeric\\\": numeric, \\\"time\\\": days})\\n\",\n    \"            return df[df[\\\"time\\\"] == 6][\\\"numeric\\\"].mean()\\n\",\n    \"\\n\",\n    \"        return mean_sunday\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_matrix, features = ft.dfs(\\n\",\n    \"    entityset=es,\\n\",\n    \"    target_dataframe_name=\\\"sessions\\\",\\n\",\n    \"    agg_primitives=[MeanSunday],\\n\",\n    \"    trans_primitives=[],\\n\",\n    \"    max_depth=1,\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"feature_matrix[\\n\",\n    \"    [\\n\",\n    \"        \\\"MEAN_SUNDAY(log.value, datetime)\\\",\\n\",\n    \"        \\\"MEAN_SUNDAY(log.value_2, datetime)\\\",\\n\",\n    \"    ]\\n\",\n    \"]\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"celltoolbar\": \"Raw Cell Format\",\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": 
{\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.9.2\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 4\n}\n"
  },
  {
    "path": "docs/source/getting_started/using_entitysets.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Representing Data with EntitySets\\n\",\n    \"\\n\",\n    \"An ``EntitySet`` is a collection of dataframes and the relationships between them. They are useful for preparing raw, structured datasets for feature engineering. While many functions in Featuretools  take ``dataframes`` and ``relationships`` as separate arguments, it is recommended to create an ``EntitySet``, so you can more easily manipulate your data as needed.\\n\",\n    \"\\n\",\n    \"## The Raw Data\\n\",\n    \"\\n\",\n    \"Below we have two tables of data (represented as Pandas DataFrames) related to customer transactions. The first is a merge of transactions, sessions, and customers so that the result looks like something you might see in a log file:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import featuretools as ft\\n\",\n    \"\\n\",\n    \"data = ft.demo.load_mock_customer()\\n\",\n    \"transactions_df = data[\\\"transactions\\\"].merge(data[\\\"sessions\\\"]).merge(data[\\\"customers\\\"])\\n\",\n    \"\\n\",\n    \"transactions_df.sample(10)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"And the second dataframe is a list of products involved in those transactions.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"products_df = data[\\\"products\\\"]\\n\",\n    \"products_df\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Creating an EntitySet\\n\",\n    \"\\n\",\n    \"First, we initialize an ``EntitySet``. 
If you'd like to give it a name, you can optionally provide an ``id`` to the constructor.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es = ft.EntitySet(id=\\\"customer_data\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Adding dataframes\\n\",\n    \"\\n\",\n    \"To get started, we add the transactions dataframe to the `EntitySet`. In the call to ``add_dataframe``, we specify three important parameters:\\n\",\n    \"\\n\",\n    \"* The ``index`` parameter specifies the column that uniquely identifies rows in the dataframe.\\n\",\n    \"* The ``time_index`` parameter tells Featuretools when the data was created.\\n\",\n    \"* The ``logical_types`` parameter indicates that \\\"product_id\\\" should be interpreted as a Categorical column, even though it is just an integer in the underlying data.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from woodwork.logical_types import Categorical, PostalCode\\n\",\n    \"\\n\",\n    \"es = es.add_dataframe(\\n\",\n    \"    dataframe_name=\\\"transactions\\\",\\n\",\n    \"    dataframe=transactions_df,\\n\",\n    \"    index=\\\"transaction_id\\\",\\n\",\n    \"    time_index=\\\"transaction_time\\\",\\n\",\n    \"    logical_types={\\n\",\n    \"        \\\"product_id\\\": Categorical,\\n\",\n    \"        \\\"zip_code\\\": PostalCode,\\n\",\n    \"    },\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"es\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"You can also use a setter on the ``EntitySet`` object to add dataframes\"\n   ]\n  },\n  {\n   \"cell_type\": \"raw\",\n   \"metadata\": {\n    \"raw_mimetype\": \"text/restructuredtext\"\n   },\n   \"source\": [\n    \".. 
currentmodule:: featuretools\\n\",\n    \"\\n\",\n    \"\\n\",\n    \".. note ::\\n\",\n    \"\\n\",\n    \"    You can also use a setter on the ``EntitySet`` object to add dataframes\\n\",\n    \"\\n\",\n    \"    ``es[\\\"transactions\\\"] = transactions_df``\\n\",\n    \"\\n\",\n    \"    Note that this will use the default implementation of `add_dataframe`, notably the following:\\n\",\n    \"\\n\",\n    \"    * if the DataFrame does not have `Woodwork <https://woodwork.alteryx.com/>`_ initialized, the first column will be the index column.\\n\",\n    \"    * if the DataFrame does not have Woodwork initialized, the logical types of all columns will be inferred by Woodwork.\\n\",\n    \"    * if control over the time index column and logical types is needed, Woodwork should be initialized before adding the dataframe.\\n\",\n    \"\\n\",\n    \".. note ::\\n\",\n    \"\\n\",\n    \"    You can also display your `EntitySet` structure graphically by calling :meth:`.EntitySet.plot`.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"This method associates each column in the dataframe with a [Woodwork](https://woodwork.alteryx.com/) logical type. Each logical type can have an associated standard semantic tag that helps define the column data type. If you don't specify the logical type for a column, it gets inferred based on the underlying data. The logical types and semantic tags are listed in the schema of the dataframe. 
For more information on working with logical types and semantic tags, take a look at the [Woodwork documentation](https://woodwork.alteryx.com/).\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es[\\\"transactions\\\"].ww.schema\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we can do the same thing with our products dataframe.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es = es.add_dataframe(\\n\",\n    \"    dataframe_name=\\\"products\\\", dataframe=products_df, index=\\\"product_id\\\"\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"es\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"With two dataframes in our `EntitySet`, we can add a relationship between them.\\n\",\n    \"\\n\",\n    \"## Adding a Relationship\\n\",\n    \"\\n\",\n    \"We want to relate these two dataframes by the columns called \\\"product_id\\\" in each dataframe. Each product has multiple transactions associated with it, so the products dataframe is called the **parent dataframe**, while the transactions dataframe is known as the **child dataframe**. When specifying relationships, we need four parameters: the parent dataframe name, the parent column name, the child dataframe name, and the child column name. 
Note that each relationship must denote a one-to-many relationship rather than a relationship which is one-to-one or many-to-many.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es = es.add_relationship(\\\"products\\\", \\\"product_id\\\", \\\"transactions\\\", \\\"product_id\\\")\\n\",\n    \"es\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we see the relationship has been added to our `EntitySet`.\\n\",\n    \"\\n\",\n    \"## Creating a dataframe from an existing table\\n\",\n    \"\\n\",\n    \"When working with raw data, it is common to have sufficient information to justify the creation of new dataframes. In order to create a new dataframe and relationship for sessions, we \\\"normalize\\\" the transaction dataframe.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es = es.normalize_dataframe(\\n\",\n    \"    base_dataframe_name=\\\"transactions\\\",\\n\",\n    \"    new_dataframe_name=\\\"sessions\\\",\\n\",\n    \"    index=\\\"session_id\\\",\\n\",\n    \"    make_time_index=\\\"session_start\\\",\\n\",\n    \"    additional_columns=[\\n\",\n    \"        \\\"device\\\",\\n\",\n    \"        \\\"customer_id\\\",\\n\",\n    \"        \\\"zip_code\\\",\\n\",\n    \"        \\\"session_start\\\",\\n\",\n    \"        \\\"join_date\\\",\\n\",\n    \"    ],\\n\",\n    \")\\n\",\n    \"es\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Looking at the output above, we see this method did two operations:\\n\",\n    \"\\n\",\n    \"1. It created a new dataframe called \\\"sessions\\\" based on the \\\"session_id\\\" and \\\"session_start\\\" columns in \\\"transactions\\\"\\n\",\n    \"2. 
It added a relationship connecting \\\"transactions\\\" and \\\"sessions\\\"\\n\",\n    \"\\n\",\n    \"If we look at the schema from the transactions dataframe and the new sessions dataframe, we see two more operations that were performed automatically:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es[\\\"transactions\\\"].ww.schema\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es[\\\"sessions\\\"].ww.schema\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"1. It removed \\\"device\\\", \\\"customer_id\\\", \\\"zip_code\\\" and \\\"join_date\\\" from \\\"transactions\\\" and created new columns in the sessions dataframe. This reduces redundant information, as those properties of a session don't change between transactions.\\n\",\n    \"2. It copied \\\"session_start\\\" into the new sessions dataframe and marked it as a time index column to indicate the beginning of a session. If the base dataframe has a time index and ``make_time_index`` is not set, ``normalize_dataframe`` will create a time index for the new dataframe. In this case it would create a new time index called \\\"first_transactions_time\\\" using the time of the first transaction of each session. 
If we don't want this time index to be created, we can set ``make_time_index=False``.\\n\",\n    \"\\n\",\n    \"If we look at the dataframes, we can see what ``normalize_dataframe`` did to the actual data.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es[\\\"sessions\\\"].head(5)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es[\\\"transactions\\\"].head(5)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"To finish preparing this dataset, create a \\\"customers\\\" dataframe using the same method call.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es = es.normalize_dataframe(\\n\",\n    \"    base_dataframe_name=\\\"sessions\\\",\\n\",\n    \"    new_dataframe_name=\\\"customers\\\",\\n\",\n    \"    index=\\\"customer_id\\\",\\n\",\n    \"    make_time_index=\\\"join_date\\\",\\n\",\n    \"    additional_columns=[\\\"zip_code\\\", \\\"join_date\\\"],\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"es\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Using the EntitySet\\n\",\n    \"\\n\",\n    \"Finally, we are ready to use this EntitySet with any functionality within Featuretools. 
For example, let's build a feature matrix for each product in our dataset.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name=\\\"products\\\")\\n\",\n    \"\\n\",\n    \"feature_matrix\"\n   ]\n  },\n  {\n   \"cell_type\": \"raw\",\n   \"metadata\": {\n    \"raw_mimetype\": \"text/restructuredtext\",\n    \"vscode\": {\n     \"languageId\": \"raw\"\n    }\n   },\n   \"source\": [\n    \"As we can see, the features from DFS use the relational structure of our `EntitySet`. Therefore it is important to think carefully about the dataframes that we create.\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"celltoolbar\": \"Raw Cell Format\",\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.9.2\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 4\n}\n"
  },
  {
    "path": "docs/source/getting_started/woodwork_types.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"b95b28c1\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Woodwork Typing in Featuretools\\n\",\n    \"\\n\",\n    \"Featuretools relies on having consistent typing across the creation of EntitySets, Primitives, Features, and feature matrices. Previously, Featuretools used its own type system that contained objects called Variables. Now and moving forward, Featuretools will use an external data typing library for its typing: [Woodwork](https://woodwork.alteryx.com/en/stable/index.html).\\n\",\n    \"\\n\",\n    \"Understanding the Woodwork types that exist and how Featuretools uses Woodwork's type system will allow users to:\\n\",\n    \"    - build EntitySets that best represent their data\\n\",\n    \"    - understand the possible input and return types for Featuretools' Primitives\\n\",\n    \"    - understand what features will get generated from a given set of data and primitives.\\n\",\n    \"\\n\",\n    \"Read the [Understanding Woodwork Logical Types and Semantic Tags](https://woodwork.alteryx.com/en/stable/guides/logical_types_and_semantic_tags.html) guide for an in-depth walkthrough of the available Woodwork types that are outlined below.\\n\",\n    \"\\n\",\n    \"For users that are familiar with the old `Variable` objects, the [Transitioning to Featuretools Version 1.0](../resources/transition_to_ft_v1.0.ipynb) guide will be useful for converting Variable types to Woodwork types.\\n\",\n    \"\\n\",\n    \"## Physical Types \\n\",\n    \"Physical types define how the data in a Woodwork DataFrame is stored on disk or in memory. You might also see the physical type for a column referred to as the column’s `dtype`.\\n\",\n    \"\\n\",\n    \"Knowing a Woodwork DataFrame's physical types is important because Pandas relies on these types when performing DataFrame operations. 
Each Woodwork `LogicalType` class has a single physical type associated with it.\\n\",\n    \"\\n\",\n    \"## Logical Types\\n\",\n    \"Logical types add additional information about how data should be interpreted or parsed beyond what can be contained in a physical type. In fact, multiple logical types have the same physical type, each imparting a different meaning that's not contained in the physical type alone.\\n\",\n    \"\\n\",\n    \"In Featuretools, a column's logical type informs how data is read into an EntitySet and how it gets used down the line in Deep Feature Synthesis.\\n\",\n    \"\\n\",\n    \"Woodwork provides many different logical types, which can be seen with the `list_logical_types` function.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"497712b0\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import featuretools as ft\\n\",\n    \"\\n\",\n    \"ft.list_logical_types()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"cfe99d0f\",\n   \"metadata\": {},\n   \"source\": [\n    \"Featuretools will perform type inference to assign logical types to the data in EntitySets if none are provided, but it is also possible to specify which logical types should be set for any column (provided that the data in that column is compatible with the logical type).\\n\",\n    \"\\n\",\n    \"To learn more about how logical types are used in EntitySets, see the [Creating EntitySets](using_entitysets.ipynb) guide.\\n\",\n    \"\\n\",\n    \"To learn more about setting logical types directly on a DataFrame, see the Woodwork guide on [working with Logical Types](https://woodwork.alteryx.com/en/stable/guides/working_with_types_and_tags.html#Working-with-Logical-Types). \\n\",\n    \"\\n\",\n    \"## Semantic Tags\\n\",\n    \"Semantic tags provide additional information to columns about the meaning or potential uses of data. Columns can have many or no semantic tags. 
Some tags are added by Woodwork, some are added by Featuretools, and users can add additional tags as they see fit.\\n\",\n    \"\\n\",\n    \"To learn more about setting semantic tags directly on a DataFrame, see the Woodwork guide on [working with Semantic Tags](https://woodwork.alteryx.com/en/stable/guides/working_with_types_and_tags.html#Working-with-Semantic-Tags). \\n\",\n    \"\\n\",\n    \"### Woodwork-defined Semantic Tags\\n\",\n    \"\\n\",\n    \"Woodwork will add certain semantic tags to columns at initialization. These can be standard tags that may be associated with different sets of logical types or index tags. There are also tags that users can add to confer a suggested meaning to columns in Woodwork.\\n\",\n    \"\\n\",\n    \"To get a list of these tags, you can use the `list_semantic_tags` function.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"11f25bd9\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"ft.list_semantic_tags()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"29222810\",\n   \"metadata\": {},\n   \"source\": [\n    \"Above we see the semantic tags that are defined within Woodwork. These tags inform how Featuretools is able to interpret data, an example of which can be seen in the `Age` primitive, which requires that the `date_of_birth` semantic tag be present on a column.\\n\",\n    \"\\n\",\n    \"The `date_of_birth` tag will not get automatically added by Woodwork, so in order for Featuretools to be able to use the `Age` primitive, the `date_of_birth` tag must be manually added to any columns to which it applies.\\n\",\n    \"\\n\",\n    \"### Featuretools-defined Semantic Tags\\n\",\n    \"\\n\",\n    \"Just like Woodwork specifies semantic tags internally, Featuretools also defines a few tags of its own that allow the full set of Features to be generated. 
These tags have specific meanings when they are present on a column.\\n\",\n    \"\\n\",\n    \"- `'last_time_index'` - added by Featuretools to the last time index column of a DataFrame. Indicates that this column has been created by Featuretools.\\n\",\n    \"- `'foreign_key'` - used to indicate that this column is the child column of a relationship, meaning that this column is related to a corresponding index column of another dataframe in the EntitySet.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"## Woodwork Throughout Featuretools\\n\",\n    \"\\n\",\n    \"Now that we've described the elements that make up Woodwork's type system, let's see them in action in Featuretools.\\n\",\n    \"\\n\",\n    \"### Woodwork in EntitySets\\n\",\n    \"For more information on building EntitySets using Woodwork, see the [EntitySet guide](using_entitysets.ipynb).\\n\",\n    \"\\n\",\n    \"Let's look at the Woodwork typing information as it's stored in a demo EntitySet of retail data:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"bd9c1ec9\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es = ft.demo.load_retail()\\n\",\n    \"es\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"267880c4\",\n   \"metadata\": {},\n   \"source\": [\n    \"Woodwork typing information is not stored in the EntitySet object, but rather is stored in the individual DataFrames that make up the EntitySet. 
To look at the Woodwork typing information, we first select a single DataFrame from the EntitySet, and then access the Woodwork information via the `ww` namespace:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"aa1966fd\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df = es[\\\"products\\\"]\\n\",\n    \"df.head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"164b1138\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df.ww\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"4bffac54\",\n   \"metadata\": {},\n   \"source\": [\n    \"Notice how the three columns showing this DataFrame's typing information are the three elements of typing information outlined at the beginning of this guide. To reiterate: By defining physical types, logical types, and semantic tags for each column in a DataFrame, we've defined a DataFrame's Woodwork schema, and with it, we can gain an understanding of the contents of each column.\\n\",\n    \"\\n\",\n    \"This column-specific typing information that exists for every column in every DataFrame in an EntitySet is an integral part of Deep Feature Synthesis' ability to generate features for an EntitySet.\\n\",\n    \"\\n\",\n    \"### Woodwork in DFS\\n\",\n    \"As the units of computation in Featuretools, Primitives need to be able to specify the input types that they allow as well as have a predictable return type. For an in-depth explanation of Primitives in Featuretools, see the [Feature Primitives](primitives.ipynb) guide. 
Here, we'll look at how the Woodwork types come together into a `ColumnSchema` object to describe Primitive input and return types.\\n\",\n    \"\\n\",\n    \"Below is a Woodwork `ColumnSchema` that we've obtained from the `'product_id'` column in the `products` DataFrame in the retail EntitySet.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"349e5274\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"products_df = es[\\\"products\\\"]\\n\",\n    \"product_ids_series = products_df.ww[\\\"product_id\\\"]\\n\",\n    \"column_schema = product_ids_series.ww.schema\\n\",\n    \"column_schema\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"8e8c0ccf\",\n   \"metadata\": {},\n   \"source\": [\n    \"This combination of logical type and semantic tag typing information is a `ColumnSchema`. In the case above, the `ColumnSchema` describes the **type definition** for a single column of data. \\n\",\n    \"\\n\",\n    \"Notice that there is no physical type in a `ColumnSchema`. This is because a `ColumnSchema` is a collection of Woodwork types that doesn't have any data tied to it and therefore has no physical representation. 
Because a `ColumnSchema` object is not tied to any data, it can also be used to describe a **type space** into which other columns may or may not fall.\\n\",\n    \"\\n\",\n    \"This flexibility of the `ColumnSchema` class allows `ColumnSchema` objects to be used both as type definitions for every column in an EntitySet as well as input and return type spaces for every Primitive in Featuretools.\\n\",\n    \"\\n\",\n    \"Let's look at a different column in a different DataFrame to see how this works:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"f3bb3ffe\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"order_products_df = es[\\\"order_products\\\"]\\n\",\n    \"order_products_df.head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"1aae3378\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"quantity_series = order_products_df.ww[\\\"quantity\\\"]\\n\",\n    \"column_schema = quantity_series.ww.schema\\n\",\n    \"column_schema\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"f067db9a\",\n   \"metadata\": {},\n   \"source\": [\n    \"The `ColumnSchema` above has been pulled from the `'quantity'` column in the `order_products` DataFrame in the retail EntitySet. This is a **type definition**. \\n\",\n    \"\\n\",\n    \"If we look at the Woodwork typing information for the `order_products` DataFrame, we can see that there are several columns that will have similar `ColumnSchema` type definitions. 
If we wanted to describe subsets of those columns, we could define several `ColumnSchema` **type spaces**\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"bc2bfae6\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es[\\\"order_products\\\"].ww\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"73257dcf\",\n   \"metadata\": {},\n   \"source\": [\n    \"Below are several `ColumnSchema`s that all would include our `quantity` column, but each of them describes a different type space. These `ColumnSchema`s get more restrictive as we go down:\\n\",\n    \"\\n\",\n    \"##### Entire DataFrame\\n\",\n    \"No restrictions have been placed; any column falls into this definition. This would include the whole DataFrame.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"f6614c98\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from woodwork.column_schema import ColumnSchema\\n\",\n    \"\\n\",\n    \"ColumnSchema()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"299fc7d2\",\n   \"metadata\": {},\n   \"source\": [\n    \"An example of a Primitive with this `ColumnSchema` as its input type is the `IsNull` transform primitive.\\n\",\n    \"\\n\",\n    \"##### By Semantic Tag\\n\",\n    \"Only columns with the `numeric` tag apply. This can include Double, Integer, and Age logical type columns as well. 
It will not include the `index` column which, despite containing integers, has had its standard tags replaced by the `'index'` tag.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"16c1a5a9\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"ColumnSchema(semantic_tags={\\\"numeric\\\"})\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"0932d05d\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df = es[\\\"order_products\\\"].ww.select(include=\\\"numeric\\\")\\n\",\n    \"df.ww\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"a5ec95c8\",\n   \"metadata\": {},\n   \"source\": [\n    \"An example of a Primitive with this `ColumnSchema` as its input type is the `Mean` aggregation primitive.\\n\",\n    \"\\n\",\n    \"##### By Logical Type\\n\",\n    \"Only columns with a logical type of `Integer` are included in this definition. This does not require the `numeric` tag, so an index column (which has its standard tags removed) would still apply.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"79bd3d4f\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from woodwork.logical_types import Integer\\n\",\n    \"\\n\",\n    \"ColumnSchema(logical_type=Integer)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"e905229e\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df = es[\\\"order_products\\\"].ww.select(include=\\\"Integer\\\")\\n\",\n    \"df.ww\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"2f752200\",\n   \"metadata\": {},\n   \"source\": [\n    \"##### By Logical Type and Semantic Tag\\n\",\n    \"The column must have logical type `Integer` and have the `numeric` semantic tag, excluding index columns.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   
\"id\": \"6da51b75\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"ColumnSchema(logical_type=Integer, semantic_tags={\\\"numeric\\\"})\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"a96d92f6\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df = es[\\\"order_products\\\"].ww.select(include=\\\"numeric\\\")\\n\",\n    \"df = df.ww.select(include=\\\"Integer\\\")\\n\",\n    \"df.ww\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"71e0359b\",\n   \"metadata\": {},\n   \"source\": [\n    \"In this way, a `ColumnSchema` can define a type space under which columns in a Woodwork DataFrame can fall. This is how Featuretools determines which columns in a DataFrame are valid for a Primitive in building Features during DFS.\\n\",\n    \"\\n\",\n    \"Each Primitive has `input_types` and a `return_type` that are described by a Woodwork `ColumnSchema`. Every DataFrame in an EntitySet has Woodwork initialized on it. This means that when an EntitySet is passed into DFS, Featuretools can select the relevant columns in the DataFrame that are valid for the Primitive's `input_types`. 
We then get a Feature that has a `column_schema` property that indicates what that Feature's typing definition is in a way that lets DFS stack features on top of one another.\\n\",\n    \"\\n\",\n    \"In this way, Featuretools is able to leverage the base unit of Woodwork typing information, the `ColumnSchema`, and use it in concert with an EntitySet of Woodwork DataFrames in order to build Features with Deep Feature Synthesis.\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.9.2\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 5\n}\n"
  },
  {
    "path": "docs/source/guides/advanced_custom_primitives.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Advanced Custom Primitives Guide\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import re\\n\",\n    \"\\n\",\n    \"import numpy as np\\n\",\n    \"from woodwork.column_schema import ColumnSchema\\n\",\n    \"from woodwork.logical_types import Datetime, NaturalLanguage\\n\",\n    \"\\n\",\n    \"import featuretools as ft\\n\",\n    \"from featuretools.primitives import TransformPrimitive\\n\",\n    \"from featuretools.tests.testing_utils import make_ecommerce_entityset\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Primitives with Additional Arguments\\n\",\n    \"\\n\",\n    \"Some features require more advanced calculations than others. Advanced features usually entail additional arguments to help output the desired value. With custom primitives, you can use primitive arguments to help you create advanced features.\\n\",\n    \"\\n\",\n    \"### String Count Example\\n\",\n    \"\\n\",\n    \"In this example, you will learn how to make custom primitives that take in additional arguments. You will create a primitive to count the number of times a specific string value occurs inside a text.\\n\",\n    \"\\n\",\n    \"First, derive a new transform primitive class using `TransformPrimitive` as a base. The primitive will take in a text column as the input and return a numeric column as the output, so set the input type to a Woodwork `ColumnSchema` with logical type `NaturalLanguage` and the return type to a Woodwork `ColumnSchema` with the semantic tag `'numeric'`. The specific string value is the additional argument, so define it as a *keyword* argument inside `__init__`. 
Then, override `get_function` to return a primitive function that will calculate the feature.\\n\",\n    \"\\n\",\n    \"Featuretools' primitives use Woodwork's `ColumnSchema` to control the input and return types of columns for the primitive. For more information about using the Woodwork typing system in Featuretools, see the [Woodwork Typing in Featuretools](../getting_started/woodwork_types.ipynb) guide.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"class StringCount(TransformPrimitive):\\n\",\n    \"    \\\"\\\"\\\"Count the number of times the string value occurs.\\\"\\\"\\\"\\n\",\n    \"\\n\",\n    \"    name = \\\"string_count\\\"\\n\",\n    \"    input_types = [ColumnSchema(logical_type=NaturalLanguage)]\\n\",\n    \"    return_type = ColumnSchema(semantic_tags={\\\"numeric\\\"})\\n\",\n    \"\\n\",\n    \"    def __init__(self, string=None):\\n\",\n    \"        self.string = string\\n\",\n    \"\\n\",\n    \"    def get_function(self):\\n\",\n    \"        def string_count(column):\\n\",\n    \"            assert self.string is not None, \\\"string to count needs to be defined\\\"\\n\",\n    \"            # this is a naive implementation used for clarity\\n\",\n    \"            counts = [text.lower().count(self.string) for text in column]\\n\",\n    \"            return counts\\n\",\n    \"\\n\",\n    \"        return string_count\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now you have a primitive that is reusable for different string values. For example, you can create features based on the number of times the word \\\"the\\\" appears in a text. Create an instance of the primitive where the string value is \\\"the\\\" and pass the primitive into DFS to generate the features. 
The feature name will automatically reflect the string value of the primitive.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es = make_ecommerce_entityset()\\n\",\n    \"\\n\",\n    \"feature_matrix, features = ft.dfs(\\n\",\n    \"    entityset=es,\\n\",\n    \"    target_dataframe_name=\\\"sessions\\\",\\n\",\n    \"    agg_primitives=[\\\"sum\\\", \\\"mean\\\", \\\"std\\\"],\\n\",\n    \"    trans_primitives=[StringCount(string=\\\"the\\\")],\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"feature_matrix[\\n\",\n    \"    [\\n\",\n    \"        \\\"STD(log.STRING_COUNT(comments, string=the))\\\",\\n\",\n    \"        \\\"SUM(log.STRING_COUNT(comments, string=the))\\\",\\n\",\n    \"        \\\"MEAN(log.STRING_COUNT(comments, string=the))\\\",\\n\",\n    \"    ]\\n\",\n    \"]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Features with Multiple Outputs\\n\",\n    \"\\n\",\n    \"Some calculations output more than a single value. With custom primitives, you can make the most of these calculations by creating a feature for each output value.\\n\",\n    \"\\n\",\n    \"### Case Count Example\\n\",\n    \"\\n\",\n    \"In this example, you will learn how to make custom primitives that output multiple features. You will create a primitive that outputs the count of upper case and lower case letters of a text.\\n\",\n    \"\\n\",\n    \"First, derive a new transform primitive class using `TransformPrimitive` as a base. The primitive will take in a text column as the input and return two numeric columns as the output, so set the input type to a Woodwork `ColumnSchema` with logical type `NaturalLanguage` and the return type to a Woodwork `ColumnSchema` with semantic tag `'numeric'`. Since this primitive returns two columns, also set `number_output_features` to two. 
Then, override `get_function` to return a primitive function that will calculate the feature and return a list of columns.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"class CaseCount(TransformPrimitive):\\n\",\n    \"    \\\"\\\"\\\"Return the count of upper case and lower case letters of a text.\\\"\\\"\\\"\\n\",\n    \"\\n\",\n    \"    name = \\\"case_count\\\"\\n\",\n    \"    input_types = [ColumnSchema(logical_type=NaturalLanguage)]\\n\",\n    \"    return_type = ColumnSchema(semantic_tags={\\\"numeric\\\"})\\n\",\n    \"    number_output_features = 2\\n\",\n    \"\\n\",\n    \"    def get_function(self):\\n\",\n    \"        def case_count(array):\\n\",\n    \"            # this is a naive implementation used for clarity\\n\",\n    \"            upper = np.array([len(re.findall(\\\"[A-Z]\\\", i)) for i in array])\\n\",\n    \"            lower = np.array([len(re.findall(\\\"[a-z]\\\", i)) for i in array])\\n\",\n    \"            return upper, lower\\n\",\n    \"\\n\",\n    \"        return case_count\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now you have a primitive that outputs two columns. One column contains the count for the upper case letters. The other column contains the count for the lower case letters. Pass the primitive into DFS to generate features. 
By default, the feature name will reflect the index of the output.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_matrix, features = ft.dfs(\\n\",\n    \"    entityset=es,\\n\",\n    \"    target_dataframe_name=\\\"sessions\\\",\\n\",\n    \"    agg_primitives=[],\\n\",\n    \"    trans_primitives=[CaseCount],\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"feature_matrix[\\n\",\n    \"    [\\n\",\n    \"        \\\"customers.CASE_COUNT(favorite_quote)[0]\\\",\\n\",\n    \"        \\\"customers.CASE_COUNT(favorite_quote)[1]\\\",\\n\",\n    \"    ]\\n\",\n    \"]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Custom Naming for Multiple Outputs\\n\",\n    \"\\n\",\n    \"When you create a primitive that outputs multiple features, you can also define custom naming for each of those features.\\n\",\n    \"\\n\",\n    \"### Hourly Sine and Cosine Example\\n\",\n    \"\\n\",\n    \"In this example, you will learn how to apply custom naming for multiple outputs. You will create a primitive that outputs the sine and cosine of the hour.\\n\",\n    \"\\n\",\n    \"First, derive a new transform primitive class using `TransformPrimitive` as a base. The primitive will take in the time index as the input and return two numeric columns as the output. Set the input type to a Woodwork `ColumnSchema` with a logical type of `Datetime` and the semantic tag `'time_index'`. Next, set the return type to a Woodwork `ColumnSchema` with semantic tag `'numeric'` and set `number_output_features` to two. Then, override `get_function` to return a primitive function that will calculate the feature and return a list of columns. 
Also, override `generate_names` to return a list of the feature names that you define.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"class HourlySineAndCosine(TransformPrimitive):\\n\",\n    \"    \\\"\\\"\\\"Returns the sine and cosine of the hour.\\\"\\\"\\\"\\n\",\n    \"\\n\",\n    \"    name = \\\"hourly_sine_and_cosine\\\"\\n\",\n    \"    input_types = [ColumnSchema(logical_type=Datetime, semantic_tags={\\\"time_index\\\"})]\\n\",\n    \"    return_type = ColumnSchema(semantic_tags={\\\"numeric\\\"})\\n\",\n    \"\\n\",\n    \"    number_output_features = 2\\n\",\n    \"\\n\",\n    \"    def get_function(self):\\n\",\n    \"        def hourly_sine_and_cosine(column):\\n\",\n    \"            sine = np.sin(column.dt.hour)\\n\",\n    \"            cosine = np.cos(column.dt.hour)\\n\",\n    \"            return sine, cosine\\n\",\n    \"\\n\",\n    \"        return hourly_sine_and_cosine\\n\",\n    \"\\n\",\n    \"    def generate_names(self, base_feature_names):\\n\",\n    \"        name = self.generate_name(base_feature_names)\\n\",\n    \"        return f\\\"{name}[sine]\\\", f\\\"{name}[cosine]\\\"\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now you have a primitive that outputs two columns. One column contains the sine of the hour. The other column contains the cosine of the hour. Pass the primitive into DFS to generate features. 
The feature name will reflect the custom naming you defined.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_matrix, features = ft.dfs(\\n\",\n    \"    entityset=es,\\n\",\n    \"    target_dataframe_name=\\\"log\\\",\\n\",\n    \"    agg_primitives=[],\\n\",\n    \"    trans_primitives=[HourlySineAndCosine],\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"feature_matrix.head()[\\n\",\n    \"    [\\n\",\n    \"        \\\"HOURLY_SINE_AND_COSINE(datetime)[sine]\\\",\\n\",\n    \"        \\\"HOURLY_SINE_AND_COSINE(datetime)[cosine]\\\",\\n\",\n    \"    ]\\n\",\n    \"]\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"celltoolbar\": \"Raw Cell Format\",\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.9.2\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 4\n}\n"
  },
  {
    "path": "docs/source/guides/deployment.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"92a0dab5\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Deployment\\n\",\n    \"\\n\",\n    \"Deployment of machine learning models requires repeating feature engineering steps on new data. In some cases, these steps need to be performed in near real-time. Featuretools has capabilities to ease the deployment of feature engineering.\\n\",\n    \"\\n\",\n    \"## Saving Features\\n\",\n    \"\\n\",\n    \"First, let's build some generate some training and test data in the same format. We use a random seed to generate different data for the test.\"\n   ]\n  },\n  {\n   \"cell_type\": \"raw\",\n   \"id\": \"129c8011\",\n   \"metadata\": {\n    \"raw_mimetype\": \"text/restructuredtext\"\n   },\n   \"source\": [\n    \".. note ::\\n\",\n    \"\\n\",\n    \"    Features saved in one version of Featuretools are not guaranteed to load in another. This means the features might need to be re-created after upgrading Featuretools.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"01c19e97\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import featuretools as ft\\n\",\n    \"\\n\",\n    \"es_train = ft.demo.load_mock_customer(return_entityset=True)\\n\",\n    \"es_test = ft.demo.load_mock_customer(return_entityset=True, random_seed=33)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"042f8c02\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now let's build some features definitions using DFS. 
Because we have categorical features, we also encode them with one-hot encoding based on the values in the training data.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"6bcc87a0\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_matrix, feature_defs = ft.dfs(\\n\",\n    \"    entityset=es_train, target_dataframe_name=\\\"customers\\\"\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"feature_matrix_enc, features_enc = ft.encode_features(feature_matrix, feature_defs)\\n\",\n    \"feature_matrix_enc\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"03ffe00a\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we can use [featuretools.save_features](../generated/featuretools.save_features.rst#featuretools.save_features) to save a list of features to a JSON file.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"79d4ff65\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"ft.save_features(features_enc, \\\"feature_definitions.json\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"67723f25\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Calculating Feature Matrix for New Data\\n\",\n    \"\\n\",\n    \"We can use [featuretools.load_features](../generated/featuretools.load_features.rst#featuretools.load_features) to read in a list of saved features to calculate for our new entity set.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"a8f728c0\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"saved_features = ft.load_features(\\\"feature_definitions.json\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"1624ea4d\",\n   \"metadata\": {},\n   \"source\": [\n    \"After we load the features back in, we can calculate the feature matrix.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": 
\"f37f61e0\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_matrix = ft.calculate_feature_matrix(saved_features, es_test)\\n\",\n    \"feature_matrix\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"c9f39b54\",\n   \"metadata\": {},\n   \"source\": [\n    \"As you can see above, we have the exact same features as before, but calculated using the test data.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"42a47ad9\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Exporting Feature Matrix\\n\",\n    \"\\n\",\n    \"### Save as csv\\n\",\n    \"\\n\",\n    \"The feature matrix is a pandas DataFrame that we can save to disk\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"570c69fa\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_matrix.to_csv(\\\"feature_matrix.csv\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"f0fc5342\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can also read it back in as follows:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"297db0a6\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import pandas as pd\\n\",\n    \"\\n\",\n    \"saved_fm = pd.read_csv(\\\"feature_matrix.csv\\\", index_col=\\\"customer_id\\\")\\n\",\n    \"saved_fm\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"1b84dc51\",\n   \"metadata\": {\n    \"nbsphinx\": \"hidden\"\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"import os\\n\",\n    \"\\n\",\n    \"os.remove(\\\"feature_definitions.json\\\")\\n\",\n    \"os.remove(\\\"feature_matrix.csv\\\")\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 
3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.9.2\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 5\n}\n"
  },
  {
    "path": "docs/source/guides/feature_descriptions.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"1557274d\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Generating Feature Descriptions\\n\",\n    \"\\n\",\n    \"As features become more complicated, their names can become harder to understand. Both the [describe_feature](https://featuretools.alteryx.com/en/latest/generated/featuretools.graph_feature.html) function and the [graph_feature](https://featuretools.alteryx.com/en/latest/generated/featuretools.describe_feature.html) function can help explain what a feature is and the steps Featuretools took to generate it. Additionally, the ``describe_feature`` function can be augmented by providing custom definitions and templates to improve the resulting descriptions. \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"cdb8b3eb\",\n   \"metadata\": {\n    \"nbsphinx\": \"hidden\"\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"import featuretools as ft\\n\",\n    \"\\n\",\n    \"es = ft.demo.load_mock_customer(return_entityset=True)\\n\",\n    \"\\n\",\n    \"feature_defs = ft.dfs(\\n\",\n    \"    entityset=es,\\n\",\n    \"    target_dataframe_name=\\\"customers\\\",\\n\",\n    \"    agg_primitives=[\\\"mean\\\", \\\"sum\\\", \\\"mode\\\", \\\"n_most_common\\\"],\\n\",\n    \"    trans_primitives=[\\\"month\\\", \\\"hour\\\"],\\n\",\n    \"    max_depth=2,\\n\",\n    \"    features_only=True,\\n\",\n    \")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"01f8209c\",\n   \"metadata\": {},\n   \"source\": [\n    \"By default, ``describe_feature`` uses the existing column and DataFrame names and the default primitive description templates to generate feature descriptions. 
\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"35b86722\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_defs[9]\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"e24bee8d\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"ft.describe_feature(feature_defs[9])\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"5402e848\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_defs[14]\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"ac22c09c\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"ft.describe_feature(feature_defs[14])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"ff9b7b35\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Improving Descriptions\\n\",\n    \"\\n\",\n    \"While the default descriptions can be helpful, they can also be further improved by providing custom definitions of columns and features, and by providing alternative templates for primitive descriptions. \\n\",\n    \"\\n\",\n    \"#### Feature Descriptions\\n\",\n    \"Custom feature definitions will get used in the description in place of the automatically generated description. This can be used to better explain what a `ColumnSchema` or feature is, or to provide descriptions that take advantage of a user's existing knowledge about the data or domain. 
\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"33b2f8e5\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_descriptions = {\\\"customers: join_date\\\": \\\"the date the customer joined\\\"}\\n\",\n    \"\\n\",\n    \"ft.describe_feature(feature_defs[9], feature_descriptions=feature_descriptions)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"218147f4\",\n   \"metadata\": {},\n   \"source\": [\n    \"For example, the above replaces the column name, ``\\\"join_date\\\"``, with a more descriptive definition of what that column represents in the dataset. Descriptions can also be set directly on a column in a DataFrame by going through the Woodwork typing information to access the ``description`` attribute present on each `ColumnSchema`:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"597e20a6\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"join_date_column_schema = es[\\\"customers\\\"].ww.columns[\\\"join_date\\\"]\\n\",\n    \"join_date_column_schema.description = \\\"the date the customer joined\\\"\\n\",\n    \"\\n\",\n    \"es[\\\"customers\\\"].ww.columns[\\\"join_date\\\"].description\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"6c013615\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature = ft.TransformFeature(es[\\\"customers\\\"].ww[\\\"join_date\\\"], ft.primitives.Hour)\\n\",\n    \"feature\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"03e828b4\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"ft.describe_feature(feature)\"\n   ]\n  },\n  {\n   \"cell_type\": \"raw\",\n   \"id\": \"689cbd98\",\n   \"metadata\": {},\n   \"source\": [\n    \".. 
note::\\n\",\n    \"\\n\",\n    \"    When setting a description on a column in a DataFrame as described above, be careful to avoid setting the description via ``df.ww[col_name].ww.description``. The use of ``df.ww[col_name]`` creates an entirely new Series object that is not related to the EntitySet from which feature descriptions are built. Therefore, setting the description in any way other than going through the ``columns`` attribute will not set the column's description in a way that will be propogated to the feature description. \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"10e779f5\",\n   \"metadata\": {},\n   \"source\": [\n    \"Descriptions must be set for a column in a DataFrame before the feature is created in order for descriptions to propagate. Note that if a description is both set directly on a column and passed to ``describe_feature`` with ``feature_descriptions``, the description in the `feature_descriptions` parameter will take presedence.\\n\",\n    \"\\n\",\n    \"Feature descriptions can also be provided for generated features.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"5d1f8667\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_descriptions = {\\n\",\n    \"    \\\"sessions: SUM(transactions.amount)\\\": \\\"the total transaction amount for a session\\\"\\n\",\n    \"}\\n\",\n    \"\\n\",\n    \"feature_defs[14]\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"b90b8e4e\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"ft.describe_feature(feature_defs[14], feature_descriptions=feature_descriptions)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"83217b19\",\n   \"metadata\": {},\n   \"source\": [\n    \"Here, we create and pass in a custom description of the intermediate feature ``SUM(transactions.amount)``. 
The description for ``MEAN(sessions.SUM(transactions.amount))``, which is built on top of ``SUM(transactions.amount)``, uses the custom description in place of the automatically generated one. Feature descriptions can be passed in as a dictionary that maps either the feature object itself or the unique feature name in the form ``\\\"[dataframe_name]: [feature_name]\\\"`` to the custom description, as shown above.\\n\",\n    \"\\n\",\n    \"#### Primitive Templates\\n\",\n    \"Primitive descriptions are generated using primitive templates. By default, these are defined using the ``description_template`` attribute on the primitive. Primitives without a template default to using the ``name`` attribute of the primitive if it is defined, or the class name if it is not. Primitive description templates are string templates that take input feature descriptions as the positional arguments. These can be overridden by mapping primitive instances or primitive names to custom templates and passing them into ``describe_feature`` through the ``primitive_templates`` argument. \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"50f1bfb8\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"primitive_templates = {\\\"sum\\\": \\\"the total of {}\\\"}\\n\",\n    \"\\n\",\n    \"feature_defs[6]\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"c1fb53a3\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"ft.describe_feature(feature_defs[6], primitive_templates=primitive_templates)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"9b9cceca\",\n   \"metadata\": {},\n   \"source\": [\n    \"In this example, we override the default template of ``'the sum of {}'`` with our custom template ``'the total of {}'``. 
The description uses our custom template instead of the default.\\n\",\n    \"\\n\",\n    \"Multi-output primitives can use a list of primitive description templates to differentiate between the generic multi-output feature description and the feature slice descriptions. The first primitive template is always the generic overall feature. If only one other template is provided, it is used as the template for all slices. The slice number converted to the \\\"nth\\\" form is available through the ``nth_slice`` keyword.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"15ed472c\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature = feature_defs[5]\\n\",\n    \"feature\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"54a5a6fd\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"primitive_templates = {\\n\",\n    \"    \\\"n_most_common\\\": [\\n\",\n    \"        \\\"the 3 most common elements of {}\\\",  # generic multi-output feature\\n\",\n    \"        \\\"the {nth_slice} most common element of {}\\\",\\n\",\n    \"    ]\\n\",\n    \"}  # template for each slice\\n\",\n    \"\\n\",\n    \"ft.describe_feature(feature, primitive_templates=primitive_templates)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"49aae7d2\",\n   \"metadata\": {},\n   \"source\": [\n    \"Notice how the multi-output feature uses the first template for its description. 
Each slice of this feature will use the second slice template:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"1bd3a3cf\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"ft.describe_feature(feature[0], primitive_templates=primitive_templates)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"607299ff\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"ft.describe_feature(feature[1], primitive_templates=primitive_templates)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"30f4235f\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"ft.describe_feature(feature[2], primitive_templates=primitive_templates)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"17953d54\",\n   \"metadata\": {},\n   \"source\": [\n    \"Alternatively, instead of supplying a single template for all slices, templates can be provided for each slice to further customize the output. 
Note that in this case, each slice must get its own template.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"bad05646\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"primitive_templates = {\\n\",\n    \"    \\\"n_most_common\\\": [\\n\",\n    \"        \\\"the 3 most common elements of {}\\\",\\n\",\n    \"        \\\"the most common element of {}\\\",\\n\",\n    \"        \\\"the second most common element of {}\\\",\\n\",\n    \"        \\\"the third most common element of {}\\\",\\n\",\n    \"    ]\\n\",\n    \"}\\n\",\n    \"\\n\",\n    \"ft.describe_feature(feature, primitive_templates=primitive_templates)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"fdad1868\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"ft.describe_feature(feature[0], primitive_templates=primitive_templates)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"90a85bd0\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"ft.describe_feature(feature[1], primitive_templates=primitive_templates)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"b63d47a7\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"ft.describe_feature(feature[2], primitive_templates=primitive_templates)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"1942ea49\",\n   \"metadata\": {},\n   \"source\": [\n    \"Custom feature descriptions and primitive templates can also be separately defined in a JSON file and passed to the ``describe_feature`` function using the ``metadata_file`` keyword argument. 
Descriptions passed in directly through the ``feature_descriptions`` and ``primitive_templates`` keyword arguments will take precedence over any descriptions provided in the JSON metadata file.\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"celltoolbar\": \"Raw Cell Format\",\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.9.2\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 5\n}\n"
  },
  {
    "path": "docs/source/guides/feature_selection.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Feature Selection\\n\",\n    \"\\n\",\n    \"Featuretools provides users with the ability to remove features that are unlikely to be useful in building an effective machine learning model. Reducing the number of features in the feature matrix can both produce better results in the model as well as reduce the computational cost involved in prediction.\\n\",\n    \"\\n\",\n    \"Featuretools enables users to perform feature selection on the results of Deep Feature Synthesis with three functions:\\n\",\n    \"\\n\",\n    \"- `ft.selection.remove_highly_null_features`\\n\",\n    \"- `ft.selection.remove_single_value_features`\\n\",\n    \"- `ft.selection.remove_highly_correlated_features`\\n\",\n    \"\\n\",\n    \"We will describe each of these functions in depth, but first we must create an entity set with which we can run `ft.dfs`.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import pandas as pd\\n\",\n    \"\\n\",\n    \"import featuretools as ft\\n\",\n    \"from featuretools.demo.flight import load_flight\\n\",\n    \"from featuretools.selection import (\\n\",\n    \"    remove_highly_correlated_features,\\n\",\n    \"    remove_highly_null_features,\\n\",\n    \"    remove_single_value_features,\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"es = load_flight(nrows=50)\\n\",\n    \"es\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Remove Highly Null Features\\n\",\n    \"\\n\",\n    \"We might have a dataset with columns that have many null values. Deep Feature Synthesis might build features off of those null columns, creating even more highly null features. In this case, we might want to remove any features whose null values pass a certain threshold. 
Below is our feature matrix with such a case:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"fm, features = ft.dfs(\\n\",\n    \"    entityset=es,\\n\",\n    \"    target_dataframe_name=\\\"trip_logs\\\",\\n\",\n    \"    cutoff_time=pd.DataFrame(\\n\",\n    \"        {\\n\",\n    \"            \\\"trip_log_id\\\": [30, 1, 2, 3, 4],\\n\",\n    \"            \\\"time\\\": pd.to_datetime([\\\"2016-09-22 00:00:00\\\"] * 5),\\n\",\n    \"        }\\n\",\n    \"    ),\\n\",\n    \"    trans_primitives=[],\\n\",\n    \"    agg_primitives=[],\\n\",\n    \"    max_depth=2,\\n\",\n    \")\\n\",\n    \"fm\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We look at the above feature matrix and decide to remove the highly null features:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"ft.selection.remove_highly_null_features(fm)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Notice that calling `remove_highly_null_features` didn't remove every feature that contains a null value. By default, we only remove features where the percentage of null values in the calculated feature matrix is above 95%. If we want to lower that threshold, we can set the `pct_null_threshold` parameter ourselves.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"remove_highly_null_features(fm, pct_null_threshold=0.2)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Remove Single Value Features\\n\",\n    \"\\n\",\n    \"Another situation we might run into is one where our calculated features don't have any variance. 
In those cases, we are likely to want to remove the uninteresting features. For that, we use `remove_single_value_features`.\\n\",\n    \"\\n\",\n    \"Let's see what happens when we remove the single value features of the feature matrix below.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"fm\"\n   ]\n  },\n  {\n   \"cell_type\": \"raw\",\n   \"metadata\": {\n    \"raw_mimetype\": \"text/restructuredtext\"\n   },\n   \"source\": [\n    \".. note ::\\n\",\n    \"    A list of feature definitions such as those created by `dfs` can be provided to the feature selection functions.\\n\",\n    \"    Doing this will change the outputs to include an updated list of feature definitions.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"new_fm, new_features = remove_single_value_features(fm, features=features)\\n\",\n    \"new_fm\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now that we have the features definitions for the updated feature matrix, we can see that the features that were removed are:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"set(features) - set(new_features)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"With the function used as it is above, null values are not considered when counting a feature's unique values. 
If we'd like to consider `NaN` its own value, we can set `count_nan_as_value` to `True` and we'll see `flights.carrier` and `flights.flight_num` back in the matrix.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"new_fm, new_features = remove_single_value_features(\\n\",\n    \"    fm, features=features, count_nan_as_value=True\\n\",\n    \")\\n\",\n    \"new_fm\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The features that were removed are:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"set(features) - set(new_features)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Remove Highly Correlated Features\\n\",\n    \"\\n\",\n    \"The last feature selection function we have allows us to remove features that would likely be redundant to the model we're attempting to build by considering the correlation between pairs of calculated features.\\n\",\n    \"\\n\",\n    \"When two features are determined to be highly correlated, we remove the more complex of the two. For example, say we have two features: `col` and `-(col)`.\\n\",\n    \"\\n\",\n    \"We can see that `-(col)` is just the negation of `col`, and so we can guess those features are going to be highly correlated. `-(col)` has the `Negate` primitive applied to it, so it is more complex than the identity feature `col`. Therefore, if we only want one of `col` and `-(col)`, we should keep the identity feature. For features that don't have an obvious difference in complexity, we discard the feature that comes later in the feature matrix. 
\\n\",\n    \"\\n\",\n    \"Let's try this out on our data:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"fm, features = ft.dfs(\\n\",\n    \"    entityset=es,\\n\",\n    \"    target_dataframe_name=\\\"trip_logs\\\",\\n\",\n    \"    trans_primitives=[\\\"negate\\\"],\\n\",\n    \"    agg_primitives=[],\\n\",\n    \"    max_depth=3,\\n\",\n    \")\\n\",\n    \"fm.head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Note that we have some pretty clear correlations here between all the features and their negations.\\n\",\n    \"\\n\",\n    \"Now, using `remove_highly_correlated_features`, our default threshold for correlation is 95% correlated, and we get all of the obviously correlated features removed, leaving just the less complex features.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"new_fm, new_features = remove_highly_correlated_features(fm, features=features)\\n\",\n    \"new_fm.head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The features that were removed are:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"set(features) - set(new_features)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Change the correlation threshold\\n\",\n    \"\\n\",\n    \"We can lower the threshold at which to remove correlated features if we'd like to be more restrictive by using the `pct_corr_threshold` parameter.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"new_fm, new_features = 
remove_highly_correlated_features(\\n\",\n    \"    fm, features=features, pct_corr_threshold=0.9\\n\",\n    \")\\n\",\n    \"new_fm.head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The features that were removed are:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"set(features) - set(new_features)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Check a Subset of Features\\n\",\n    \"\\n\",\n    \"If we only want to check a subset of features, we can set `features_to_check` to the list of features whose correlation we'd like to check, and no features outside of that list will be removed.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"new_fm, new_features = remove_highly_correlated_features(\\n\",\n    \"    fm,\\n\",\n    \"    features=features,\\n\",\n    \"    features_to_check=[\\\"air_time\\\", \\\"distance\\\", \\\"flights.distance_group\\\"],\\n\",\n    \")\\n\",\n    \"new_fm.head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The features that were removed are:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"set(features) - set(new_features)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Protect Features from Removal\\n\",\n    \"\\n\",\n    \"To protect specific features from being removed from the feature matrix, we can include a list of `features_to_keep`, and these features will not be removed\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    
\"new_fm, new_features = remove_highly_correlated_features(\\n\",\n    \"    fm,\\n\",\n    \"    features=features,\\n\",\n    \"    features_to_keep=[\\\"air_time\\\", \\\"distance\\\", \\\"flights.distance_group\\\"],\\n\",\n    \")\\n\",\n    \"new_fm.head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The features that were removed are:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"set(features) - set(new_features)\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"celltoolbar\": \"Raw Cell Format\",\n  \"interpreter\": {\n   \"hash\": \"eadebc3a8a3dd54e52de25d3077ea0e41c7a462ff73c567da199d6de4c02ed7d\"\n  },\n  \"kernelspec\": {\n   \"display_name\": \"Python 3 (ipykernel)\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.9.2\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "docs/source/guides/guides_index.rst",
    "content": "Guides\n---------------\n\nGuides on more advanced Featuretools functionality\n\n.. toctree::\n   :maxdepth: 1\n\n   tuning_dfs\n   specifying_primitive_options\n   performance\n   deployment\n   advanced_custom_primitives\n   feature_descriptions\n   feature_selection\n   time_series\n   sql_database_integration\n"
  },
  {
    "path": "docs/source/guides/performance.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"raw\",\n   \"id\": \"2c5291f3\",\n   \"metadata\": {\n    \"raw_mimetype\": \"text/restructuredtext\"\n   },\n   \"source\": [\n    \".. _performance:\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"9dab133a\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Improving Computational Performance\\n\",\n    \"\\n\",\n    \"Feature engineering is a computationally expensive task. While Featuretools comes with reasonable default settings for feature calculation, there are a number of built-in approaches to improve computational performance based on dataset and problem specific considerations.\\n\",\n    \"\\n\",\n    \"## Reduce number of unique cutoff times\\n\",\n    \"Each row in a feature matrix created by Featuretools is calculated at a specific cutoff time that represents the last point in time that data from any dataframe in an entityset can be used to calculate the feature. As a result, calculations incur an overhead in finding the subset of allowed data for each distinct time in the calculation.\"\n   ]\n  },\n  {\n   \"cell_type\": \"raw\",\n   \"id\": \"6ab1a83a\",\n   \"metadata\": {\n    \"raw_mimetype\": \"text/restructuredtext\"\n   },\n   \"source\": [\n    \".. note::\\n\",\n    \"\\n\",\n    \"    Featuretools is very precise in how it deals with time. For more information, see :doc:`/getting_started/handling_time`.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"051fbaba\",\n   \"metadata\": {},\n   \"source\": [\n    \"If there are many unique cutoff times, it is often worthwhile to figure out how to have fewer. 
This can be done manually by figuring out which unique times are necessary for the prediction problem or automatically using [approximate](../getting_started/handling_time.ipynb#Approximating-Features-by-Rounding-Cutoff-Times).\\n\",\n    \"\\n\",\n    \"## Parallel Feature Computation\\n\",\n    \"\\n\",\n    \"Computational performance can often be improved by parallelizing the feature calculation process. There are several different approaches that can be used to perform parallel feature computation with Featuretools. An overview of the most commonly used approaches is provided below.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"b47e770f\",\n   \"metadata\": {},\n   \"source\": [\n    \"\\n\",\n    \"### Simple Parallel Feature Computation\\n\",\n    \"If using a pandas `EntitySet`, Featuretools can optionally compute features on multiple cores. The simplest way to control the amount of parallelism is to specify the `n_jobs` parameter:\\n\",\n    \"\\n\",\n    \"```python3\\n\",\n    \"fm = ft.calculate_feature_matrix(features=features,\\n\",\n    \"                                 entityset=entityset,\\n\",\n    \"                                 cutoff_time=cutoff_time,\\n\",\n    \"                                 n_jobs=2,\\n\",\n    \"                                 verbose=True)\\n\",\n    \"```\\n\",\n    \"The above command will start 2 processes to compute chunks of the feature matrix in parallel. Each process receives its own copy of the entityset, so memory use will be proportional to the number of parallel processes. Because the entityset has to be copied to each process, there is overhead to perform this operation before calculation can begin. To avoid this overhead on successive calls to `calculate_feature_matrix`, read the section below on using a persistent cluster.\\n\",\n    \"\\n\",\n    \"#### Adjust chunk size\\n\",\n    \"By default, Featuretools calculates rows with the same cutoff time simultaneously. 
The *chunk_size* parameter limits the maximum number of rows that will be grouped and then calculated together. If calculation is done using parallel processing, the default chunk size is set to be `1 / n_jobs` to ensure the computation can be spread across available workers. Normally, this behavior works well, but if there are only a few unique cutoff times it can lead to higher peak memory usage (due to more intermediate calculations stored in memory) or limited parallelism (if the number of chunks is less than *n_jobs*).\\n\",\n    \"\\n\",\n    \"By setting `chunk_size`, we can limit the maximum number of rows in each group to a specific number or a percentage of the overall data when calling `ft.dfs` or `ft.calculate_feature_matrix`:\\n\",\n    \"\\n\",\n    \"```python3\\n\",\n    \"# use maximum 100 rows per chunk\\n\",\n    \"feature_matrix, features_list = ft.dfs(entityset=es,\\n\",\n    \"                                       target_dataframe_name=\\\"customers\\\",\\n\",\n    \"                                       chunk_size=100)\\n\",\n    \"```\\n\",\n    \"\\n\",\n    \"We can also set chunk size to be a percentage of total rows:\\n\",\n    \"\\n\",\n    \"```python3\\n\",\n    \"# use maximum 5% of all rows per chunk\\n\",\n    \"feature_matrix, features_list = ft.dfs(entityset=es,\\n\",\n    \"                                       target_dataframe_name=\\\"customers\\\",\\n\",\n    \"                                       chunk_size=.05)\\n\",\n    \"```\\n\",\n    \"\\n\",\n    \"#### Using persistent cluster\\n\",\n    \"Behind the scenes, Featuretools uses [Dask's](http://dask.pydata.org/) distributed scheduler to implement multiprocessing. When you only specify the `n_jobs` parameter, a cluster will be created for that specific feature matrix calculation and destroyed once calculations have finished. A drawback of this is that each time a feature matrix is calculated, the entityset has to be transmitted to the workers again. 
To avoid this, we would like to reuse the same cluster between calls. The way to do this is by creating a cluster first and telling featuretools to use it with the `dask_kwargs` parameter:\\n\",\n    \"\\n\",\n    \"```python3\\n\",\n    \"import featuretools as ft\\n\",\n    \"from dask.distributed import LocalCluster\\n\",\n    \"\\n\",\n    \"cluster = LocalCluster()\\n\",\n    \"fm_1 = ft.calculate_feature_matrix(features=features_1,\\n\",\n    \"                                   entityset=entityset,\\n\",\n    \"                                   cutoff_time=cutoff_time,\\n\",\n    \"                                   dask_kwargs={'cluster': cluster},\\n\",\n    \"                                   verbose=True)\\n\",\n    \"```\\n\",\n    \"\\n\",\n    \"The 'cluster' value can either be the actual cluster object or a string of the address the cluster's scheduler can be reached at. The call below would also work. This second feature matrix calculation will not need to resend the entityset data to the workers because it has already been saved on the cluster.\\n\",\n    \"\\n\",\n    \"```python3\\n\",\n    \"fm_2 = ft.calculate_feature_matrix(features=features_2,\\n\",\n    \"                                   entityset=entityset,\\n\",\n    \"                                   cutoff_time=cutoff_time,\\n\",\n    \"                                   dask_kwargs={'cluster': cluster.scheduler.address},\\n\",\n    \"                                   verbose=True)\\n\",\n    \"```\"\n   ]\n  },\n  {\n   \"cell_type\": \"raw\",\n   \"id\": \"57aaa835\",\n   \"metadata\": {\n    \"raw_mimetype\": \"text/restructuredtext\"\n   },\n   \"source\": [\n    \".. note::\\n\",\n    \"\\n\",\n    \"    When using a persistent cluster, Featuretools publishes a copy of the ``EntitySet`` to the cluster the first time it calculates a feature matrix. Based on the ``EntitySet``'s metadata the cluster will reuse it for successive computations. 
This means if two ``EntitySets`` have the same metadata but different row values (e.g. new data is added to the ``EntitySet``), Featuretools won’t recopy the second ``EntitySet`` in later calls. A simple way to avoid this scenario is to use a unique ``EntitySet`` id.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"cdecad1d\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Using the distributed dashboard\\n\",\n    \"Dask.distributed has a web-based diagnostics dashboard that can be used to analyze the state of the workers and tasks. It can also be useful for tracking memory use or visualizing task run-times. An in-depth description of the web interface can be found [here](https://distributed.readthedocs.io/en/latest/web.html).\\n\",\n    \"\\n\",\n    \"![Distributed dashboard image](../_static/images/dashboard.png)\\n\",\n    \"\\n\",\n    \"The dashboard requires an additional Python package, bokeh, to work. Once bokeh is installed, the web interface will be launched by default when a LocalCluster is created. The cluster created by Featuretools when using `n_jobs` does not enable the web interface automatically. To do so, the port to launch the main web interface on must be specified in `dask_kwargs`:\\n\",\n    \"\\n\",\n    \"```python3\\n\",\n    \"fm = ft.calculate_feature_matrix(features=features,\\n\",\n    \"                                 entityset=entityset,\\n\",\n    \"                                 cutoff_time=cutoff_time,\\n\",\n    \"                                 n_jobs=2,\\n\",\n    \"                                 dask_kwargs={'diagnostics_port': 8787},\\n\",\n    \"                                 verbose=True)\\n\",\n    \"```\\n\",\n    \"\\n\",\n    \"### Parallel Computation by Partitioning Data\\n\",\n    \"As an alternative to Featuretools' parallelization, the data can be partitioned and the feature calculations run on multiple cores or a cluster using Dask or Apache Spark with PySpark. 
This approach may be necessary with a large pandas `EntitySet` because the current parallel implementation sends the entire `EntitySet` to each worker which may exhaust the worker memory. Dask and Spark allow Featuretools to scale to multiple cores on a single machine or multiple machines on a cluster.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"795cc323\",\n   \"metadata\": {},\n   \"source\": [\n    \"When an entire dataset is not required to calculate the features for a given set of instances, we can split the data into independent partitions and calculate on each partition. For example, imagine we are calculating features for customers and the features are \\\"number of other customers in this zip code\\\" or \\\"average age of other customers in this zip code\\\". In this case, we can load in data partitioned by zip code. As long as we have all of the data for a zip code when calculating, we can calculate all features for a subset of customers.\\n\",\n    \"\\n\",\n    \"An example of this approach can be seen in the [Predict Next Purchase demo notebook](https://github.com/featuretools/predict_next_purchase). In this example, we partition data by customer and only load a fixed number of customers into memory at any given time. We implement this easily using [Dask](https://dask.pydata.org/), which could also be used to scale the computation to a cluster of computers. A framework like [Spark](https://spark.apache.org/) could be used similarly.\\n\",\n    \"\\n\",\n    \"An additional example of partitioning data to distribute on multiple cores or a cluster using Dask can be seen in the [Featuretools on Dask notebook](https://github.com/Featuretools/Automated-Manual-Comparison/blob/main/Loan%20Repayment/notebooks/Featuretools%20on%20Dask.ipynb). 
This approach is detailed in the [Parallelizing Feature Engineering with Dask article](https://medium.com/feature-labs-engineering/scaling-featuretools-with-dask-ce46f9774c7d) on the Feature Labs engineering blog. Dask allows for simple scaling to multiple cores on a single computer or multiple machines on a cluster.\\n\",\n    \"\\n\",\n    \"For a similar partition and distribute implementation using Apache Spark with PySpark, refer to the [Feature Engineering on Spark notebook](https://github.com/Featuretools/predict-customer-churn/blob/main/churn/4.%20Feature%20Engineering%20on%20Spark.ipynb). This implementation shows how to carry out feature engineering on a cluster of EC2 instances using Spark as the distributed framework.\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"celltoolbar\": \"Raw Cell Format\",\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.9.2\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 5\n}\n"
  },
  {
    "path": "docs/source/guides/specifying_primitive_options.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"ba92172a\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Specifying Primitive Options\\n\",\n    \"\\n\",\n    \"By default, DFS will apply primitives across all dataframes and columns. This behavior can be altered through a few different parameters. Dataframes and columns can be optionally ignored or included for an entire DFS run or on a per-primitive basis, enabling greater control over features and less run time overhead.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"106d36a3\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import featuretools as ft\\n\",\n    \"from featuretools.tests.testing_utils import make_ecommerce_entityset\\n\",\n    \"\\n\",\n    \"es = make_ecommerce_entityset()\\n\",\n    \"\\n\",\n    \"features_list = ft.dfs(\\n\",\n    \"    entityset=es,\\n\",\n    \"    target_dataframe_name=\\\"customers\\\",\\n\",\n    \"    agg_primitives=[\\\"mode\\\"],\\n\",\n    \"    trans_primitives=[\\\"weekday\\\"],\\n\",\n    \"    features_only=True,\\n\",\n    \")\\n\",\n    \"features_list\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"29ae225d\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Specifying Options for an Entire Run\\n\",\n    \"\\n\",\n    \"The `ignore_dataframes` and `ignore_columns` parameters of DFS control dataframes and columns that should be ignored for all primitives. 
This is useful for ignoring columns or dataframes that don't relate to the problem or otherwise shouldn't be included in the DFS run.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"2d481527\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# ignore the 'log' and 'cohorts' dataframes entirely\\n\",\n    \"# ignore the 'birthday' column in 'customers' and the 'device_name' column in 'sessions'\\n\",\n    \"features_list = ft.dfs(\\n\",\n    \"    entityset=es,\\n\",\n    \"    target_dataframe_name=\\\"customers\\\",\\n\",\n    \"    agg_primitives=[\\\"mode\\\"],\\n\",\n    \"    trans_primitives=[\\\"weekday\\\"],\\n\",\n    \"    ignore_dataframes=[\\\"log\\\", \\\"cohorts\\\"],\\n\",\n    \"    ignore_columns={\\\"sessions\\\": [\\\"device_name\\\"], \\\"customers\\\": [\\\"birthday\\\"]},\\n\",\n    \"    features_only=True,\\n\",\n    \")\\n\",\n    \"features_list\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"4a9bd7e2\",\n   \"metadata\": {},\n   \"source\": [\n    \"DFS completely ignores the `log` and `cohorts` dataframes when creating features. It also ignores the columns `device_name` and `birthday` in `sessions` and `customers` respectively. However, both of these options can be overridden by individual primitive options in the `primitive_options` parameter.\\n\",\n    \"\\n\",\n    \"## Specifying for Individual Primitives\\n\",\n    \"Options for individual primitives or groups of primitives are set by the `primitive_options` parameter of DFS. This parameter maps any desired options to specific primitives. In the case of conflicting options, options set at this level will override options set at the entire DFS run level, and the include options will always take priority over their ignore counterparts.\\n\",\n    \"\\n\",\n    \"Using the string primitive name or the primitive type will apply the options to all primitives of the same name. 
You can also set options for a specific instance of a primitive by using the primitive instance as a key in the `primitive_options` dictionary. Note, however, that specifying options for a specific instance will result in that instance ignoring any options set for the generic primitive through options with the primitive name or class as the key. \\n\",\n    \"\\n\",\n    \"### Specifying Dataframes for Individual Primitives\\n\",\n    \"Which dataframes to include/ignore can also be specified for a single primitive or a group of primitives. Dataframes can be ignored using the `ignore_dataframes` option in `primitive_options`, while dataframes to explicitly include are set by the ``include_dataframes`` option. When ``include_dataframes`` is given, all dataframes not listed are ignored by the primitive. No columns from any excluded dataframe will be used to generate features with the given primitive.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"8bcbf11a\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# ignore the 'cohorts' and 'log' dataframes, but only for the primitive 'mode'\\n\",\n    \"# include only the 'customers' dataframe for the primitives 'weekday' and 'day'\\n\",\n    \"features_list = ft.dfs(\\n\",\n    \"    entityset=es,\\n\",\n    \"    target_dataframe_name=\\\"customers\\\",\\n\",\n    \"    agg_primitives=[\\\"mode\\\"],\\n\",\n    \"    trans_primitives=[\\\"weekday\\\", \\\"day\\\"],\\n\",\n    \"    primitive_options={\\n\",\n    \"        \\\"mode\\\": {\\\"ignore_dataframes\\\": [\\\"cohorts\\\", \\\"log\\\"]},\\n\",\n    \"        (\\\"weekday\\\", \\\"day\\\"): {\\\"include_dataframes\\\": [\\\"customers\\\"]},\\n\",\n    \"    },\\n\",\n    \"    features_only=True,\\n\",\n    \")\\n\",\n    \"features_list\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"b5cbbff0\",\n   \"metadata\": {},\n   \"source\": [\n    \"In this example, DFS would only use the 
`customers` dataframe for both `weekday` and `day`, and would use all dataframes except `cohorts` and `log` for `mode`.\\n\",\n    \"\\n\",\n    \"### Specifying Columns for Individual Primitives\\n\",\n    \"\\n\",\n    \"Specific columns can also be explicitly included/ignored for a primitive or group of primitives. Columns to\\n\",\n    \"ignore are set by the `ignore_columns` option, while columns to include are set by `include_columns`. When the\\n\",\n    \"`include_columns` option is set, no other columns from that dataframe will be used to make features with the given primitive.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"f9e42358\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"# Include the columns 'product_id', 'zipcode', 'device_type', and 'cancel_reason' for 'mode'\\n\",\n    \"# Ignore the columns 'signup_date' and 'cancel_date' for 'weekday'\\n\",\n    \"features_list = ft.dfs(\\n\",\n    \"    entityset=es,\\n\",\n    \"    target_dataframe_name=\\\"customers\\\",\\n\",\n    \"    agg_primitives=[\\\"mode\\\"],\\n\",\n    \"    trans_primitives=[\\\"weekday\\\"],\\n\",\n    \"    primitive_options={\\n\",\n    \"        \\\"mode\\\": {\\n\",\n    \"            \\\"include_columns\\\": {\\n\",\n    \"                \\\"log\\\": [\\\"product_id\\\", \\\"zipcode\\\"],\\n\",\n    \"                \\\"sessions\\\": [\\\"device_type\\\"],\\n\",\n    \"                \\\"customers\\\": [\\\"cancel_reason\\\"],\\n\",\n    \"            }\\n\",\n    \"        },\\n\",\n    \"        \\\"weekday\\\": {\\\"ignore_columns\\\": {\\\"customers\\\": [\\\"signup_date\\\", \\\"cancel_date\\\"]}},\\n\",\n    \"    },\\n\",\n    \"    features_only=True,\\n\",\n    \")\\n\",\n    \"features_list\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"88ea7094\",\n   \"metadata\": {},\n   \"source\": [\n    \"Here, `mode` will only use the columns `product_id` and `zipcode` from the 
dataframe `log`, `device_type`\\n\",\n    \"from the dataframe `sessions`, and `cancel_reason` from `customers`. For any other dataframe, `mode` will use all\\n\",\n    \"columns. The `weekday` primitive will use all columns in all dataframes except for `signup_date` and `cancel_date`\\n\",\n    \"from the `customers` dataframe.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"### Specifying GroupBy Options\\n\",\n    \"\\n\",\n    \"GroupBy Transform Primitives also have the additional options `include_groupby_dataframes`, `ignore_groupby_dataframes`, `include_groupby_columns`, and `ignore_groupby_columns`. These options are used to specify dataframes and columns to include/ignore as groupings for inputs. By default, DFS only groups by foreign key columns. Specifying `include_groupby_columns` overrides this default, and will only group by columns given. On the other hand, `ignore_groupby_columns` will continue to use only the foreign key columns, ignoring any columns specified that are also foreign key columns. Note that if including non-foreign key columns to group by, the included columns must be categorical columns. 
\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"1c1046b5\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"features_list = ft.dfs(\\n\",\n    \"    entityset=es,\\n\",\n    \"    target_dataframe_name=\\\"log\\\",\\n\",\n    \"    agg_primitives=[],\\n\",\n    \"    trans_primitives=[],\\n\",\n    \"    groupby_trans_primitives=[\\\"cum_sum\\\", \\\"cum_count\\\"],\\n\",\n    \"    primitive_options={\\n\",\n    \"        \\\"cum_sum\\\": {\\\"ignore_groupby_columns\\\": {\\\"log\\\": [\\\"product_id\\\"]}},\\n\",\n    \"        \\\"cum_count\\\": {\\n\",\n    \"            \\\"include_groupby_columns\\\": {\\\"log\\\": [\\\"product_id\\\", \\\"priority_level\\\"]},\\n\",\n    \"            \\\"ignore_groupby_dataframes\\\": [\\\"sessions\\\"],\\n\",\n    \"        },\\n\",\n    \"    },\\n\",\n    \"    features_only=True,\\n\",\n    \")\\n\",\n    \"features_list\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"10616725\",\n   \"metadata\": {},\n   \"source\": [\n    \"We ignore `product_id` as a groupby for `cum_sum` but still use any other foreign key columns in that or any other dataframe. For `cum_count`, we use only `product_id` and `priority_level` as groupbys. Note that `cum_sum` doesn't use\\n\",\n    \"`priority_level` because it's not a foreign key column, but we explicitly include it for `cum_count`. Finally, note that specifying groupby options doesn't affect what features the primitive is applied to. For example, `cum_count` ignores the dataframe `sessions` for groupbys, but the feature `<Feature: CUM_COUNT(sessions.device_name) by product_id>` is still made. The groupby is from the target dataframe `log`, so the feature is valid given the associated options. 
To ignore the `sessions` dataframe for `cum_count`,  the `ignore_dataframes` option for `cum_count` would need to include `sessions`.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"## Specifying for each Input for Multiple Input Primitives\\n\",\n    \"\\n\",\n    \"For primitives that take multiple columns as input, such as `Trend`, the above options can be specified for each input by passing them in as a list. If only one option dictionary is given, it is used for all inputs. The length of the list provided must match the number of inputs the primitive takes.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"2e808749\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"features_list = ft.dfs(\\n\",\n    \"    entityset=es,\\n\",\n    \"    target_dataframe_name=\\\"customers\\\",\\n\",\n    \"    agg_primitives=[\\\"trend\\\"],\\n\",\n    \"    trans_primitives=[],\\n\",\n    \"    primitive_options={\\n\",\n    \"        \\\"trend\\\": [\\n\",\n    \"            {\\\"ignore_columns\\\": {\\\"log\\\": [\\\"value_many_nans\\\"]}},\\n\",\n    \"            {\\\"include_columns\\\": {\\\"customers\\\": [\\\"signup_date\\\"], \\\"log\\\": [\\\"datetime\\\"]}},\\n\",\n    \"        ]\\n\",\n    \"    },\\n\",\n    \"    features_only=True,\\n\",\n    \")\\n\",\n    \"features_list\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"53d5d207\",\n   \"metadata\": {},\n   \"source\": [\n    \"Here, we pass in a list of primitive options for trend.  
We ignore the column `value_many_nans` for the first input\\n\",\n    \"to `trend`, and include the column `signup_date` from `customers` for the second input.\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.9.2\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 5\n}\n"
  },
  {
    "path": "docs/source/guides/sql_database_integration.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# SQL Database Integration \\n\",\n    \"\\n\",\n    \"`featuretools_sql` is an add-on library that supports automatic `EntitySet` creation from a relational database.\\n\",\n    \"\\n\",\n    \"Currently, `featuretools_sql` is compatible with the following systems:\\n\",\n    \"\\n\",\n    \"* `MySQL` \\n\",\n    \"* `PostgreSQL`\\n\",\n    \"* `Snowflake`\\n\",\n    \"\\n\",\n    \"The `DBConnector` object exposed by the `featuretools_sql` library provides the interface to connecting to the DBMS.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Installing featuretools_sql \\n\",\n    \"\\n\",\n    \"Install with pip\\n\",\n    \"\\n\",\n    \"```\\n\",\n    \"python -m pip install \\\"featuretools[sql]\\\" \\n\",\n    \"``` \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Connecting to your database instance \\n\",\n    \"\\n\",\n    \"Depending on your choice of DBMS, you may have to provide different pieces of information to the `DBConnector` object.\\n\",\n    \"\\n\",\n    \"If you want to connect to a `MySQL` instance, you must pass the string `\\\"mysql\\\"` into the `system_name` argument.\\n\",\n    \"\\n\",\n    \"If you want to connect to a `PostgreSQL` instance, you must pass the string `\\\"postgresql\\\"` into the `system_name` argument.\\n\",\n    \"\\n\",\n    \"If you want to connect to a `Snowflake` instance, you must pass the string `\\\"snowflake\\\"` into the `system_name` argument.\\n\",\n    \"\\n\",\n    \"Here is an example call to the constructor of the object, connecting to a `PostgreSQL` database:\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"```python \\n\",\n    \"from featuretools_sql.connector import DBConnector\\n\",\n    \"\\n\",\n    \"connector_object = 
DBConnector(\\n\",\n    \"    system_name=\\\"postgresql\\\",\\n\",\n    \"    user=\\\"postgres\\\",\\n\",\n    \"    host=\\\"localhost\\\",\\n\",\n    \"    port=\\\"5432\\\",\\n\",\n    \"    database=\\\"postgres\\\",\\n\",\n    \"    schema=\\\"public\\\",\\n\",\n    \")\\n\",\n    \"```\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Note that the choice of RDBMS does affect the required arguments -- for example, if you were connecting to a `MySQL` instance, you would not need a `schema` argument.  \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Converting to an EntitySet \\n\",\n    \"\\n\",\n    \"You can call the `get_entityset` method to instruct the `DBConnector` object to build an EntitySet. \\n\",\n    \"\\n\",\n    \"This method will loop through all the tables in the database and copy them into dataframes. Then it will populate the relationships data structure. It will finally pass those two arguments into the EntitySet constructor in Featuretools, and return the object.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"```python \\n\",\n    \"es = connector_object.get_entityset()\\n\",\n    \"``` \\n\",\n    \"\\n\",\n    \"Optionally, you can pass in table names to the `select_only` parameter if you only want to include a subset of the tables in the database. \\n\",\n    \"\\n\",\n    \"```python \\n\",\n    \"es = connector_object.get_entityset(select_only=[\\\"Products\\\", \\\"Transactions\\\"])\\n\",\n    \"```\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Examining the EntitySet's member data \\n\",\n    \"\\n\",\n    \"You can examine the member data of the `DBConnector` object to ensure that it imported data correctly.\\n\",\n    \"\\n\",\n    \"To access the dataframes it imported, access the `.dataframes` attribute. 
To access the relationships data structure, access the `.relationships` attribute.\\n\",\n    \"\\n\",\n    \"If you would like to visualize the EntitySet as a graph, you can call `es.plot()`. \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Calling DFS \\n\",\n    \"\\n\",\n    \"The EntitySet object is ready to be passed into Featuretools's `DFS` algorithm! Read more about `DFS` [here](https://featuretools.alteryx.com/en/stable/getting_started/afe.html#running-dfs). \"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3.8.12 64-bit ('venv_x86')\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.9.2\"\n  },\n  \"vscode\": {\n   \"interpreter\": {\n    \"hash\": \"3f6b062a214ec48d1657976024d6bc68979519d14a33afb6ad033fc2e4189514\"\n   }\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "docs/source/guides/time_series.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"17f894b5\",\n   \"metadata\": {\n    \"nbsphinx\": \"hidden\"\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"import warnings\\n\",\n    \"\\n\",\n    \"warnings.filterwarnings(\\\"ignore\\\")\\n\",\n    \"import pandas as pd\\n\",\n    \"\\n\",\n    \"import featuretools as ft\\n\",\n    \"from featuretools.demo.weather import load_weather\\n\",\n    \"from featuretools.primitives import Lag, RollingMean, RollingMin\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"a8104f18\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Feature Engineering for Time Series Problems\"\n   ]\n  },\n  {\n   \"cell_type\": \"raw\",\n   \"id\": \"9cd9cb82\",\n   \"metadata\": {\n    \"raw_mimetype\": \"text/restructuredtext\"\n   },\n   \"source\": [\n    \".. note::\\n\",\n    \"        This guide focuses on feature engineering for single-table time series problems; it does not cover how to handle temporal multi-table data for other machine learning problem types. A more general guide on handling time in Featuretools can be found `here <../getting_started/handling_time.ipynb>`_.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"0cf3cebc\",\n   \"metadata\": {},\n   \"source\": [\n    \"Time series forecasting consists of predicting future values of a target using earlier observations. In datasets that are used in time series problems, there is an inherent temporal ordering to the data (determined by a time index), and  the sequential target values we're predicting are highly dependent on one another. 
Feature engineering for time series problems exploits the fact that more recent observations are more predictive than more distant ones.\\n\",\n    \"\\n\",\n    \"This guide will explore how to use Featuretools for automating feature engineering for univariate time series problems, or problems in which only the time index and target column are included.\\n\",\n    \" \\n\",\n    \"We'll be working with a temperature demo EntitySet that contains one DataFrame, `temperatures`. The `temperatures` dataframe contains the minimum daily temperatures that we will be predicting. In total, it has three columns: `id`, `Temp`, and `Date`. The `id` column is the index that is necessary for Featuretools' purposes. The other two are important for univariate time series problems: `Date` is our time index, and `Temp` is our target column. The engineered features will be built from these two columns.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"862e46da\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es = load_weather()\\n\",\n    \"\\n\",\n    \"es[\\\"temperatures\\\"].head(10)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"90242e31\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es[\\\"temperatures\\\"][\\\"Temp\\\"].plot(ylabel=\\\"Temp (C)\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"060eb035\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Understanding The Feature Engineering Window\\n\",\n    \"\\n\",\n    \"In multi-table datasets, a feature engineering window for a single row in the target DataFrame extends forward in time over observations in child DataFrames starting at the time index and ending when either the cutoff time or last time index is reached. 
\\n\",\n    \"\\n\",\n    \"![Multi Table Timeline](../_static/images/multi_table_FE_timeline.png)\\n\",\n    \"\\n\",\n    \"In single-table time series datasets, the feature engineering window for a single value extends backwards in time within the same column. Because of this, the concepts of cutoff time and last time index are not relevant in the same way.\\n\",\n    \"\\n\",\n    \"For example: The cutoff time for a single-table time series dataset would create the training and test data split. During DFS, features would not be calculated after the cutoff time. This same behavior can often be achieved more simply by splitting the data prior to creating the EntitySet, since filtering the data at feature matrix calculation is more computationally intensive than splitting the data ahead of time.\\n\",\n    \"\\n\",\n    \"```\\n\",\n    \"split_point = int(df.shape[0]*.7)\\n\",\n    \"\\n\",\n    \"training_data = df[:split_point]\\n\",\n    \"test_data = df[split_point:]\\n\",\n    \"```\\n\",\n    \"\\n\",\n    \"So, since we can't use the existing parameters for defining each observation's feature engineering window, we'll need to define the new concepts of `gap` and `window_length`. These will allow us to set a feature engineering window that exists prior to each observation.\\n\",\n    \"\\n\",\n    \"## Gap and Window Length\\n\",\n    \"\\n\",\n    \"Note that we will be using integers when defining the gap and window length. This implies that our data occurs at evenly spaced intervals--in this case daily--so a number `n` corresponds to `n` days. 
Support for unevenly spaced intervals is ongoing and can be explored with the Woodwork method [df.ww.infer_temporal_frequencies](https://woodwork.alteryx.com/en/stable/generated/woodwork.table_accessor.WoodworkTableAccessor.infer_temporal_frequencies.html#woodwork.table_accessor.WoodworkTableAccessor.infer_temporal_frequencies).\\n\",\n    \"\\n\",\n    \"If we are at a point in time `t`, we have access to information from times less than `t` (past values), and we do not have information from times greater than `t` (future values). Our limitations in feature engineering, then, will come from when exactly before `t` we have access to the data. \\n\",\n    \"\\n\",\n    \"Consider an example where we're recording data that takes a week to ingest; the earliest data we have access to is from seven days ago, or `t - 7`. We'll call this our `gap`. A `gap` of 0 would include the instance itself, which we must be careful to avoid in time series problems, as this exposes our target.\\n\",\n    \"\\n\",\n    \"We also need to determine how far back in time before `t - 7` we can go. Too far back, and we may lose the potency of our recent observations, but too recent, and we may not capture the full spectrum of behaviors displayed by the data. In this example, let's say that we only want to look at 5 days worth of data at a time. We'll call this our `window_length`. \\n\",\n    \"\\n\",\n    \"![Time Series Timeline](../_static/images/time_series_FE_timeline.png)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"a90799f1\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"gap = 7\\n\",\n    \"window_length = 5\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"460b4c49\",\n   \"metadata\": {},\n   \"source\": [\n    \"With these two parameters (`gap` and `window_length`) set, we have defined our feature engineering window. 
Now, we can move on to defining our feature primitives.\\n\",\n    \"\\n\",\n    \"## Time Series Primitives\\n\",\n    \"\\n\",\n    \"There are three types of primitives we'll focus on for time series problems. One of them will extract features from the time index, and the other two types will extract features from our target column. \\n\",\n    \"\\n\",\n    \"### Datetime Transform Primitives\\n\",\n    \"\\n\",\n    \"We need a way of incorporating time in our time series features. Yes, using recent temperatures is incredibly predictive in determining future temperatures, but there is also a whole host of historical data suggesting that the month of the year is a pretty good indicator for the temperature outside. However, if we look at the data, we'll see that, though the day changes, the observations are always taken at the same hour, so the `Hour` primitive will not likely be useful. Of course, in a dataset that is measured at an hourly frequency or one more granular, `Hour` may be incredibly predictive. \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"65246092\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"datetime_primitives = [\\\"Day\\\", \\\"Year\\\", \\\"Weekday\\\", \\\"Month\\\"]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"95d8c86a\",\n   \"metadata\": {},\n   \"source\": [\n    \"The full list of datetime transform primitives can be seen [here](https://featuretools.alteryx.com/en/latest/api_reference.html#datetime-transform-primitives).\\n\",\n    \"\\n\",\n    \"### Delaying Primitives\\n\",\n    \"\\n\",\n    \"The simplest thing we can do with our target column is to build features that are delayed (or lagging) versions of the target column. We'll make one feature per observation in our feature engineering windows, so we'll range over time from `t - gap - window_length` to `t - gap`. 
\\n\",\n    \"\\n\",\n    \"For this purpose, we can use our `Lag` primitive and create one primitive for each instance in our window. \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"b9e1fa8f\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"delaying_primitives = [Lag(periods=i + gap) for i in range(window_length)]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"03cd4474\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Rolling Transform Primitives\\n\",\n    \"\\n\",\n    \"Since we have access to the entire feature engineering window, we can aggregate over that window. Featuretools has several rolling primitives with which we can achieve this. Here, we'll use the `RollingMean` and `RollingMin` primitives, setting the `gap` and `window_length` accordingly. Here, the gap is incredibly important, because when the gap is zero, it means the current observation's target value is present in the window, which exposes our target.\\n\",\n    \"\\n\",\n    \"This concern also exists for other primitives that reference earlier values in the dataframe. 
Because of this, when using primitives for time series feature engineering, one must be incredibly careful to not use primitives on the target column that incorporate the current observation when calculating a feature value.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"ed6cc722\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"rolling_mean_primitive = RollingMean(\\n\",\n    \"    window_length=window_length, gap=gap, min_periods=window_length\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"rolling_min_primitive = RollingMin(\\n\",\n    \"    window_length=window_length, gap=gap, min_periods=window_length\\n\",\n    \")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"1eb2a6e1\",\n   \"metadata\": {},\n   \"source\": [\n    \"The full list of rolling transform primitives can be seen [here](https://featuretools.alteryx.com/en/latest/api_reference.html#rolling-transform-primitives).\\n\",\n    \"\\n\",\n    \"## Run DFS\\n\",\n    \"\\n\",\n    \"Now that we've defined our time series primitives, we can pass them into DFS and get our feature matrix! \\n\",\n    \"\\n\",\n    \"Let's take a look at an actual feature engineering window as we defined with `gap` and `window_length` above. 
Below is an example of how we can extract many features using the same feature engineering window without exposing our target value.\\n\",\n    \"\\n\",\n    \"![FE Window](../_static/images/window_calculations.png)\\n\",\n    \"\\n\",\n    \"With the image above, we see how all of our defined primitives get used to create many features from just the two columns we have access to.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"42f52b73\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"fm, f = ft.dfs(\\n\",\n    \"    entityset=es,\\n\",\n    \"    target_dataframe_name=\\\"temperatures\\\",\\n\",\n    \"    trans_primitives=(\\n\",\n    \"        datetime_primitives\\n\",\n    \"        + delaying_primitives\\n\",\n    \"        + [rolling_mean_primitive, rolling_min_primitive]\\n\",\n    \"    ),\\n\",\n    \"    cutoff_time=pd.Timestamp(\\\"1987-1-30\\\"),\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"f\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"9e8ce29d\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"fm.iloc[:, [0, 2, 6, 7, 8, 9]].head(15)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"b984ff57\",\n   \"metadata\": {},\n   \"source\": [\n    \"Above is our time series feature matrix! The rolling and delayed features are built from our target column, but do not expose it. 
We can now use the feature matrix to create a machine learning model that predicts future minimum daily temperatures.\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"celltoolbar\": \"Raw Cell Format\",\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.9.2\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 5\n}\n"
  },
  {
    "path": "docs/source/guides/tuning_dfs.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"a4329c7d\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Tuning Deep Feature Synthesis\\n\",\n    \"\\n\",\n    \"There are several parameters that can be tuned to change the output of DFS. We'll explore these parameters using the following `transactions` EntitySet.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"12607fd8\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import featuretools as ft\\n\",\n    \"\\n\",\n    \"es = ft.demo.load_mock_customer(return_entityset=True)\\n\",\n    \"es\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"6ef15160\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Using \\\"Seed Features\\\"\\n\",\n    \"\\n\",\n    \"Seed features are manually defined and problem specific features that a user provides to DFS. Deep Feature Synthesis will then automatically stack new features on top of these features when it can.\\n\",\n    \"\\n\",\n    \"By using seed features, we can include domain specific knowledge in feature engineering automation. 
For the seed feature below, the domain knowledge may be that, for a specific retailer, a transaction above $125 would be considered an expensive purchase.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"b35f388e\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"expensive_purchase = ft.Feature(es[\\\"transactions\\\"].ww[\\\"amount\\\"]) > 125\\n\",\n    \"\\n\",\n    \"feature_matrix, feature_defs = ft.dfs(\\n\",\n    \"    entityset=es,\\n\",\n    \"    target_dataframe_name=\\\"customers\\\",\\n\",\n    \"    agg_primitives=[\\\"percent_true\\\"],\\n\",\n    \"    seed_features=[expensive_purchase],\\n\",\n    \")\\n\",\n    \"feature_matrix[[\\\"PERCENT_TRUE(transactions.amount > 125)\\\"]]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"8703d4b3\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can now see that the ``PERCENT_TRUE`` primitive was automatically applied to the boolean `expensive_purchase` feature from the `transactions` table. The feature produced as a result can be understood as the percentage of transactions for a customer that are considered expensive.\\n\",\n    \"\\n\",\n    \"## Add \\\"interesting\\\" values to columns\\n\",\n    \"\\n\",\n    \"Sometimes we want to create features that are conditioned on a second value before calculations are performed. We call this extra filter a \\\"where clause\\\". 
Where clauses are used in Deep Feature Synthesis by including primitives in the `where_primitives` parameter to DFS.\\n\",\n    \"\\n\",\n    \"By default, where clauses are built using the ``interesting_values`` of a column.\\n\",\n    \"\\n\",\n    \"Interesting values can be automatically determined and added for each DataFrame in a pandas EntitySet by calling `es.add_interesting_values()`.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"b6e88923\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"values_dict = {\\\"device\\\": [\\\"desktop\\\", \\\"mobile\\\", \\\"tablet\\\"]}\\n\",\n    \"es.add_interesting_values(dataframe_name=\\\"sessions\\\", values=values_dict)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"beee9073\",\n   \"metadata\": {},\n   \"source\": [\n    \"Interesting values are stored in the DataFrame's Woodwork typing information.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"c70ff02e\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es[\\\"sessions\\\"].ww.columns[\\\"device\\\"].metadata\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"ddec8e5a\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now that interesting values are set for the `device` column in the `sessions` table, we can specify the aggregation primitives for which we want where clauses using the ``where_primitives`` parameter to DFS.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"6eaabad8\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_matrix, feature_defs = ft.dfs(\\n\",\n    \"    entityset=es,\\n\",\n    \"    target_dataframe_name=\\\"customers\\\",\\n\",\n    \"    agg_primitives=[\\\"count\\\", \\\"avg_time_between\\\"],\\n\",\n    \"    where_primitives=[\\\"count\\\", \\\"avg_time_between\\\"],\\n\",\n    \"    
trans_primitives=[],\\n\",\n    \")\\n\",\n    \"feature_matrix\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"681a19db\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we have several new potentially useful features. Here are two of them that are built off of the where clause \\\"where the device used was a tablet\\\":\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"31a2a94e\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_matrix[\\n\",\n    \"    [\\n\",\n    \"        \\\"COUNT(sessions WHERE device = tablet)\\\",\\n\",\n    \"        \\\"AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet)\\\",\\n\",\n    \"    ]\\n\",\n    \"]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"7b43a4a5\",\n   \"metadata\": {},\n   \"source\": [\n    \"The first feature, `COUNT(sessions WHERE device = tablet)`, can be understood as indicating *how many sessions a customer completed on a tablet*.\\n\",\n    \"\\n\",\n    \"The second feature, `AVG_TIME_BETWEEN(sessions.session_start WHERE device = tablet)`, calculates *the average time between those sessions*.\\n\",\n    \"\\n\",\n    \"We can see that customers who had only 0 or 1 sessions on a tablet have ``NaN`` values for the average time between such sessions.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"## Encoding categorical features\\n\",\n    \"\\n\",\n    \"Machine learning algorithms typically expect all numeric data or data that has defined numeric representations, like boolean values corresponding to `0` and `1`. 
When Deep Feature Synthesis generates categorical features, we can encode them using Featuretools.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"a2ccb27b\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_matrix, feature_defs = ft.dfs(\\n\",\n    \"    entityset=es,\\n\",\n    \"    target_dataframe_name=\\\"customers\\\",\\n\",\n    \"    agg_primitives=[\\\"mode\\\"],\\n\",\n    \"    trans_primitives=[\\\"time_since\\\"],\\n\",\n    \"    max_depth=1,\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"feature_matrix\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"a50adb54\",\n   \"metadata\": {},\n   \"source\": [\n    \"This feature matrix contains 2 columns that are categorical in nature, ``zip_code`` and ``MODE(sessions.device)``. We can use the feature matrix and feature definitions to encode these categorical values into boolean values. Featuretools offers functionality to apply one-hot encoding to the output of DFS.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"088672ac\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_matrix_enc, features_enc = ft.encode_features(feature_matrix, feature_defs)\\n\",\n    \"feature_matrix_enc\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"54076098\",\n   \"metadata\": {},\n   \"source\": [\n    \"The returned feature matrix is now encoded in a way that is interpretable to machine learning algorithms. Notice how the columns that did not need encoding are still included. 
Additionally, we get a new set of feature definitions that contain the encoded values.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"db8dd84b\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"features_enc\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"b4bda3a2\",\n   \"metadata\": {},\n   \"source\": [\n    \"These features can be used to calculate the same encoded values on new data. For more information on feature engineering in production, read the [Deployment](deployment.ipynb) guide.\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3 (ipykernel)\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.9.2\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 5\n}\n"
  },
  {
    "path": "docs/source/index.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"raw\",\n   \"id\": \"25bd9564\",\n   \"metadata\": {\n    \"raw_mimetype\": \"text/restructuredtext\"\n   },\n   \"source\": [\n    \".. _quick-start:\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"4746904c\",\n   \"metadata\": {},\n   \"source\": [\n    \"# What is Featuretools?\\n\",\n    \"<img src=\\\"_static/images/featuretools_nav2.svg\\\" width=\\\"500\\\" align=\\\"center\\\" alt=\\\"Featuretools\\\">\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"**Featuretools** is a framework to perform automated feature engineering. It excels at transforming temporal and relational datasets into feature matrices for machine learning.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"## 5 Minute Quick Start\\n\",\n    \"\\n\",\n    \"Below is an example of using Deep Feature Synthesis (DFS) to perform automated feature engineering. In this example, we apply DFS to a multi-table dataset consisting of timestamped customer transactions.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"2ed1924f\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import featuretools as ft\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"3bc51d89\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Load Mock Data\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"be39a49a\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"data = ft.demo.load_mock_customer()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"eb2552f2\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Prepare data\\n\",\n    \"\\n\",\n    \"In this toy dataset, there are 3 DataFrames.\\n\",\n    \"\\n\",\n    \"- **customers**: unique customers who had sessions\\n\",\n    \"- **sessions**: unique sessions and associated attributes\\n\",\n    \"- **transactions**: list of events in this session\\n\"\n   ]\n  
},\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"9bb55d86\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"customers_df = data[\\\"customers\\\"]\\n\",\n    \"customers_df\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"2054eb2a\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"sessions_df = data[\\\"sessions\\\"]\\n\",\n    \"sessions_df.sample(5)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"348e7614\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"transactions_df = data[\\\"transactions\\\"]\\n\",\n    \"transactions_df.sample(5)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"59fc2126\",\n   \"metadata\": {},\n   \"source\": [\n    \"First, we specify a dictionary with all the DataFrames in our dataset. The DataFrames are passed in with their index column and time index column if one exists for the DataFrame.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"b3fdc96a\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"dataframes = {\\n\",\n    \"    \\\"customers\\\": (customers_df, \\\"customer_id\\\"),\\n\",\n    \"    \\\"sessions\\\": (sessions_df, \\\"session_id\\\", \\\"session_start\\\"),\\n\",\n    \"    \\\"transactions\\\": (transactions_df, \\\"transaction_id\\\", \\\"transaction_time\\\"),\\n\",\n    \"}\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"e0d84890\",\n   \"metadata\": {},\n   \"source\": [\n    \"Second, we specify how the DataFrames are related. When two DataFrames have a one-to-many relationship, we call the \\\"one\\\" DataFrame, the \\\"parent DataFrame\\\". 
A relationship between a parent and child is defined like this:\\n\",\n    \"    \\n\",\n    \"    (parent_dataframe, parent_column, child_dataframe, child_column)\\n\",\n    \"\\n\",\n    \"In this dataset we have two relationships\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"fc4366dc\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"relationships = [\\n\",\n    \"    (\\\"sessions\\\", \\\"session_id\\\", \\\"transactions\\\", \\\"session_id\\\"),\\n\",\n    \"    (\\\"customers\\\", \\\"customer_id\\\", \\\"sessions\\\", \\\"customer_id\\\"),\\n\",\n    \"]\"\n   ]\n  },\n  {\n   \"cell_type\": \"raw\",\n   \"id\": \"758f8fd4\",\n   \"metadata\": {\n    \"raw_mimetype\": \"text/restructuredtext\"\n   },\n   \"source\": [\n    \".. note::\\n\",\n    \"\\n\",\n    \"    To manage setting up DataFrames and relationships, we recommend using the :class:`EntitySet <featuretools.EntitySet>` class which offers convenient APIs for managing data like this. See :doc:`getting_started/using_entitysets` for more information.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"330d66b0\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Run Deep Feature Synthesis\\n\",\n    \"\\n\",\n    \"A minimal input to DFS is a dictionary of DataFrames, a list of relationships, and the name of the target DataFrame whose features we want to calculate. 
The output of DFS is a feature matrix and the corresponding list of feature definitions.\\n\",\n    \"\\n\",\n    \"Let's first create a feature matrix for each customer in the data.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"13cae382\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_matrix_customers, features_defs = ft.dfs(\\n\",\n    \"    dataframes=dataframes,\\n\",\n    \"    relationships=relationships,\\n\",\n    \"    target_dataframe_name=\\\"customers\\\",\\n\",\n    \")\\n\",\n    \"feature_matrix_customers\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"71628a1c\",\n   \"metadata\": {},\n   \"source\": [\n    \"We now have dozens of new features to describe a customer's behavior.\\n\",\n    \"\\n\",\n    \"#### Change target DataFrame\\n\",\n    \"One of the reasons DFS is so powerful is that it can create a feature matrix for *any* DataFrame in our EntitySet. For example, we can build features for sessions instead.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"4cfe1aca\",\n   \"metadata\": {\n    \"nbsphinx\": \"hidden\"\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"dataframes = {\\n\",\n    \"    \\\"customers\\\": (customers_df.copy(), \\\"customer_id\\\"),\\n\",\n    \"    \\\"sessions\\\": (sessions_df.copy(), \\\"session_id\\\", \\\"session_start\\\"),\\n\",\n    \"    \\\"transactions\\\": (transactions_df.copy(), \\\"transaction_id\\\", \\\"transaction_time\\\"),\\n\",\n    \"}\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"84fec203\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_matrix_sessions, features_defs = ft.dfs(\\n\",\n    \"    dataframes=dataframes, relationships=relationships, target_dataframe_name=\\\"sessions\\\"\\n\",\n    \")\\n\",\n    \"feature_matrix_sessions.head(5)\"\n   ]\n  },\n  {\n   \"cell_type\": 
\"raw\",\n   \"id\": \"a67d574e\",\n   \"metadata\": {\n    \"raw_mimetype\": \"text/restructuredtext\"\n   },\n   \"source\": [\n    \"Understanding Feature Output\\n\",\n    \"~~~~~~~~~~~~~~~~~~~~~~~~~~~~\\n\",\n    \"\\n\",\n    \"In general, Featuretools references generated features through the feature name. In order to make features easier to understand, Featuretools offers two additional tools, :func:`featuretools.graph_feature` and :func:`featuretools.describe_feature`, to help explain what a feature is and the steps Featuretools took to generate it. Let's look at this example feature:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"9c791dda\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature = features_defs[18]\\n\",\n    \"feature\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"84b5be0f\",\n   \"metadata\": {},\n   \"source\": [\n    \"##### Feature lineage graphs\\n\",\n    \"\\n\",\n    \"Feature lineage graphs visually walk through feature generation. Starting from the base data, they show step by step the primitives applied and intermediate features generated to create the final feature.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"0cd93f3d\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"ft.graph_feature(feature)\"\n   ]\n  },\n  {\n   \"cell_type\": \"raw\",\n   \"id\": \"d6e5e0a1\",\n   \"metadata\": {\n    \"raw_mimetype\": \"text/restructuredtext\"\n   },\n   \"source\": [\n    \".. graphviz:: getting_started/graphs/demo_feat.dot\\n\",\n    \"\\n\",\n    \"Feature descriptions\\n\",\n    \"\\\"\\\"\\\"\\\"\\\"\\\"\\\"\\\"\\\"\\\"\\\"\\\"\\\"\\\"\\\"\\\"\\\"\\\"\\\"\\\"\\n\",\n    \"\\n\",\n    \"Featuretools can also automatically generate English sentence descriptions of features. 
Feature descriptions help to explain what a feature is, and can be further improved by including manually defined custom definitions. See :doc:`/guides/feature_descriptions` for more details on how to customize automatically generated feature descriptions.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"3bdbe1c0\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"ft.describe_feature(feature)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"44635e1f\",\n   \"metadata\": {},\n   \"source\": [\n    \"## What's next?\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"* Learn about [Representing Data with EntitySets](getting_started/using_entitysets.ipynb)\\n\",\n    \"* Apply automated feature engineering with [Deep Feature Synthesis](getting_started/afe.ipynb)\\n\",\n    \"* Explore [runnable demos](https://www.featuretools.com/demos) based on real world use cases\\n\",\n    \"* Can't find what you're looking for? Ask for [help](resources/help.rst)\"\n   ]\n  },\n  {\n   \"cell_type\": \"raw\",\n   \"id\": \"cb2d443c\",\n   \"metadata\": {\n    \"raw_mimetype\": \"text/restructuredtext\"\n   },\n   \"source\": [\n    \"Table of contents\\n\",\n    \"-----------------\\n\",\n    \"\\n\",\n    \".. toctree::\\n\",\n    \"   :maxdepth: 1\\n\",\n    \"\\n\",\n    \"   install\\n\",\n    \"\\n\",\n    \".. toctree::\\n\",\n    \"   :maxdepth: 2\\n\",\n    \"\\n\",\n    \"   getting_started/getting_started_index\\n\",\n    \"   guides/guides_index\\n\",\n    \"\\n\",\n    \".. 
toctree::\\n\",\n    \"   :maxdepth: 1\\n\",\n    \"   :caption: Resources and References\\n\",\n    \"\\n\",\n    \"   resources/resources_index\\n\",\n    \"   api_reference\\n\",\n    \"   release_notes\\n\",\n    \"\\n\",\n    \"Other links\\n\",\n    \"------------\\n\",\n    \"* :ref:`genindex`\\n\",\n    \"* :ref:`search`\\n\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"celltoolbar\": \"Raw Cell Format\",\n  \"kernelspec\": {\n   \"display_name\": \"Python 3 (ipykernel)\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.9.2\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 5\n}\n"
  },
  {
    "path": "docs/source/install.md",
    "content": "# Install\n\nFeaturetools is available for Python 3.9 - 3.12. It can be installed from [pypi](https://pypi.org/project/featuretools/), [conda-forge](https://anaconda.org/conda-forge/featuretools), or from [source](https://github.com/alteryx/featuretools).\n\nTo install Featuretools, run the following command:\n\n````{tab} PyPI\n```console\n$ python -m pip install featuretools\n```\n````\n\n````{tab} Conda\n```console\n$ conda install -c conda-forge featuretools\n```\n````\n\n## Add-ons\n\nFeaturetools allows users to install add-ons individually or all at once:\n\n````{tab} PyPI\n```{tab} All Add-ons\n```console\n$ python -m pip install \"featuretools[complete]\"\n```\n```{tab} Dask\n```console\n$ python -m pip install \"featuretools[dask]\"\n```\n```{tab} NLP Primitives\n```console\n$ python -m pip install \"featuretools[nlp]\"\n```\n```{tab} Premium Primitives\n```console\n$ python -m pip install \"featuretools[premium]\"\n```\n\n````\n````{tab} Conda\n```{tab} All Add-ons\n```console\n$ conda install -c conda-forge nlp-primitives dask distributed\n```\n```{tab} NLP Primitives\n```console\n$ conda install -c conda-forge nlp-primitives\n```\n```{tab} Dask\n```console\n$ conda install -c conda-forge dask distributed\n```\n````\n\n- **NLP Primitives**: Use Natural Language Processing Primitives in Featuretools\n- **Premium Primitives**: Use primitives from Premium Primitives in Featuretools\n- **Dask**: Use to run `calculate_feature_matrix` in parallel with `n_jobs`\n\n## Installing Graphviz\n\nIn order to use `EntitySet.plot` or `featuretools.graph_feature` you will need to install the graphviz library.\n\n````{tab} macOS (Intel, M1)\n:new-set:\n```{tab} pip\n```console\n$ brew install graphviz\n$ python -m pip install graphviz\n```\n```{tab} conda\n```console\n$ brew install graphviz\n$ conda install -c conda-forge python-graphviz\n```\n````\n\n````{tab} Ubuntu\n```{tab} pip\n```console\n$ sudo apt install graphviz\n$ python -m pip install 
graphviz\n```\n```{tab} conda\n```console\n$ sudo apt install graphviz\n$ conda install -c conda-forge python-graphviz\n```\n````\n\n````{tab} Windows\n```{tab} pip\n```console\n$ python -m pip install graphviz\n```\n```{tab} conda\n```console\n$ conda install -c conda-forge python-graphviz\n```\n````\n\nIf you installed graphviz for **Windows** with `pip`, install graphviz.exe from the [official source](https://graphviz.org/download/#windows).\n\n## Source\n\nTo install Featuretools from source, clone the repository from [GitHub](https://github.com/alteryx/featuretools), and install the dependencies.\n\n```bash\ngit clone https://github.com/alteryx/featuretools.git\ncd featuretools\npython -m pip install .\n```\n\n## Docker\n\nIt is also possible to run Featuretools inside a Docker container.\nYou can do so by installing it as a package inside a container (following the normal install guide) or\ncreating a new image with Featuretools pre-installed, using the following commands in your `Dockerfile`:\n\n```dockerfile\nFROM --platform=linux/x86_64 python:3.9-slim-buster\nRUN apt update && apt -y update\nRUN apt install -y build-essential\nRUN pip3 install --upgrade --quiet pip\nRUN pip3 install featuretools\n```\n\n## Development\n\nTo make contributions to the codebase, please follow the guidelines [here](https://github.com/alteryx/featuretools/blob/main/contributing.md).\n"
  },
  {
    "path": "docs/source/release_notes.rst",
    "content": ".. _release_notes:\n\nRelease Notes\n-------------\n\nFuture Release\n==============\n    * Enhancements\n    * Fixes\n    * Changes\n        * Restrict numpy to <2.0.0 (:pr:`2743`)\n    * Documentation Changes\n        * Update API Docs to include previously missing primitives (:pr:`2737`)\n    * Testing Changes\n\n    Thanks to the following people for contributing to this release:\n    :user:`thehomebrewnerd`\n\nv1.31.0 May 14, 2024\n====================\n    * Enhancements\n        * Add support for Python 3.12 (:pr:`2713`)\n    * Fixes\n        * Move ``flatten_list`` util function into ``feature_discovery`` module to fix import bug (:pr:`2702`)\n    * Changes\n        * Temporarily restrict Dask version (:pr:`2694`)\n        * Remove support for creating ``EntitySets`` from Dask or Pyspark dataframes (:pr:`2705`)\n        * Bump minimum versions of ``tqdm`` and ``pip`` in requirements files (:pr:`2716`)\n        * Use ``filter`` arg in call to ``tarfile.extractall`` to safely deserialize EntitySets (:pr:`2722`)\n    * Testing Changes\n        * Fix serialization test to work with pytest 8.1.1 (:pr:`2694`)\n        * Update to allow minimum dependency checker to run properly (:pr:`2709`)\n        * Update pull request check CI action (:pr:`2720`)\n        * Update release notes updated check CI action (:pr:`2726`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`thehomebrewnerd`\n\nBreaking Changes\n++++++++++++++++\n* With this release of Featuretools, EntitySets can no longer be created from Dask or Pyspark dataframes. 
The behavior when using pandas\n  dataframes to create EntitySets remains unchanged.\n\n\nv1.30.0 Feb 26, 2024\n====================\n    * Changes\n        * Update min requirements for numpy, pandas and Woodwork (:pr:`2681`)\n        * Update release notes version for release (:pr:`2689`)\n    * Testing Changes\n        * Update ``make_ecommerce_entityset`` to work without Dask (:pr:`2677`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`tamargrey`, :user:`thehomebrewnerd`\n\nv1.29.0 Feb 16, 2024\n====================\n    .. warning::\n        This release of Featuretools will not support Python 3.8\n\n    * Fixes\n        * Fix dependency issues (:pr:`2644`, :pr:`2656`)\n        * Add workaround for pandas 2.2.0 bug with nunique and unpin pandas deps (:pr:`2657`)\n    * Changes\n        * Fix deprecation warnings with is_categorical_dtype (:pr:`2641`)\n        * Remove woodwork, pyarrow, numpy, and pandas pins for spark installation (:pr:`2661`)\n    * Documentation Changes\n        * Update Featuretools logo to display properly in dark mode (:pr:`2632`)\n        * Remove references to premium primitives while release isn't possible (:pr:`2674`)\n    * Testing Changes\n        * Update tests for compatibility with new versions of ``holidays`` (:pr:`2636`)\n        * Update ruff to 0.1.6 and use ruff linter/formatter (:pr:`2639`)\n        * Update ``release.yaml`` to use trusted publisher for PyPI releases (:pr:`2646`, :pr:`2653`, :pr:`2654`)\n        * Update dependency checkers and tests to include Dask (:pr:`2658`)\n        * Fix the tests that run with Woodwork main so they can be triggered (:pr:`2657`)\n        * Fix minimum dependency checker action (:pr:`2664`)\n        * Fix Slack alert for tests with Woodwork main branch (:pr:`2668`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`gsheni`, :user:`thehomebrewnerd`, :user:`tamargrey`, :user:`LakshmanKishore`\n\n\nv1.28.0 Oct 26, 
2023\n====================\n    * Fixes\n        * Fix bug with default value in ``PercentTrue`` primitive (:pr:`2627`)\n    * Changes\n        * Refactor ``featuretools/tests/primitive_tests/utils.py`` to leverage list comprehensions for improved Pythonic quality (:pr:`2607`)\n        * Refactor ``can_stack_primitive_on_inputs`` (:pr:`2522`)\n        * Update s3 bucket for docs image (:pr:`2593`)\n        * Temporarily restrict pandas max version to ``<2.1.0`` and pyarrow to ``<13.0.0`` (:pr:`2609`)\n        * Update for compatibility with pandas version ``2.1.0`` and remove pandas upper version restriction (:pr:`2616`)\n    * Documentation Changes\n        * Fix badge on README for tests (:pr:`2598`)\n        * Update readthedocs config to use build.os (:pr:`2601`)\n    * Testing Changes\n        * Update airflow looking glass performance tests workflow (:pr:`2615`)\n        * Removed old performance testing workflow (:pr:`2620`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`gsheni`, :user:`petejanuszewski1`, :user:`thehomebrewnerd`, :user:`tosemml`\n\nv1.27.0 Jul 24, 2023\n====================\n    * Enhancements\n        * Add support for Python 3.11 (:pr:`2583`)\n        * Add support for ``pandas`` v2.0 (:pr:`2585`)\n    * Changes\n        * Remove natural language primitives add-on (:pr:`2570`)\n        * Updates to address various warnings (:pr:`2589`)\n    * Testing Changes\n        * Run looking glass performance tests on merge via Airflow (:pr:`2575`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`gsheni`, :user:`petejanuszewski1`, :user:`sbadithe`, :user:`thehomebrewnerd`\n\nv1.26.0 Apr 27, 2023\n====================\n    * Enhancements\n        * Introduce New Single-Table DFS Algorithm (:pr:`2516`). 
This includes **experimental** functionality and is not officially supported.\n        * Add premium primitives install command (:pr:`2545`)\n    * Fixes\n        * Fix Description of ``DaysInMonth`` (:pr:`2547`)\n    * Changes\n        * Make Dask an optional dependency (:pr:`2560`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`dvreed77`, :user:`gsheni`, :user:`thehomebrewnerd`\n\nBreaking Changes\n++++++++++++++++\n* Dask is now an optional dependency of Featuretools. Users that run ``calculate_feature_matrix`` with ``n_jobs`` set\n  to anything other than 1, will now need to install Dask prior to running ``calculate_feature_matrix``. The required Dask\n  dependencies can be installed with ``pip install \"featuretools[dask]\"``.\n\nv1.25.0 Apr 13, 2023\n====================\n    * Enhancements\n        * Add ``MaxCount``, ``MedianCount``, ``MaxMinDelta``, ``NUniqueDays``, ``NMostCommonFrequency``,\n            ``NUniqueDaysOfCalendarYear``, ``NUniqueDaysOfMonth``, ``NUniqueMonths``,\n            ``NUniqueWeeks``, ``IsFirstWeekOfMonth`` (:pr:`2533`)\n        * Add ``HasNoDuplicates``, ``NthWeekOfMonth``, ``IsMonotonicallyDecreasing``, ``IsMonotonicallyIncreasing``,\n            ``IsUnique`` (:pr:`2537`)\n    * Fixes\n        * Fix release notes header version (:pr:`2544`)\n    * Changes\n        * Restrict pandas to < 2.0.0 (:pr:`2533`)\n        * Upgrade minimum pandas to 1.5.0 (:pr:`2537`)\n        * Removed the ``Correlation`` and ``AutoCorrelation`` primitive as these could lead to data leakage (:pr:`2537`)\n        * Remove IntegerNullable support for ``Kurtosis`` primitive  (:pr:`2537`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`gsheni`\n\nv1.24.0 Mar 28, 2023\n====================\n    * Enhancements\n        * Add ``AverageCountPerUnique``, ``CountryCodeToContinent``, ``FileExtension``, ``FirstLastTimeDelta``, ``SavgolFilter``,\n            ``CumulativeTimeSinceLastFalse``, 
``CumulativeTimeSinceLastTrue``, ``PercentChange``, ``PercentUnique`` (:pr:`2485`)\n        * Add ``FullNameToFirstName``, ``FullNameToLastName``, ``FullNameToTitle``, ``AutoCorrelation``,\n            ``Correlation``, ``DateFirstEvent`` (:pr:`2507`)\n        * Add ``Kurtosis``, ``MinCount``, ``NumFalseSinceLastTrue``, ``NumPeaks``,\n            ``NumTrueSinceLastFalse``, ``NumZeroCrossings`` (:pr:`2514`)\n    * Fixes\n        * Pin github-action-check-linked-issues to 1.4.5 (:pr:`2497`)\n        * Support Woodwork's update numeric inference (integers as strings) (:pr:`2505`)\n        * Update ``SubtractNumeric`` Primitive with commutative class property (:pr:`2527`)\n    * Changes\n        * Separate Makefile command for core requirements, test requirements and dev requirements (:pr:`2518`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`dvreed77`, :user:`gsheni`, :user:`ozzieD`\n\nv1.23.0 Feb 15, 2023\n====================\n    * Changes\n        * Change ``TotalWordLength`` and ``UpperCaseWordCount`` to return ``IntegerNullable`` (:pr:`2474`)\n    * Testing Changes\n       * Add GitHub Actions cache to speed up workflows (:pr:`2475`)\n       * Fix latest dependency checker install command (:pr:`2476`)\n       * Add pull request check for linked issues to CI workflow (:pr:`2477`, :pr:`2481`)\n       * Remove make package from lint workflow (:pr:`2479`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`dvreed77`, :user:`gsheni`, :user:`sbadithe`\n\nv1.22.0 Jan 31, 2023\n====================\n    * Enhancements\n        * Add ``AbsoluteDiff``, ``SameAsPrevious``, ``Variance``, ``Season``, ``UpperCaseWordCount`` transform primitives (:pr:`2460`)\n    * Fixes\n        * Fix bug with consecutive spaces in ``NumWords`` (:pr:`2459`)\n        * Fix for compatibility with ``holidays`` v0.19.0 (:pr:`2471`)\n    * Changes\n        * Specify black and ruff config arguments in pre-commit-config 
(:pr:`2456`)\n        * ``NumCharacters`` returns null given null input (:pr:`2463`)\n    * Documentation Changes\n        * Update ``release.md`` with instructions for launching Looking Glass performance test runs (:pr:`2461`)\n        * Pin ``jupyter-client==7.4.9`` to fix broken documentation build (:pr:`2463`)\n        * Unpin jupyter-client documentation requirement (:pr:`2468`)\n    * Testing Changes\n        * Add test suites for ``NumWords`` and ``NumCharacters`` primitives (:pr:`2459`, :pr:`2463`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`gsheni`, :user:`rwedge`, :user:`sbadithe`, :user:`thehomebrewnerd`\n\nv1.21.0 Jan 18, 2023\n====================\n    * Enhancements\n        * Add `get_recommended_primitives` function to featuretools (:pr:`2398`)\n    * Changes\n        * Update build_docs workflow to only run for Python 3.8 and Python 3.10 (:pr:`2447`)\n    * Documentation Changes\n        * Minor fix to release notes (:pr:`2444`)\n    * Testing Changes\n        * Add test that checks for Natural Language primitives timing out against edge-case input (:pr:`2429`)\n        * Fix test compatibility with composeml 0.10 (:pr:`2439`)\n        * Minimum dependency unit test jobs do not abort if one job fails (:pr:`2437`)\n        * Run Looking Glass performance tests on merge to main (:pr:`2440`, :pr:`2441`)\n        * Add ruff for linting and replace isort/flake8 (:pr:`2448`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`gsheni`, :user:`ozzieD`, :user:`rwedge`, :user:`sbadithe`, :user:`thehomebrewnerd`\n\nv1.20.0 Jan 5, 2023\n===================\n    * Enhancements\n        * Add ``TimeSinceLastFalse``, ``TimeSinceLastMax``, ``TimeSinceLastMin``, and ``TimeSinceLastTrue`` primitives (:pr:`2418`)\n        * Add ``MaxConsecutiveFalse``, ``MaxConsecutiveNegatives``, ``MaxConsecutivePositives``, ``MaxConsecutiveTrue``, ``MaxConsecutiveZeros``, ``NumConsecutiveGreaterMean``, 
``NumConsecutiveLessMean`` (:pr:`2420`)\n    * Fixes\n        * Fix typo in ``_handle_binary_comparison`` function name and update ``set_feature_names`` docstring (:pr:`2388`)\n        * Only allow Datetime time index as input to ``RateOfChange`` primitive (:pr:`2408`)\n        * Prevent catastrophic backtracking in regex for ``NumberOfWordsInQuotes`` (:pr:`2413`)\n        * Fix to eliminate fragmentation ``PerformanceWarning`` in ``feature_set_calculator.py`` (:pr:`2424`)\n        * Fix serialization of ``NumberOfCommonWords`` feature with custom word_set (:pr:`2432`)\n        * Improve edge case handling in NaturalLanguage primitives by standardizing delimiter regex (:pr:`2423`)\n        * Remove support for ``Datetime`` and ``Ordinal`` inputs in several primitives to prevent creation of Features that cannot be calculated (:pr:`2434`)\n    * Changes\n        * Refactor ``_all_direct_and_same_path`` by deleting call to ``_features_have_same_path`` (:pr:`2400`)\n        * Refactor ``_build_transform_features`` by iterating over ``input_features`` once (:pr:`2400`)\n        * Iterate only once over ``ignore_columns`` in ``DeepFeatureSynthesis`` init (:pr:`2397`)\n        * Resolve empty Pandas series warnings (:pr:`2403`)\n        * Initialize Woodwork with ``init_with_partial_schema`` instead of ``init`` in ``EntitySet.add_last_time_indexes`` (:pr:`2409`)\n        * Updates for compatibility with numpy 1.24.0 (:pr:`2414`)\n        * The ``delimiter_regex`` parameter for ``TotalWordLength`` has been renamed to ``do_not_count`` (:pr:`2423`)\n    * Documentation Changes\n        *  Remove unused sections from 1.19.0 notes (:pr:`2396`)\n\n   Thanks to the following people for contributing to this release:\n   :user:`gsheni`, :user:`rwedge`, :user:`sbadithe`, :user:`thehomebrewnerd`\n\n\nBreaking Changes\n++++++++++++++++\n* The ``delimiter_regex`` parameter for ``TotalWordLength`` has been renamed to ``do_not_count``.\n  Old saved features that had a non-default value 
for the parameter will no longer load.\n* Support for ``Datetime`` and ``Ordinal`` inputs has been removed from the ``LessThanScalar``,\n  ``GreaterThanScalar``, ``LessThanEqualToScalar`` and ``GreaterThanEqualToScalar`` primitives.\n\nv1.19.0 Dec 9, 2022\n===================\n    * Enhancements\n        * Add ``OneDigitPostalCode`` and ``TwoDigitPostalCode`` primitives (:pr:`2365`)\n        * Add ``ExpandingCount``, ``ExpandingMin``, ``ExpandingMean``, ``ExpandingMax``, ``ExpandingSTD``, and ``ExpandingTrend`` primitives (:pr:`2343`)\n    * Fixes\n        * Fix DeepFeatureSynthesis to consider the ``base_of_exclude`` family of attributes when creating transform features(:pr:`2380`)\n        * Fix bug with negative version numbers in ``test_version`` (:pr:`2389`)\n        * Fix bug in ``MultiplyNumericBoolean`` primitive that can cause an error with certain input dtype combinations (:pr:`2393`)\n    * Testing Changes\n        * Fix version comparison in ``test_holiday_out_of_range`` (:pr:`2382`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`sbadithe`, :user:`thehomebrewnerd`\n\nv1.18.0 Nov 15, 2022\n====================\n    * Enhancements\n        * Add ``RollingOutlierCount`` primitive (:pr:`2129`)\n        * Add ``RateOfChange`` primitive (:pr:`2359`)\n    * Fixes\n        * Sets ``uses_full_dataframe`` for ``Rolling*`` and ``Exponential*`` primitives (:pr:`2354`)\n        * Updates for compatibility with upcoming Woodwork release 0.21.0 (:pr:`2363`)\n        * Updates demo dataset location to use new links (:pr:`2366`)\n        * Fix ``test_holiday_out_of_range`` after ``holidays`` release 0.17 (:pr:`2373`)\n    * Changes\n        * Remove click and CLI functions (``list-primitives``, ``info``) (:pr:`2353`, :pr:`2358`)\n    * Documentation Changes\n        * Build docs in parallel with Sphinx (:pr:`2351`)\n        * Use non-editable install to allow local docs build (:pr:`2367`)\n        * Remove primitives.featurelabs.com 
website from documentation (:pr:`2369`)\n    * Testing Changes\n        * Replace use of pytest's tmpdir fixture with tmp_path (:pr:`2344`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`gsheni`, :user:`rwedge`, :user:`sbadithe`, :user:`tamargrey`, :user:`thehomebrewnerd`\n\nBreaking Changes\n++++++++++++++++\n* The featuretools CLI has been completely removed.\n\nv1.17.0 Oct 31, 2022\n====================\n    * Enhancements\n        * Add featuretools-sklearn-transformer as an extra installation option (:pr:`2335`)\n        * Add CountAboveMean, CountBelowMean, CountGreaterThan, CountInsideNthSTD, CountInsideRange, CountLessThan, CountOutsideNthSTD, CountOutsideRange (:pr:`2336`)\n    * Changes\n        * Restructure primitives directory to use individual primitives files (:pr:`2331`)\n        * Restrict 2022.10.1 for dask and distributed (:pr:`2347`)\n    * Documentation Changes\n        * Add Featuretools-SQL to Install page on documentation (:pr:`2337`)\n        * Fixes broken link in Featuretools documentation (:pr:`2339`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`gsheni`, :user:`rwedge`, :user:`sbadithe`, :user:`thehomebrewnerd`\n\nv1.16.0 Oct 24, 2022\n====================\n    * Enhancements\n        * Add ExponentialWeighted primitives and DateToTimeZone primitive (:pr:`2318`)\n        * Add 14 natural language primitives from ``nlp_primitives`` library (:pr:`2328`)\n    * Documentation Changes\n        * Fix typos in ``aggregation_primitive_base.py`` and ``features_deserializer.py`` (:pr:`2317`) (:pr:`2324`)\n        * Update SQL integration documentation to reflect Snowflake compatibility (:pr:`2313`)\n    * Testing Changes\n        * Add Windows install test (:pr:`2330`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`gsheni`, :user:`sbadithe`, :user:`thehomebrewnerd`\n\nv1.15.0 Oct 6, 2022\n===================\n    * Enhancements\n        
* Add ``series_library`` attribute to ``EntitySet`` dictionary (:pr:`2257`)\n        * Leverage ``Library`` Enum inheriting from ``str`` (:pr:`2275`)\n    * Changes\n        * Change default gap for Rolling* primitives from 0 to 1 to prevent accidental leakage (:pr:`2282`)\n        * Updates for pandas 1.5.0 compatibility (:pr:`2290`, :pr:`2291`, :pr:`2308`)\n        * Exclude documentation files from release workflow (:pr:`2295`)\n        * Bump requirements for optional pyspark dependency (:pr:`2299`)\n        * Bump ``scipy`` and ``woodwork[spark]`` dependencies (:pr:`2306`)\n    * Documentation Changes\n        * Add documentation describing how to use ``featuretools_sql`` with ``featuretools`` (:pr:`2262`)\n        * Remove ``featuretools_sql`` as a docs requirement (:pr:`2302`)\n        * Fix typo in ``DiffDatetime`` doctest (:pr:`2314`)\n        * Fix typo in ``EntitySet`` documentation (:pr:`2315`)\n    * Testing Changes\n        * Remove graphviz version restrictions in Windows CI tests (:pr:`2285`)\n        * Run CI tests with ``pytest -n auto`` (:pr:`2298`, :pr:`2310`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`gsheni`, :user:`rwedge`, :user:`sbadithe`, :user:`thehomebrewnerd`\n\nBreaking Changes\n++++++++++++++++\n* The ``EntitySet`` schema has been updated to include a ``series_library`` attribute\n* The default behavior of the ``Rolling*`` primitives has changed in this release. 
If this primitive was used without\n  defining the ``gap`` value, the feature values returned with this release will be different than feature values from\n  prior releases.\n\nv1.14.0 Sep 1, 2022\n===================\n    * Enhancements\n        * Replace ``NumericLag`` with ``Lag`` primitive (:pr:`2252`)\n        * Refactor build_features to speed up long running DFS calls by 50% (:pr:`2224`)\n    * Fixes\n        * Fix compatibility issues with holidays 0.15 (:pr:`2254`)\n    * Changes\n        * Update release notes to make clear conda release portion (:pr:`2249`)\n        * Use pyproject.toml only (move away from setup.cfg) (:pr:`2260`, :pr:`2263`, :pr:`2265`)\n        * Add entry point instructions for pyproject.toml project (:pr:`2272`)\n    * Documentation Changes\n        * Fix to remove warning from Using Spark EntitySets Guide (:pr:`2258`)\n    * Testing Changes\n        * Add tests/profiling/dfs_profile.py (:pr:`2224`)\n        * Add workflow to test featuretools without test dependencies (:pr:`2274`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`cp2boston`, :user:`gsheni`, :user:`ozzieD`, :user:`stefaniesmith`, :user:`thehomebrewnerd`\n\nv1.13.0 Aug 18, 2022\n====================\n    * Fixes\n        * Allow boolean columns to be included in remove_highly_correlated_features (:pr:`2231`)\n    * Changes\n        * Refactor schema version checking to use `packaging` method (:pr:`2230`)\n        * Extract duplicated logic for Rolling primitives into a general utility function (:pr:`2218`)\n        * Set pandas version to >=1.4.0 (:pr:`2246`)\n        * Remove workaround in `roll_series_with_gap` caused by pandas version < 1.4.0 (:pr:`2246`)\n    * Documentation Changes\n        * Add line breaks between sections of IsFederalHoliday primitive docstring (:pr:`2235`)\n    * Testing Changes\n        * Update create feedstock PR forked repo to use (:pr:`2223`, :pr:`2237`)\n        * Update development requirements and use 
latest for documentation (:pr:`2225`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`gsheni`, :user:`ozzieD`, :user:`sbadithe`, :user:`tamargrey`\n\nv1.12.1 Aug 4, 2022\n===================\n    * Fixes\n        * Update ``Trend`` and ``RollingTrend`` primitives to work with ``IntegerNullable`` inputs (:pr:`2204`)\n        * ``camel_and_title_to_snake`` handles snake case strings with numbers (:pr:`2220`)\n        * Change ``_get_description`` to split on blank lines to avoid truncating primitive descriptions (:pr:`2219`)\n    * Documentation Changes\n        * Add instructions to add new users to featuretools feedstock (:pr:`2215`)\n    * Testing Changes\n        * Add create feedstock PR workflow (:pr:`2181`)\n        * Add performance tests for python 3.9 and 3.10 (:pr:`2198`, :pr:`2208`)\n        * Add test to ensure primitive docstrings use standardized verbs (:pr:`2200`)\n        * Configure codecov to avoid premature PR comments (:pr:`2209`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`gsheni`, :user:`rwedge`, :user:`sbadithe`, :user:`tamargrey`, :user:`thehomebrewnerd`\n\nv1.12.0 Jul 19, 2022\n====================\n    .. 
warning::\n        This release of Featuretools will not support Python 3.7\n\n    * Enhancements\n        * Add ``IsWorkingHours`` and ``IsLunchTime`` transform primitives (:pr:`2130`)\n        * Add periods parameter to ``Diff`` and add ``DiffDatetime`` primitive (:pr:`2155`)\n        * Add ``RollingTrend`` primitive (:pr:`2170`)\n    * Fixes\n        * Resolves Woodwork integration test failure and removes Python version check for codecov (:pr:`2182`)\n    * Changes\n        * Drop Python 3.7 support (:pr:`2169`, :pr:`2186`)\n        * Add pre-commit hooks for linting (:pr:`2177`)\n    * Documentation Changes\n        * Augment single table entry in DFS to include information about passing in a dictionary for `dataframes` argument (:pr:`2160`)\n    * Testing Changes\n        * Standardize imports across test files to simplify accessing featuretools functions (:pr:`2166`)\n        * Split spark tests into multiple CI jobs to speed up runtime (:pr:`2183`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`dvreed77`, :user:`gsheni`, :user:`ozzieD`, :user:`rwedge`, :user:`sbadithe`\n\nv1.11.1 Jul 5, 2022\n===================\n    * Fixes\n        * Remove 24th hour from PartOfDay primitive and add 0th hour (:pr:`2167`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`tamargrey`\n\nv1.11.0 Jun 30, 2022\n====================\n    * Enhancements\n        * Add datetime and string types as valid arguments to dfs ``cutoff_time`` (:pr:`2147`)\n        * Add ``PartOfDay`` transform primitive (:pr:`2128`)\n        * Add ``IsYearEnd``, ``IsYearStart`` transform primitives (:pr:`2124`)\n        * Add ``Feature.set_feature_names`` method to directly set output column names for multi-output features (:pr:`2142`)\n        * Include np.nan testing for ``DayOfYear`` and ``DaysInMonth`` primitives (:pr:`2146`)\n        * Allow dfs kwargs to be passed into ``get_valid_primitives`` (:pr:`2157`)\n    * Changes\n       
 * Improve serialization and deserialization to reduce storage of duplicate primitive information (:pr:`2136`, :pr:`2127`, :pr:`2144`)\n        * Sort core requirements and test requirements in setup cfg (:pr:`2152`)\n    * Testing Changes\n        * Fix pandas warning and reduce dask .apply warnings (:pr:`2145`)\n        * Pin graphviz version used in windows tests (:pr:`2159`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`gsheni`, :user:`ozzieD`, :user:`rwedge`, :user:`sbadithe`, :user:`tamargrey`, :user:`thehomebrewnerd`\n\nv1.10.0 Jun 23, 2022\n====================\n    * Enhancements\n        * Add ``DayOfYear``, ``DaysInMonth``, ``Quarter``, ``IsLeapYear``, ``IsQuarterEnd``, ``IsQuarterStart`` transform primitives (:pr:`2110`, :pr:`2117`)\n        * Add ``IsMonthEnd``, ``IsMonthStart`` transform primitives (:pr:`2121`)\n        * Move ``Quarter`` test cases (:pr:`2123`)\n        * Add ``summarize_primitives`` function for getting metrics about available primitives (:pr:`2099`)\n    * Changes\n        * Changes for compatibility with numpy 1.23.0 (:pr:`2135`, :pr:`2137`)\n    * Documentation Changes\n        * Update contributing.md to add pandoc (:pr:`2103`, :pr:`2104`)\n        * Update NLP primitives section of API reference (:pr:`2109`)\n        * Fixing release notes formatting (:pr:`2139`)\n    * Testing Changes\n        * Latest dependency checker installs spark dependencies (:pr:`2112`)\n        * Fix test failures with pyspark v3.3.0 (:pr:`2114`, :pr:`2120`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`gsheni`, :user:`ozzieD`, :user:`rwedge`, :user:`sbadithe`, :user:`thehomebrewnerd`\n\nv1.9.2 Jun 10, 2022\n===================\n    * Fixes\n        * Add feature origin information to all multi-output feature columns (:pr:`2102`)\n    * Documentation Changes\n        * Update contributing.md to add pandoc (:pr:`2103`)\n\n    Thanks to the following people for contributing to 
this release:\n    :user:`gsheni`, :user:`thehomebrewnerd`\n\nv1.9.1 May 27, 2022\n===================\n    * Enhancements\n        * Update ``DateToHoliday`` and ``DistanceToHoliday`` primitives to work with timezone-aware inputs (:pr:`2056`)\n    * Changes\n        * Delete setup.py, MANIFEST.in and move configuration to pyproject.toml (:pr:`2046`)\n    * Documentation Changes\n        * Update slack invite link to new (:pr:`2044`)\n        * Add slack and stackoverflow icon to footer (:pr:`2087`)\n        * Update dead links in docs and docstrings (:pr:`2092`, :pr:`2095`)\n    * Testing Changes\n        * Skip test for ``normalize_dataframe`` due to different error coming from Woodwork in 0.16.3 (:pr:`2052`)\n        * Fix Woodwork install in test with Woodwork main branch (:pr:`2055`)\n        * Use codecov action v3 (:pr:`2039`)\n        * Add workflow to kickoff EvalML unit tests with Featuretools main (:pr:`2072`)\n        * Rename yml to yaml for GitHub Actions workflows (:pr:`2073`, :pr:`2077`)\n        * Update Dask test fixtures to prevent flaky behavior (:pr:`2079`)\n        * Update Makefile with better pkg command (:pr:`2081`)\n        * Add scheduled workflow that checks for broken links in documentation (:pr:`2084`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`gsheni`, :user:`rwedge`, :user:`thehomebrewnerd`\n\nv1.9.0 Apr 27, 2022\n===================\n    * Enhancements\n        * Improve ``UnusedPrimitiveWarning`` with additional information (:pr:`2003`)\n        * Update DFS primitive matching to use all inputs defined in primitive ``input_types`` (:pr:`2019`)\n        * Add ``MultiplyNumericBoolean`` primitive (:pr:`2035`)\n    * Fixes\n        * Fix issue with Ordinal inputs to binary comparison primitives (:pr:`2024`, :pr:`2025`)\n    * Changes\n        * Updated autonormalize version requirement (:pr:`2002`)\n        * Remove extra NaN checking in LatLong primitives (:pr:`1924`)\n        * Normalize 
LatLong NaN values during EntitySet creation (:pr:`1924`)\n        * Pass primitive dictionaries into ``check_primitive`` to avoid repetitive calls (:pr:`2016`)\n        * Remove ``Boolean`` and ``BooleanNullable`` from ``MultiplyNumeric`` primitive inputs (:pr:`2022`)\n        * Update serialization for compatibility with Woodwork version 0.16.1 (:pr:`2030`)\n    * Documentation Changes\n        * Update README text to Alteryx (:pr:`2010`, :pr:`2015`)\n    * Testing Changes\n        * Update unit tests with Woodwork main branch workflow name (:pr:`2033`)\n        * Add slack alert for failing unit tests with Woodwork main branch (:pr:`2040`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`dvreed77`, :user:`gsheni`, :user:`ozzieD`, :user:`rwedge`, :user:`thehomebrewnerd`\n\nNote\n++++\n* The update to the DFS algorithm in this release may cause the number of features returned\n  by ``ft.dfs`` to increase in some cases.\n\nv1.8.0 Mar 31, 2022\n===================\n    * Changes\n        * Removed ``make_trans_primitive`` and ``make_agg_primitive`` utility functions (:pr:`1970`)\n    * Documentation Changes\n        * Update project urls in setup cfg to include Twitter and Slack (:pr:`1981`)\n        * Update nbconvert to version 6.4.5 to fix docs build issue (:pr:`1984`)\n        * Update ReadMe to have centered badges and add docs badge (:pr:`1993`)\n        * Add M1 installation instructions to docs and contributing (:pr:`1997`)\n    * Testing Changes\n        * Updated scheduled workflows to only run on Alteryx owned repos (:pr:`1973`)\n        * Updated minimum dependency checker to use new version with write file support (:pr:`1975`, :pr:`1976`)\n        * Add black linting package and remove autopep8 (:pr:`1978`)\n        * Update tests for compatibility with Woodwork version 0.15.0 (:pr:`1984`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`gsheni`, :user:`thehomebrewnerd`\n\nBreaking 
Changes\n++++++++++++++++\n* The utility functions ``make_trans_primitive`` and ``make_agg_primitive`` have been removed. To create custom\n  primitives, define the primitive class directly.\n\nv1.7.0 Mar 16, 2022\n===================\n    * Enhancements\n        * Add support for Python 3.10 (:pr:`1940`)\n        * Added the SquareRoot, NaturalLogarithm, Sine, Cosine and Tangent primitives (:pr:`1948`)\n    * Fixes\n        * Updated the conda install commands to specify the channel (:pr:`1917`)\n    * Changes\n        * Update error message when DFS returns an empty list of features (:pr:`1919`)\n        * Remove ``list_variable_types`` and related directories (:pr:`1929`)\n        * Transition to use pyproject.toml and setup.cfg (moving away from setup.py) (:pr:`1941`, :pr:`1950`, :pr:`1952`, :pr:`1954`, :pr:`1957`, :pr:`1964`)\n        * Replace Koalas with pandas API on Spark (:pr:`1949`)\n    * Documentation Changes\n        * Add time series guide (:pr:`1896`)\n        * Update minimum nlp_primitives requirement for docs (:pr:`1925`)\n        * Add GitHub URL for PyPi (:pr:`1928`)\n        * Add backport release support (:pr:`1932`)\n        * Update instructions in ``release.md`` (:pr:`1963`)\n    * Testing Changes\n        * Update test cases to cover __main__.py file (:pr:`1927`)\n        * Upgrade moto requirement (:pr:`1929`, :pr:`1938`)\n        * Add Python 3.9 linting, install complete, and docs build CI tests (:pr:`1934`)\n        * Add CI workflow to test with latest woodwork main branch (:pr:`1936`)\n        * Add lower bound for wheel for minimum dependency checker and limit lint CI tests to Python 3.10 (:pr:`1945`)\n        * Fix non-deterministic test in ``test_es.py`` (:pr:`1961`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`andriyor`, :user:`gsheni`, :user:`jeff-hernandez`, :user:`kushal-gopal`, :user:`mingdavidqi`, :user:`rwedge`, :user:`tamargrey`, :user:`thehomebrewnerd`, :user:`tvdboom`\n\nBreaking 
Changes\n++++++++++++++++\n* The deprecated utility ``list_variable_types`` has been removed from Featuretools.\n\nv1.6.0 Feb 17, 2022\n===================\n    * Enhancements\n        * Add ``IsFederalHoliday`` transform primitive (:pr:`1912`)\n    * Fixes\n        * Fix to catch new ``NotImplementedError`` raised by ``holidays`` library for unknown country (:pr:`1907`)\n    * Changes\n        * Remove outdated pandas workaround code (:pr:`1906`)\n    * Documentation Changes\n        * Add in-line tabs and copy-paste functionality to docs (:pr:`1905`)\n    * Testing Changes\n        * Fix URL deserialization file (:pr:`1909`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`jeff-hernandez`, :user:`rwedge`, :user:`thehomebrewnerd`\n\n\nv1.5.0 Feb 14, 2022\n===================\n    .. warning::\n        Featuretools may not support Python 3.7 in next non-bugfix release.\n\n    * Enhancements\n        * Add ability to use offset alias strings as inputs to rolling primitives (:pr:`1809`)\n        * Update to add support for pandas version 1.4.0 (:pr:`1881`, :pr:`1895`)\n    * Fixes\n        * Fix ``featuretools_primitives`` entry point (:pr:`1891`)\n    * Changes\n        * Allow only snake camel and title case for primitives (:pr:`1854`)\n        * Add autonormalize as an add-on library (:pr:`1840`)\n        * Add DateToHoliday Transform Primitive (:pr:`1848`)\n        * Add DistanceToHoliday Transform Primitive (:pr:`1853`)\n        * Temporarily restrict pandas and koalas max versions (:pr:`1863`)\n        * Add ``__setitem__`` method to overload ``add_dataframe`` method on EntitySet (:pr:`1862`)\n        * Add support for woodwork 0.12.0 (:pr:`1872`, :pr:`1897`)\n        * Split Datetime and LatLong primitives into separate files (:pr:`1861`)\n        * Null values will not be included in index of normalized dataframe (:pr:`1897`)\n    * Documentation Changes\n        * Bump ipython version (:pr:`1857`)\n        * Update 
README.md with Alteryx link (:pr:`1886`)\n    * Testing Changes\n        * Add check for package conflicts with install workflow (:pr:`1843`)\n        * Change auto approve workflow to use assignee (:pr:`1843`)\n        * Update auto approve workflow to delete branch and change on trigger (:pr:`1852`)\n        * Upgrade tests to use compose version 0.8.0 (:pr:`1856`)\n        * Updated deep feature synthesis and feature serialization tests to use new primitive files (:pr:`1861`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`dvreed77`, :user:`gsheni`, :user:`jacobboney`, :user:`jeff-hernandez`, :user:`rwedge`, :user:`tamargrey`, :user:`thehomebrewnerd`, :user:`tuethan1999`\n\nBreaking Changes\n++++++++++++++++\n* When using ``normalize_dataframe`` to create a new dataframe, the new dataframe's index will not include a null value.\n\nv1.4.0 Jan 10, 2022\n===================\n    * Enhancements\n        * Add LatLong transform primitives - GeoMidpoint, IsInGeoBox, CityblockDistance (:pr:`1814`)\n        * Add issue templates for bugs, feature requests and documentation improvements (:pr:`1834`)\n    * Fixes\n        * Fix bug where Woodwork initialization could fail on feature matrix if cutoff times caused null values to be introduced (:pr:`1810`)\n    * Changes\n        * Skip code coverage for specific dask usage lines (:pr:`1829`)\n        * Increase minimum required numpy version to 1.21.0, scipy to 1.3.3, koalas to 1.8.1 (:pr:`1833`)\n        * Remove pyyaml as a requirement (:pr:`1833`)\n    * Documentation Changes\n        * Remove testing on conda forge in release.md (:pr:`1811`)\n    * Testing Changes\n        * Enable auto-merge for minimum and latest dependency merge requests (:pr:`1818`, :pr:`1821`, :pr:`1822`)\n        * Change auto approve workflow to use PR number and run every 30 minutes (:pr:`1827`)\n        * Add auto approve workflow to run when unit tests complete (:pr:`1837`)\n        * Test deserializing from 
S3 with mocked S3 fixtures only (:pr:`1825`)\n        * Remove fastparquet as a test requirement (:pr:`1833`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`davesque`, :user:`gsheni`, :user:`rwedge`, :user:`thehomebrewnerd`\n\nv1.3.0 Dec 2, 2021\n==================\n    * Enhancements\n        * Add ``NumericLag`` transform primitive (:pr:`1797`)\n    * Changes\n        * Update pip to 21.3.1 for test requirements (:pr:`1789`)\n    * Documentation Changes\n        * Add Docker install instructions and documentation on the install page. (:pr:`1785`)\n        * Update install page on documentation with correct python version (:pr:`1784`)\n        * Fix formatting in Improving Computational Performance guide (:pr:`1786`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`gsheni`, :user:`HenryRocha`, :user:`tamargrey`, :user:`thehomebrewnerd`\n\nv1.2.0 Nov 15, 2021\n===================\n    * Enhancements\n        * Add Rolling Transform primitives with integer parameters (:pr:`1770`)\n    * Fixes\n        * Handle new graphviz FORMATS import (:pr:`1770`)\n    * Changes\n        * Add new version of featuretools_tsfresh_primitives as an add-on library (:pr:`1772`)\n        * Add ``load_weather`` as demo dataset for time series (:pr:`1777`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`gsheni`, :user:`tamargrey`\n\nv1.1.0 Nov 2, 2021\n==================\n    * Fixes\n        * Check ``base_of_exclude`` attribute on primitive instead of feature class (:pr:`1749`)\n        * Pin upper bound for pyspark (:pr:`1748`)\n        * Fix ``get_unused_primitives`` recognizing only lowercase primitive strings (:pr:`1733`)\n        * Require newer versions of dask and distributed (:pr:`1762`)\n        * Fix bug with pass-through columns of cutoff_time df when n_jobs > 1 (:pr:`1765`)\n    * Changes\n        * Add new version of nlp_primitives as an add-on library (:pr:`1743`)\n        
* Change name of date_of_birth (column name) to birthday in mock dataset (:pr:`1754`)\n    * Documentation Changes\n        * Upgrade Sphinx and fix docs configuration error (:pr:`1760`)\n    * Testing Changes\n        * Modify CI to run unit test with latest dependencies on python 3.9 (:pr:`1738`)\n        * Added Python version standardizer to Jupyter notebook linting (:pr:`1741`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`bchen1116`, :user:`gsheni`, :user:`HenryRocha`, :user:`jeff-hernandez`, :user:`ridicolos`, :user:`rwedge`\n\nv1.0.0 Oct 12, 2021\n===================\n    * Enhancements\n        * Add support for creating EntitySets from Woodwork DataTables (:pr:`1277`)\n        * Add ``EntitySet.__deepcopy__`` that retains Woodwork typing information (:pr:`1465`)\n        * Add ``EntitySet.__getstate__`` and ``EntitySet.__setstate__`` to preserve typing when pickling (:pr:`1581`)\n        * Returned feature matrix has woodwork typing information (:pr:`1664`)\n    * Fixes\n        * Fix ``DFSTransformer`` Documentation for Featuretools 1.0 (:pr:`1605`)\n        * Fix ``calculate_feature_matrix`` time type check and ``encode_features`` for synthesis tests (:pr:`1580`)\n        * Revert reordering of categories in ``Equal`` and ``NotEqual`` primitives (:pr:`1640`)\n        * Fix bug in ``EntitySet.add_relationship`` that caused ``foreign_key`` tag to be lost (:pr:`1675`)\n        * Update DFS to not build features on last time index columns in dataframes (:pr:`1695`)\n    * Changes\n        * Remove ``add_interesting_values`` from ``Entity`` (:pr:`1269`)\n        * Move ``set_secondary_time_index`` method from ``Entity`` to ``EntitySet`` (:pr:`1280`)\n        * Refactor Relationship creation process (:pr:`1370`)\n        * Replaced ``Entity.update_data`` with ``EntitySet.update_dataframe`` (:pr:`1398`)\n        * Move validation check for uniform time index to ``EntitySet`` (:pr:`1400`)\n        * Replace ``Entity`` 
objects in ``EntitySet`` with Woodwork dataframes (:pr:`1405`)\n        * Refactor ``EntitySet.plot`` to work with Woodwork dataframes (:pr:`1468`)\n        * Move ``last_time_index`` to be a column on the DataFrame (:pr:`1456`)\n        * Update serialization/deserialization to work with Woodwork (:pr:`1452`)\n        * Refactor ``EntitySet.query_by_values`` to work with Woodwork dataframes (:pr:`1467`)\n        * Replace ``list_variable_types`` with ``list_logical_types`` (:pr:`1477`)\n        * Allow deep EntitySet equality check (:pr:`1480`)\n        * Update ``EntitySet.concat`` to work with Woodwork DataFrames (:pr:`1490`)\n        * Add function to list semantic tags (:pr:`1486`)\n        * Initialize Woodwork on feature matrix in ``remove_highly_correlated_features`` if necessary (:pr:`1618`)\n        * Remove categorical-encoding as an add-on library (will be added back later) (:pr:`1632`)\n        * Remove autonormalize as an add-on library (will be added back later) (:pr:`1636`)\n        * Remove tsfresh, nlp_primitives, sklearn_transformer as an add-on library (will be added back later) (:pr:`1638`)\n        * Update input and return types for ``CumCount`` primitive (:pr:`1651`)\n        * Standardize imports of Woodwork (:pr:`1526`)\n        * Rename target entity to target dataframe (:pr:`1506`)\n        * Replace ``entity_from_dataframe`` with ``add_dataframe`` (:pr:`1504`)\n        * Create features from Woodwork columns (:pr:`1582`)\n        * Move default variable description logic to ``generate_description`` (:pr:`1403`)\n        * Update Woodwork to version 0.4.0 with ``LogicalType.transform`` and LogicalType instances (:pr:`1451`)\n        * Update Woodwork to version 0.4.1 with Ordinal order values and whitespace serialization fix (:pr:`1478`)\n        * Use ``ColumnSchema`` for primitive input and return types (:pr:`1411`)\n        * Update features to use Woodwork and remove ``Entity`` and ``Variable`` classes (:pr:`1501`)\n        * Re-add 
``make_index`` functionality to EntitySet (:pr:`1507`)\n        * Use ``ColumnSchema`` in DFS primitive matching (:pr:`1523`)\n        * Updates from Featuretools v0.26.0 (:pr:`1539`)\n        * Leverage Woodwork better in ``add_interesting_values`` (:pr:`1550`)\n        * Update ``calculate_feature_matrix`` to use Woodwork (:pr:`1533`)\n        * Update Woodwork to version 0.6.0 with changed categorical inference (:pr:`1597`)\n        * Update ``nlp-primitives`` requirement for Featuretools 1.0 (:pr:`1609`)\n        * Remove remaining references to ``Entity`` and ``Variable`` in code (:pr:`1612`)\n        * Update Woodwork to version 0.7.1 with changed initialization (:pr:`1648`)\n        * Removes outdated workaround code related to a since-resolved pandas issue (:pr:`1677`)\n        * Remove unused ``_dataframes_equal`` and ``camel_to_snake`` functions (:pr:`1683`)\n        * Update Woodwork to version 0.8.0 for improved performance (:pr:`1689`)\n        * Remove redundant typecasting in ``encode_features`` (:pr:`1694`)\n        * Speed up ``encode_features`` if not inplace, some space cost (:pr:`1699`)\n        * Clean up comments and commented out code (:pr:`1701`)\n        * Update Woodwork to version 0.8.1 for improved performance (:pr:`1702`)\n    * Documentation Changes\n        * Add a Woodwork Typing in Featuretools guide (:pr:`1589`)\n        * Add a resource guide for transitioning to Featuretools 1.0 (:pr:`1627`)\n        * Update ``using_entitysets`` page to use Woodwork (:pr:`1532`)\n        * Update FAQ page to use Woodwork integration (:pr:`1649`)\n        * Update DFS page to be Jupyter notebook and use Woodwork integration (:pr:`1557`)\n        * Update Feature Primitives page to be Jupyter notebook and use Woodwork integration (:pr:`1556`)\n        * Update Handling Time page to be Jupyter notebook and use Woodwork integration (:pr:`1552`)\n        * Update Advanced Custom Primitives page to be Jupyter notebook and use Woodwork integration 
(:pr:`1587`)\n        * Update Deployment page to use Woodwork integration (:pr:`1588`)\n        * Update Using Dask EntitySets page to be Jupyter notebook and use Woodwork integration (:pr:`1590`)\n        * Update Specifying Primitive Options page to be Jupyter notebook and use Woodwork integration (:pr:`1593`)\n        * Update API Reference to match Featuretools 1.0 API (:pr:`1600`)\n        * Update Index page to be Jupyter notebook and use Woodwork integration (:pr:`1602`)\n        * Update Feature Descriptions page to be Jupyter notebook and use Woodwork integration (:pr:`1603`)\n        * Update Using Koalas EntitySets page to be Jupyter notebook and use Woodwork integration (:pr:`1604`)\n        * Update Glossary to use Woodwork integration (:pr:`1608`)\n        * Update Tuning DFS page to be Jupyter notebook and use Woodwork integration (:pr:`1610`)\n        * Fix small formatting issues in Documentation (:pr:`1607`)\n        * Remove Variables page and more references to variables (:pr:`1629`)\n        * Update Feature Selection page to use Woodwork integration (:pr:`1618`)\n        * Update Improving Performance page to be Jupyter notebook and use Woodwork integration (:pr:`1591`)\n        * Fix typos in transition guide (:pr:`1672`)\n        * Update installation instructions for 1.0.0rc1 announcement in docs (:pr:`1707`, :pr:`1708`, :pr:`1713`, :pr:`1716`)\n        * Fixed broken link for Demo notebook in README.md (:pr:`1728`)\n        * Update ``contributing.md`` to improve instructions for external contributors (:pr:`1723`)\n        * Manually revert changes made by :pr:`1677` and :pr:`1679`.  The related bug in pandas still exists. 
(:pr:`1731`)\n    * Testing Changes\n        * Remove entity tests (:pr:`1521`)\n        * Fix broken ``EntitySet`` tests (:pr:`1548`)\n        * Fix broken primitive tests (:pr:`1568`)\n        * Added Jupyter notebook cleaner to the linters (:pr:`1719`)\n        * Update reviewers for minimum and latest dependency checkers (:pr:`1715`)\n        * Full coverage for EntitySet.__eq__ method (:pr:`1725`)\n        * Add tests to verify all primitives can be initialized without parameter values (:pr:`1726`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`bchen1116`, :user:`gsheni`, :user:`HenryRocha`, :user:`jeff-hernandez`, :user:`rwedge`, :user:`tamargrey`, :user:`thehomebrewnerd`, :user:`VaishnaviNandakumar`\n\nBreaking Changes\n++++++++++++++++\n\n* ``Entity.add_interesting_values`` has been removed. To add interesting values for a single\n  entity, call ``EntitySet.add_interesting_values`` and pass the name of the dataframe for\n  which to add interesting values in the ``dataframe_name`` parameter (:pr:`1405`, :pr:`1370`).\n* ``Entity.set_secondary_time_index`` has been removed and replaced by ``EntitySet.set_secondary_time_index``\n  with an added ``dataframe_name`` parameter to specify the dataframe on which to set the secondary time index (:pr:`1405`, :pr:`1370`).\n* ``Relationship`` initialization has been updated to accept four name values for the parent dataframe,\n  parent column, child dataframe and child column instead of accepting two ``Variable`` objects  (:pr:`1405`, :pr:`1370`).\n* ``EntitySet.add_relationship`` has been updated to accept dataframe and column name values or a\n  ``Relationship`` object. Adding a relationship from a ``Relationship`` object now requires passing\n  the relationship as a keyword argument  (:pr:`1405`, :pr:`1370`).\n* ``Entity.update_data`` has been removed. 
To update the dataframe, call ``EntitySet.replace_dataframe`` and use the ``dataframe_name`` parameter (:pr:`1630`, :pr:`1522`).\n* The data in an ``EntitySet`` is no longer stored in ``Entity`` objects. Instead, dataframes\n  with Woodwork typing information are used. Accordingly, most language referring to “entities”\n  will now refer to “dataframes”, references to “variables” will now refer to “columns”, and\n  “variable types” will use the Woodwork type system’s “logical types” and “semantic tags” (:pr:`1405`).\n* The dictionary of tuples passed to ``EntitySet.__init__`` has replaced the ``variable_types`` element\n  with separate ``logical_types`` and ``semantic_tags`` dictionaries (:pr:`1405`).\n* ``EntitySet.entity_from_dataframe`` no longer exists. To add new tables to an entityset, use ``EntitySet.add_dataframe`` (:pr:`1405`).\n* ``EntitySet.normalize_entity`` has been renamed to ``EntitySet.normalize_dataframe`` (:pr:`1405`).\n* Instead of raising an error at ``EntitySet.add_relationship`` when the dtypes of parent and child columns\n  do not match, Featuretools will now check whether the Woodwork logical type of the parent and child columns\n  match. If they do not match, there will now be a warning raised, and Featuretools will attempt to update\n  the logical type of the child column to match the parent’s (:pr:`1405`).\n* If no index is specified at ``EntitySet.add_dataframe``, the first column will only be used as index if\n  Woodwork has not been initialized on the DataFrame. 
When adding a dataframe that already has Woodwork\n  initialized, if there is no index set, an error will be raised (:pr:`1405`).\n* Featuretools will no longer re-order columns in DataFrames so that the index column is the first column of the DataFrame (:pr:`1405`).\n* Type inference can now be performed on Dask and Koalas dataframes, though a warning will be issued\n  indicating that this may be computationally intensive (:pr:`1405`).\n* ``EntitySet.time_type`` is no longer stored as ``Variable`` objects. Instead, Woodwork typing is used, and a\n  numeric time type will be indicated by the ``'numeric'`` semantic tag string, and a datetime time type\n  will be indicated by the ``Datetime`` logical type (:pr:`1405`).\n* ``last_time_index``, ``secondary_time_index``, and ``interesting_values`` are no longer attributes\n  of an entityset’s tables that can be accessed directly. Now they must be accessed through the metadata\n  of the Woodwork DataFrame, which is a dictionary (:pr:`1405`).\n* The helper function ``list_variable_types`` will be removed in a future release and replaced by ``list_logical_types``.\n  In the meantime, ``list_variable_types`` will return the same output as ``list_logical_types`` (:pr:`1447`).\n\nWhat's New in this Release\n++++++++++++++++++++++++++\n\n**Adding Interesting Values**\n\nTo add interesting values for a single dataframe, call ``EntitySet.add_interesting_values``, passing the\nname of the dataframe for which interesting values should be added.\n\n.. code-block:: python\n\n    >>> es.add_interesting_values(dataframe_name='log')\n\n**Setting a Secondary Time Index**\n\nTo set a secondary time index for a specific dataframe, call ``EntitySet.set_secondary_time_index``, passing\nthe dataframe name for which to set the secondary time index along with the dictionary mapping the secondary time\nindex column to the columns for which the secondary time index applies.\n\n.. 
code-block:: python\n\n    >>> customers_secondary_time_index = {'cancel_date': ['cancel_reason']}\n    >>> es.set_secondary_time_index(dataframe_name='customers', secondary_time_index=customers_secondary_time_index)\n\n**Creating a Relationship and Adding to an EntitySet**\n\nRelationships are now created by passing parameters identifying the entityset along with four string values\nspecifying the parent dataframe, parent column, child dataframe and child column. Specifying parameter names\nis optional.\n\n.. code-block:: python\n\n    >>> new_relationship = Relationship(\n    ...     entityset=es,\n    ...     parent_dataframe_name='customers',\n    ...     parent_column_name='id',\n    ...     child_dataframe_name='sessions',\n    ...     child_column_name='customer_id'\n    ... )\n\nRelationships can now be added to EntitySets in one of two ways. The first approach is to pass in\nname values for the parent dataframe, parent column, child dataframe and child column. Specifying\nparameter names is optional with this approach.\n\n.. code-block:: python\n\n    >>> es.add_relationship(\n    ...     parent_dataframe_name='customers',\n    ...     parent_column_name='id',\n    ...     child_dataframe_name='sessions',\n    ...     child_column_name='customer_id'\n    ... )\n\nRelationships can also be added by passing in a previously created ``Relationship`` object. When using\nthis approach the ``relationship`` parameter name must be included.\n\n.. code-block:: python\n\n    >>> es.add_relationship(relationship=new_relationship)\n\n**Replace DataFrame**\n\nTo replace a dataframe in an EntitySet with a new dataframe, call ``EntitySet.replace_dataframe`` and pass in the name of the dataframe to replace along with the new data.\n\n.. code-block:: python\n\n    >>> es.replace_dataframe(dataframe_name='log', df=df)\n\n**List Logical Types and Semantic Tags**\n\nLogical types and semantic tags have replaced variable types to parse and interpret columns. 
You can list all the available logical types by calling ``featuretools.list_logical_types``.\n\n.. code-block:: python\n\n    >>> ft.list_logical_types()\n\nYou can list all the available semantic tags by calling ``featuretools.list_semantic_tags``.\n\n.. code-block:: python\n\n    >>> ft.list_semantic_tags()\n\nv0.27.1 Sep 2, 2021\n===================\n    * Documentation Changes\n        * Add banner to docs about upcoming Featuretools 1.0 release (:pr:`1669`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`thehomebrewnerd`\n\nv0.27.0 Aug 31, 2021\n====================\n    * Changes\n        * Remove autonormalize, tsfresh, nlp_primitives, sklearn_transformer, categorical_encoding as add-on libraries (will be added back later) (:pr:`1644`)\n        * Emit a warning message when a ``featuretools_primitives`` entrypoint\n          throws an exception (:pr:`1662`)\n        * Throw a ``RuntimeError`` when two primitives with the same name are\n          encountered during ``featuretools_primitives`` entrypoint handling\n          (:pr:`1662`)\n        * Prevent the ``featuretools_primitives`` entrypoint loader from\n          loading non-class objects as well as the ``AggregationPrimitive`` and\n          ``TransformPrimitive`` base classes (:pr:`1662`)\n    * Testing Changes\n        * Update latest dependency checker with proper install command (:pr:`1652`)\n        * Update isort dependency (:pr:`1654`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`davesque`, :user:`gsheni`, :user:`jeff-hernandez`, :user:`rwedge`\n\nv0.26.2 Aug 17, 2021\n====================\n    * Documentation Changes\n        * Specify conda channel and Windows exe in graphviz installation instructions (:pr:`1611`)\n        * Remove GA token from the layout html (:pr:`1622`)\n    * Testing Changes\n        * Add additional reviewers to minimum and latest dependency checkers (:pr:`1558`, :pr:`1562`, :pr:`1564`, 
:pr:`1567`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`gsheni`, :user:`simha104`\n\nv0.26.1 Jul 23, 2021\n====================\n    * Fixes\n        * Set ``name`` attribute for ``EmailAddressToDomain`` primitive (:pr:`1543`)\n    * Documentation Changes\n        * Remove and ignore unnecessary graph files (:pr:`1544`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`davesque`, :user:`rwedge`\n\nv0.26.0 Jul 15, 2021\n====================\n    * Enhancements\n        * Add ``replace_inf_values`` utility function for replacing ``inf`` values in a feature matrix (:pr:`1505`)\n        * Add URLToProtocol, URLToDomain, URLToTLD, EmailAddressToDomain, IsFreeEmailDomain as transform primitives (:pr:`1508`, :pr:`1531`)\n    * Fixes\n        * ``include_entities`` correctly overrides ``exclude_entities`` in ``primitive_options`` (:pr:`1518`)\n    * Documentation Changes\n        * Prevent logging on build (:pr:`1498`)\n    * Testing Changes\n        * Test featuretools on pandas 1.3.0 release candidate and make fixes (:pr:`1492`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`frances-h`, :user:`gsheni`, :user:`rwedge`, :user:`tamargrey`, :user:`thehomebrewnerd`, :user:`tuethan1999`\n\nv0.25.0 Jun 11, 2021\n====================\n    * Enhancements\n       * Add ``get_valid_primitives`` function (:pr:`1462`)\n       * Add ``EntitySet.dataframe_type`` attribute (:pr:`1473`)\n    * Changes\n        * Upgrade minimum alteryx open source update checker to 2.0.0 (:pr:`1460`)\n    * Testing Changes\n        * Upgrade minimum pip requirement for testing to 21.1.2 (:pr:`1475`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`gsheni`, :user:`rwedge`\n\nv0.24.1 May 26, 2021\n====================\n    * Fixes\n        * Update minimum pyyaml requirement to 5.4 (:pr:`1433`)\n        * Update minimum psutil requirement to 5.6.6 (:pr:`1438`)\n  
  * Documentation Changes\n        * Update nbsphinx version to fix docs build issue (:pr:`1436`)\n    * Testing Changes\n        * Create separate worksflows for each CI job (:pr:`1422`)\n        * Add minimum dependency checker to generate minimum requirement files (:pr:`1428`)\n        * Add unit tests against minimum dependencies for python 3.7 on PRs and main (:pr:`1432`, :pr:`1445`)\n        * Update minimum urllib3 requirement to 1.26.5 (:pr:`1457`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`gsheni`, :user:`jeff-hernandez`, :user:`rwedge`, :user:`thehomebrewnerd`\n\nv0.24.0 Apr 30, 2021\n====================\n    * Changes\n        * Add auto assign bot on GitHub (:pr:`1380`)\n        * Reduce DFS max_depth to 1 if single entity in entityset (:pr:`1412`)\n        * Drop Python 3.6 support (:pr:`1413`)\n    * Documentation Changes\n        * Improve formatting of release notes (:pr:`1396`)\n    * Testing Changes\n        * Update Dask/Koalas test fixtures (:pr:`1382`)\n        * Update Spark config in test fixtures and docs (:pr:`1387`, :pr:`1389`)\n        * Don't cancel other CI jobs if one fails (:pr:`1386`)\n        * Update boto3 and urllib3 version requirements (:pr:`1394`)\n        * Update token for dependency checker PR creation (:pr:`1402`, :pr:`1407`, :pr:`1409`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`gsheni`, :user:`jeff-hernandez`, :user:`rwedge`, :user:`tamargrey`, :user:`thehomebrewnerd`\n\nv0.23.3 Mar 31, 2021\n====================\n    .. 
warning::\n        The next non-bugfix release of Featuretools will not support Python 3.6\n\n    * Changes\n        * Minor updates to work with Koalas version 1.7.0 (:pr:`1351`)\n        * Explicitly mention Python 3.8 support in setup.py classifiers (:pr:`1371`)\n        * Fix issue with smart-open version 5.0.0 (:pr:`1372`, :pr:`1376`)\n    * Testing Changes\n        * Make release notes updated check separate from unit tests (:pr:`1347`)\n        * Performance tests now specify which commit to check (:pr:`1354`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`gsheni`, :user:`rwedge`, :user:`thehomebrewnerd`\n\nv0.23.2 Feb 26, 2021\n====================\n    .. warning::\n        The next non-bugfix release of Featuretools will not support Python 3.6\n\n    * Enhancements\n        * The ``list_primitives`` function returns valid input types and the return type (:pr:`1341`)\n    * Fixes\n        * Restrict numpy version when installing koalas (:pr:`1329`)\n    * Changes\n        * Warn python 3.6 users support will be dropped in future release (:pr:`1344`)\n    * Documentation Changes\n        * Update docs for defining custom primitives (:pr:`1332`)\n        * Update featuretools release instructions (:pr:`1345`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`gsheni`, :user:`jeff-hernandez`, :user:`rwedge`\n\nv0.23.1 Jan 29, 2021\n====================\n    * Fixes\n        * Calculate direct features uses default value if parent missing (:pr:`1312`)\n        * Fix bug and improve tests for ``EntitySet.__eq__`` and ``Entity.__eq__`` (:pr:`1323`)\n    * Documentation Changes\n        * Update Twitter link to documentation toolbar (:pr:`1322`)\n    * Testing Changes\n        * Unpin python-graphviz package on Windows (:pr:`1296`)\n        * Reorganize and clean up tests (:pr:`1294`, :pr:`1303`, :pr:`1306`)\n        * Trigger tests on pull request events (:pr:`1304`, :pr:`1315`)\n        * 
Remove unnecessary test skips on Windows (:pr:`1320`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`gsheni`, :user:`jeff-hernandez`, :user:`rwedge`, :user:`seriallazer`, :user:`thehomebrewnerd`\n\nv0.23.0 Dec 31, 2020\n====================\n    * Fixes\n        * Fix logic for inferring variable type from unusual dtype (:pr:`1273`)\n        * Allow passing entities without relationships to ``calculate_feature_matrix`` (:pr:`1290`)\n    * Changes\n        * Move ``query_by_values`` method from ``Entity`` to ``EntitySet`` (:pr:`1251`)\n        * Move ``_handle_time`` method from ``Entity`` to ``EntitySet`` (:pr:`1276`)\n        * Remove usage of ``ravel`` to resolve unexpected warning with pandas 1.2.0 (:pr:`1286`)\n    * Documentation Changes\n        * Fix installation command for Add-ons (:pr:`1279`)\n        * Fix various broken links in documentation (:pr:`1313`)\n    * Testing Changes\n        * Use repository-scoped token for dependency check (:pr:`1245`:, :pr:`1248`)\n        * Fix install error during docs CI test (:pr:`1250`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`gsheni`, :user:`jeff-hernandez`, :user:`rwedge`, :user:`thehomebrewnerd`\n\nBreaking Changes\n++++++++++++++++\n\n* ``Entity.query_by_values`` has been removed and replaced by ``EntitySet.query_by_values`` with an\n  added ``entity_id`` parameter to specify which entity in the entityset should be used for the query.\n\nv0.22.0 Nov 30, 2020\n====================\n    * Enhancements\n        * Allow variable descriptions to be set directly on variable (:pr:`1207`)\n        * Add ability to add feature description captions to feature lineage graphs (:pr:`1212`)\n        * Add support for local tar file in read_entityset (:pr:`1228`)\n    * Fixes\n        * Updates to fix unit test errors from koalas 1.4 (:pr:`1230`, :pr:`1232`)\n    * Documentation Changes\n        * Removed link to unused feedback board (:pr:`1220`)\n 
       * Update footer with Alteryx Innovation Labs (:pr:`1221`)\n        * Update links to repo in documentation to use alteryx org url (:pr:`1224`)\n    * Testing Changes\n        * Update release notes check to use new repo url (:pr:`1222`)\n        * Use new version of pull request Github Action (:pr:`1234`)\n        * Upgrade pip during featuretools[complete] test (:pr:`1236`)\n        * Migrated CI tests to github actions (:pr:`1226`, :pr:`1237`, :pr:`1239`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`frances-h`, :user:`gsheni`, :user:`jeff-hernandez`, :user:`kmax12`, :user:`rwedge`, :user:`thehomebrewnerd`\n\nv0.21.0 Oct 30, 2020\n====================\n    * Enhancements\n        * Add ``describe_feature`` to generate an English language feature description for a given feature (:pr:`1201`)\n    * Fixes\n        * Update ``EntitySet.add_last_time_indexes`` to work with Koalas 1.3.0 (:pr:`1192`, :pr:`1202`)\n    * Changes\n        * Keep koalas requirements in separate file (:pr:`1195`)\n    * Documentation Changes\n        * Added footer to the documentation (:pr:`1189`)\n        * Add guide for feature selection functions (:pr:`1184`)\n        * Fix README.md badge with correct link (:pr:`1200`)\n    * Testing Changes\n        * Add ``pyspark`` and ``koalas`` to automated dependency checks (:pr:`1191`)\n        * Add DockerHub credentials to CI testing environment (:pr:`1204`)\n        * Update premium primitives job name on CI (:pr:`1205`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`frances-h`, :user:`gsheni`, :user:`jeff-hernandez`, :user:`rwedge`, :user:`tamargrey`, :user:`thehomebrewnerd`\n\nv0.20.0 Sep 30, 2020\n====================\n    .. warning::\n        The Text variable type has been deprecated and been replaced with the NaturalLanguage variable type. 
The Text variable type will be removed in a future release.\n\n    * Fixes\n        * Allow FeatureOutputSlice features to be serialized (:pr:`1150`)\n        * Fix duplicate label column generation when labels are passed in cutoff times and approximate is being used (:pr:`1160`)\n        * Determine calculate_feature_matrix behavior with approximate and a cutoff df that is a subclass of a pandas DataFrame (:pr:`1166`)\n    * Changes\n        * Text variable type has been replaced with NaturalLanguage (:pr:`1159`)\n    * Documentation Changes\n        * Update release doc for clarity and to add Future Release template (:pr:`1151`)\n        * Use the PyData Sphinx theme (:pr:`1169`)\n    * Testing Changes\n        * Stop requiring single-threaded dask scheduler in tests (:pr:`1163`, :pr:`1170`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`gsheni`, :user:`rwedge`, :user:`tamargrey`, :user:`tuethan1999`\n\nv0.19.0 Sep 8, 2020\n===================\n    * Enhancements\n        * Support use of Koalas DataFrames in entitysets (:pr:`1031`)\n        * Add feature selection functions for null, correlated, and single value features (:pr:`1126`)\n    * Fixes\n        * Fix ``encode_features`` converting excluded feature columns to a numeric dtype (:pr:`1123`)\n        * Improve performance of unused primitive check in dfs (:pr:`1140`)\n    * Changes\n        * Remove the ability to stack transform primitives (:pr:`1119`, :pr:`1145`)\n        * Sort primitives passed to ``dfs`` to get consistent ordering of features\\* (:pr:`1119`)\n    * Documentation Changes\n        * Added return values to dfs and calculate_feature_matrix (:pr:`1125`)\n    * Testing Changes\n        * Better test case for normalizing from no time index to time index (:pr:`1113`)\n\n    \\* When passing multiple instances of a primitive built with ``make_trans_primitive``\n    or ``make_agg_primitive``, those instances must have the same relative order when passed\n    
to ``dfs`` to ensure a consistent ordering of features.\n\n    Thanks to the following people for contributing to this release:\n    :user:`frances-h`, :user:`gsheni`, :user:`rwedge`, :user:`tamargrey`, :user:`thehomebrewnerd`, :user:`tuethan1999`\n\n\nBreaking Changes\n++++++++++++++++\n\n* ``ft.dfs`` will no longer build features from Transform primitives where one\n  of the inputs is a Transform feature, a GroupByTransform feature,\n  or a Direct Feature of a Transform / GroupByTransform feature. This will make some\n  features that would previously be generated by ``ft.dfs`` only possible if\n  explicitly specified in ``seed_features``.\n\nv0.18.1 Aug 12, 2020\n====================\n    * Fixes\n        * Fix ``EntitySet.plot()`` when given a dask entityset (:pr:`1086`)\n    * Changes\n        * Use ``nlp-primitives[complete]`` install for ``nlp_primitives`` extra in ``setup.py`` (:pr:`1103`)\n    * Documentation Changes\n        * Fix broken downloads badge in README.md (:pr:`1107`)\n    * Testing Changes\n        * Use CircleCI matrix jobs in config to trigger multiple runs of same job with different parameters (:pr:`1105`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`gsheni`, :user:`systemshift`, :user:`thehomebrewnerd`\n\nv0.18.0 Jul 31, 2020\n====================\n    * Enhancements\n        * Warn user if supplied primitives are not used during dfs (:pr:`1073`)\n    * Fixes\n        * Use more consistent and uniform warnings (:pr:`1040`)\n        * Fix issue with missing instance ids and categorical entity index (:pr:`1050`)\n        * Remove warnings.simplefilter in feature_set_calculator to un-silence warnings (:pr:`1053`)\n        * Fix feature visualization for features with '>' or '<' in name (:pr:`1055`)\n        * Fix boolean dtype mismatch between encode_features and dfs and calculate_feature_matrix (:pr:`1082`)\n        * Update primitive options to check reversed inputs if primitive is commutative 
(:pr:`1085`)\n        * Fix inconsistent ordering of features between kernel restarts (:pr:`1088`)\n    * Changes\n        * Make DFS match ``TimeSince`` primitive with all ``Datetime`` types (:pr:`1048`)\n        * Change default branch to ``main`` (:pr:`1038`)\n        * Raise TypeError if improper input is supplied to ``Entity.delete_variables()`` (:pr:`1064`)\n        * Updates for compatibility with pandas 1.1.0 (:pr:`1079`, :pr:`1089`)\n        * Set pandas version to pandas>=0.24.1,<2.0.0. Filter pandas deprecation warning in Week primitive. (:pr:`1094`)\n    * Documentation Changes\n        * Remove benchmarks folder (:pr:`1049`)\n        * Add custom variables types section to variables page (:pr:`1066`)\n    * Testing Changes\n        * Add fixture for ``ft.demo.load_mock_customer`` (:pr:`1036`)\n        * Refactor Dask test units (:pr:`1052`)\n        * Implement automated process for checking critical dependencies (:pr:`1045`, :pr:`1054`, :pr:`1081`)\n        * Don't run changelog check for release PRs or automated dependency PRs (:pr:`1057`)\n        * Fix non-deterministic behavior in Dask test causing codecov issues (:pr:`1070`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`frances-h`, :user:`gsheni`, :user:`monti-python`, :user:`rwedge`,\n    :user:`systemshift`,  :user:`tamargrey`, :user:`thehomebrewnerd`, :user:`wsankey`\n\nv0.17.0 Jun 30, 2020\n====================\n    * Enhancements\n        * Add ``list_variable_types`` and ``graph_variable_types`` for Variable Types (:pr:`1013`)\n        * Add ``graph_feature`` to generate a feature lineage graph for a given feature (:pr:`1032`)\n    * Fixes\n        * Improve warnings when using a Dask dataframe for cutoff times (:pr:`1026`)\n        * Error if attempting to add entityset relationship where child variable is also child index (:pr:`1034`)\n    * Changes\n        * Remove ``Feature.get_names`` (:pr:`1021`)\n        * Remove unnecessary ``pd.Series`` and 
``pd.DatetimeIndex`` calls from primitives (:pr:`1020`, :pr:`1024`)\n        * Improve cutoff time handling when a single value or no value is passed (:pr:`1028`)\n        * Moved ``find_variable_types`` to Variable utils (:pr:`1013`)\n    * Documentation Changes\n        * Add page on Variable Types to describe some Variable Types, and util functions (:pr:`1013`)\n        * Remove featuretools enterprise from documentation (:pr:`1022`)\n        * Add development install instructions to contributing.md (:pr:`1030`)\n    * Testing Changes\n        * Add ``required`` flag to CircleCI codecov upload command (:pr:`1035`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`frances-h`, :user:`gsheni`, :user:`kmax12`, :user:`rwedge`,\n    :user:`thehomebrewnerd`, :user:`tuethan1999`\n\nBreaking Changes\n++++++++++++++++\n\n* Removed ``Feature.get_names``; ``Feature.get_feature_names`` should be used instead\n\nv0.16.0 Jun 5, 2020\n===================\n    * Enhancements\n        * Support use of Dask DataFrames in entitysets (:pr:`783`)\n        * Add ``make_index`` when initializing an EntitySet by passing in an ``entities`` dictionary (:pr:`1010`)\n        * Add ability to use primitive classes and instances as keys in primitive_options dictionary (:pr:`993`)\n    * Fixes\n        * Cleanly close tqdm instance (:pr:`1018`)\n        * Resolve issue with ``NaN`` values in ``LatLong`` columns (:pr:`1007`)\n    * Testing Changes\n        * Update tests for numpy v1.19.0 compatibility (:pr:`1016`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`Alex-Monahan`, :user:`frances-h`, :user:`gsheni`, :user:`rwedge`, :user:`thehomebrewnerd`\n\nv0.15.0 May 29, 2020\n====================\n    * Enhancements\n        * Add ``get_default_aggregation_primitives`` and ``get_default_transform_primitives`` (:pr:`945`)\n        * Allow cutoff time dataframe columns to be in any order (:pr:`969`, :pr:`995`)\n        * Add Age 
primitive, and make it a default transform primitive for DFS (:pr:`987`)\n        * Add ``include_cutoff_time`` arg - control whether data at cutoff times are included in feature calculations (:pr:`959`)\n        * Allow ``variable_types`` to be referenced by their ``type_string``\n          for the ``entity_from_dataframe`` function (:pr:`988`)\n    * Fixes\n        * Fix errors with Equals and NotEquals primitives when comparing categoricals or different dtypes (:pr:`968`)\n        * Normalized type_strings of ``Variable`` classes so that the ``find_variable_types`` function produces a\n          dictionary with a clear key to name transition (:pr:`982`, :pr:`996`)\n        * Remove pandas.datetime in test_calculate_feature_matrix due to deprecation (:pr:`998`)\n    * Documentation Changes\n        * Add python 3.8 support for docs (:pr:`983`)\n        * Adds consistent Entityset Docstrings (:pr:`986`)\n    * Testing Changes\n        * Add automated tests for python 3.8 environment (:pr:`847`)\n        * Update testing dependencies (:pr:`976`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`ctduffy`, :user:`frances-h`, :user:`gsheni`, :user:`jeff-hernandez`, :user:`rightx2`, :user:`rwedge`, :user:`sebrahimi1988`, :user:`thehomebrewnerd`,  :user:`tuethan1999`\n\nBreaking Changes\n++++++++++++++++\n\n* Calls to ``featuretools.dfs`` or ``featuretools.calculate_feature_matrix`` that use a cutoff time\n  dataframe, but do not label the time column with either the target entity time index variable name or\n  as ``time``, will now result in an ``AttributeError``. Previously, the time column was selected to be the first\n  column that was not the instance id column. With this update, the position of the column in the dataframe is\n  no longer used to determine the time column. 
Now, both instance id columns and time columns in a cutoff time\n  dataframe can be in any order as long as they are named properly.\n\n* The ``type_string`` attributes of all ``Variable`` subclasses are now a snake case conversion of their class names. This\n  changes the ``type_string`` of the ``Unknown``, ``IPAddress``, ``EmailAddress``, ``SubRegionCode``, ``FilePath``, ``LatLong``, and ``ZIPcode`` classes.\n  Old saved entitysets that used these variables may load incorrectly.\n\nv0.14.0 Apr 30, 2020\n====================\n    * Enhancements\n        * ft.encode_features - use less memory for one-hot encoded columns (:pr:`876`)\n    * Fixes\n        * Use logger.warning to fix deprecated logger.warn (:pr:`871`)\n        * Add dtype to interesting_values to fix deprecated empty Series with no dtype (:pr:`933`)\n        * Remove overlap in training windows (:pr:`930`)\n        * Fix progress bar in notebook (:pr:`932`)\n    * Changes\n        * Change premium primitives CI test to Python 3.6 (:pr:`916`)\n        * Remove Python 3.5 support (:pr:`917`)\n    * Documentation Changes\n        * Fix README links to docs (:pr:`872`)\n        * Fix Github links with correct organizations (:pr:`908`)\n        * Fix hyperlinks in docs and docstrings with updated address (:pr:`910`)\n        * Remove unused script for uploading docs to AWS (:pr:`911`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`frances-h`, :user:`gsheni`, :user:`jeff-hernandez`, :user:`rwedge`\n\nBreaking Changes\n++++++++++++++++\n\n* Using training windows in feature calculations can result in different values than previous versions.\n  This was done to prevent consecutive training windows from overlapping by excluding data at the oldest point in time.\n  For example, if we use a cutoff time at the first minute of the hour with a one hour training window,\n  the first minute of the previous hour will no longer be included in the feature calculation.\n\nv0.13.4 Mar 
27, 2020\n====================\n    .. warning::\n        The next non-bugfix release of Featuretools will not support Python 3.5\n\n    * Fixes\n        * Fix ft.show_info() not displaying in Jupyter notebooks (:pr:`863`)\n    * Changes\n        * Added Plugin Warnings at Entry Point (:pr:`850`, :pr:`869`)\n    * Documentation Changes\n        * Add links to primitives.featurelabs.com (:pr:`860`)\n        * Add source code links to API reference (:pr:`862`)\n        * Update links for testing Dask/Spark integrations (:pr:`867`)\n        * Update release documentation for featuretools (:pr:`868`)\n    * Testing Changes\n        * Miscellaneous changes (:pr:`861`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`frances-h`, :user:`FreshLeaf8865`, :user:`jeff-hernandez`, :user:`rwedge`, :user:`thehomebrewnerd`\n\nv0.13.3 Feb 28, 2020\n====================\n    * Fixes\n        * Fix a connection closed error when using n_jobs (:pr:`853`)\n    * Changes\n        * Pin msgpack dependency for Python 3.5; remove dataframe from Dask dependency (:pr:`851`)\n    * Documentation Changes\n        * Update link to help documentation page in Github issue template (:pr:`855`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`frances-h`, :user:`rwedge`\n\nv0.13.2 Jan 31, 2020\n====================\n    * Enhancements\n        * Support for Pandas 1.0.0 (:pr:`844`)\n    * Changes\n        * Remove dependency on s3fs library for anonymous downloads from S3 (:pr:`825`)\n    * Testing Changes\n        * Added GitHub Action to automatically run performance tests (:pr:`840`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`frances-h`, :user:`rwedge`\n\nv0.13.1 Dec 28, 2019\n====================\n    * Fixes\n        * Raise error when given wrong input for ignore_variables (:pr:`826`)\n        * Fix multi-output features not created when there is no child data (:pr:`834`)\n        * 
Removing type casting in Equals and NotEquals primitives (:pr:`504`)\n    * Changes\n        * Replace pd.timedelta time units that were deprecated (:pr:`822`)\n        * Move sklearn wrapper to separate library (:pr:`835`, :pr:`837`)\n    * Testing Changes\n        * Run unit tests in windows environment (:pr:`790`)\n        * Update boto3 version requirement for tests (:pr:`838`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`jeffzi`, :user:`kmax12`, :user:`rwedge`, :user:`systemshift`\n\nv0.13.0 Nov 30, 2019\n====================\n    * Enhancements\n        * Added GitHub Action to auto upload releases to PyPI (:pr:`816`)\n    * Fixes\n        * Fix issue where some primitive options would not be applied (:pr:`807`)\n        * Fix issue with converting to pickle or parquet after adding interesting features (:pr:`798`, :pr:`823`)\n        * Diff primitive now calculates using all available data (:pr:`824`)\n        * Prevent DFS from creating Identity Features of globally ignored variables (:pr:`819`)\n    * Changes\n        * Remove python 2.7 support from serialize.py (:pr:`812`)\n        * Make smart_open, boto3, and s3fs optional dependencies (:pr:`827`)\n    * Documentation Changes\n        * remove python 2.7 support and add 3.7 in install.rst (:pr:`805`)\n        * Fix import error in docs (:pr:`803`)\n        * Fix release title formatting in changelog (:pr:`806`)\n    * Testing Changes\n        * Use multiple CPUS to run tests on CI (:pr:`811`)\n        * Refactor test entityset creation to avoid saving to disk (:pr:`813`, :pr:`821`)\n        * Remove get_values() from test_es.py to remove warnings (:pr:`820`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`frances-h`, :user:`jeff-hernandez`, :user:`rwedge`, :user:`systemshift`\n\nBreaking Changes\n++++++++++++++++\n\n* The libraries used for downloading or uploading from S3 or URLs are now\n  optional and will no longer be 
installed by default.  To use this\n  functionality they will need to be installed separately.\n* The fix to how the Diff primitive is calculated may slow down the overall\n  calculation time of feature lists that use this primitive.\n\nv0.12.0 Oct 31, 2019\n====================\n    * Enhancements\n        * Added First primitive (:pr:`770`)\n        * Added Entropy aggregation primitive (:pr:`779`)\n        * Allow custom naming for multi-output primitives (:pr:`780`)\n    * Fixes\n        * Prevents user from removing base entity time index using additional_variables (:pr:`768`)\n        * Fixes error when a multioutput primitive was supplied to dfs as a groupby trans primitive (:pr:`786`)\n    * Changes\n        * Drop Python 2 support (:pr:`759`)\n        * Add unit parameter to AvgTimeBetween (:pr:`771`)\n        * Require Pandas 0.24.1 or higher (:pr:`787`)\n    * Documentation Changes\n        * Update featuretools slack link (:pr:`765`)\n        * Set up repo to use Read the Docs (:pr:`776`)\n        * Add First primitive to API reference docs (:pr:`782`)\n    * Testing Changes\n        * CircleCI fixes (:pr:`774`)\n        * Disable PIP progress bars (:pr:`775`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`ablacke-ayx`, :user:`BoopBoopBeepBoop`, :user:`jeffzi`,\n    :user:`kmax12`, :user:`rwedge`, :user:`thehomebrewnerd`, :user:`twdobson`\n\nv0.11.0 Sep 30, 2019\n====================\n    .. 
warning::\n        The next non-bugfix release of Featuretools will not support Python 2\n\n    * Enhancements\n        * Improve how files are copied and written (:pr:`721`)\n        * Add number of rows to graph in entityset.plot (:pr:`727`)\n        * Added support for pandas DateOffsets in DFS and Timedelta (:pr:`732`)\n        * Enable feature-specific top_n value using a dictionary in encode_features (:pr:`735`)\n        * Added progress_callback parameter to dfs() and calculate_feature_matrix() (:pr:`739`, :pr:`745`)\n        * Enable specifying primitives on a per column or per entity basis (:pr:`748`)\n    * Fixes\n        * Fixed entity set deserialization (:pr:`720`)\n        * Added error message when DateTimeIndex is a variable but not set as the time_index (:pr:`723`)\n        * Fixed CumCount and other group-by transform primitives that take ID as input (:pr:`733`, :pr:`754`)\n        * Fix progress bar undercounting (:pr:`743`)\n        * Updated training_window error assertion to only check against observations (:pr:`728`)\n        * Don't delete the whole destination folder while saving entityset (:pr:`717`)\n    * Changes\n        * Raise warning and not error on schema version mismatch (:pr:`718`)\n        * Change feature calculation to return in order of instance ids provided (:pr:`676`)\n        * Removed time remaining from displayed progress bar in dfs() and calculate_feature_matrix() (:pr:`739`)\n        * Raise warning in normalize_entity() when time_index of base_entity has an invalid type (:pr:`749`)\n        * Remove toolz as a direct dependency (:pr:`755`)\n        * Allow boolean variable types to be used in the Multiply primitive (:pr:`756`)\n    * Documentation Changes\n        * Updated URL for Compose (:pr:`716`)\n    * Testing Changes\n        * Update dependencies (:pr:`738`, :pr:`741`, :pr:`747`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`angela97lin`, :user:`chidauri`, 
:user:`christopherbunn`,\n    :user:`frances-h`, :user:`jeff-hernandez`, :user:`kmax12`,\n    :user:`MarcoGorelli`, :user:`rwedge`, :user:`thehomebrewnerd`\n\nBreaking Changes\n++++++++++++++++\n\n* Feature calculations will return in the order of instance ids provided instead of the order of time points instances are calculated at.\n\nv0.10.1 Aug 25, 2019\n====================\n    * Fixes\n        * Fix serialized LatLong data being loaded as strings (:pr:`712`)\n    * Documentation Changes\n        * Fixed FAQ cell output (:pr:`710`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`gsheni`, :user:`rwedge`\n\n\nv0.10.0 Aug 19, 2019\n====================\n    .. warning::\n        The next non-bugfix release of Featuretools will not support Python 2\n\n\n    * Enhancements\n        * Give more frequent progress bar updates and update chunk size behavior (:pr:`631`, :pr:`696`)\n        * Added drop_first as param in encode_features (:pr:`647`)\n        * Added support for stacking multi-output primitives (:pr:`679`)\n        * Generate transform features of direct features (:pr:`623`)\n        * Added serializing and deserializing from S3 and deserializing from URLs (:pr:`685`)\n        * Added nlp_primitives as an add-on library (:pr:`704`)\n        * Added AutoNormalize to Featuretools plugins (:pr:`699`)\n        * Added functionality for relative units (month/year) in Timedelta (:pr:`692`)\n        * Added categorical-encoding as an add-on library (:pr:`700`)\n    * Fixes\n        * Fix performance regression in DFS (:pr:`637`)\n        * Fix deserialization of feature relationship path (:pr:`665`)\n        * Set index after adding ancestor relationship variables (:pr:`668`)\n        * Fix user-supplied variable_types modification in Entity init (:pr:`675`)\n        * Don't calculate dependencies of unnecessary features (:pr:`667`)\n        * Prevent normalize entity's new entity having same index as base entity (:pr:`681`)\n  
      * Update variable type inference to better check for string values (:pr:`683`)\n    * Changes\n        * Moved dask, distributed imports (:pr:`634`)\n    * Documentation Changes\n        * Miscellaneous changes (:pr:`641`, :pr:`658`)\n        * Modified doc_string of top_n in encoding (:pr:`648`)\n        * Hyperlinked ComposeML (:pr:`653`)\n        * Added FAQ (:pr:`620`, :pr:`677`)\n        * Fixed FAQ question with multiple question marks (:pr:`673`)\n    * Testing Changes\n        * Add master, and release tests for premium primitives (:pr:`660`, :pr:`669`)\n        * Miscellaneous changes (:pr:`672`, :pr:`674`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`alexjwang`, :user:`allisonportis`, :user:`ayushpatidar`,\n    :user:`CJStadler`, :user:`ctduffy`, :user:`gsheni`, :user:`jeff-hernandez`,\n    :user:`jeremyliweishih`, :user:`kmax12`, :user:`rwedge`, :user:`zhxt95`,\n\nv0.9.1 Jul 3, 2019\n====================\n    * Enhancements\n        * Speedup groupby transform calculations (:pr:`609`)\n        * Generate features along all paths when there are multiple paths between entities (:pr:`600`, :pr:`608`)\n    * Fixes\n        * Select columns of dataframe using a list (:pr:`615`)\n        * Change type of features calculated on Index features to Categorical (:pr:`602`)\n        * Filter dataframes through forward relationships (:pr:`625`)\n        * Specify Dask version in requirements for python 2 (:pr:`627`)\n        * Keep dataframe sorted by time during feature calculation (:pr:`626`)\n        * Fix bug in encode_features that created duplicate columns of\n          features with multiple outputs (:pr:`622`)\n    * Changes\n        * Remove unused variance_selection.py file (:pr:`613`)\n        * Remove Timedelta data param (:pr:`619`)\n        * Remove DaysSince primitive (:pr:`628`)\n    * Documentation Changes\n        * Add installation instructions for add-on libraries (:pr:`617`)\n        * Clarification of 
Multi Output Feature Creation (:pr:`638`)\n        * Miscellaneous changes (:pr:`632`, :pr:`639`)\n    * Testing Changes\n        * Miscellaneous changes (:pr:`595`, :pr:`612`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`CJStadler`, :user:`kmax12`, :user:`rwedge`, :user:`gsheni`, :user:`kkleidal`, :user:`ctduffy`\n\nv0.9.0 Jun 19, 2019\n===================\n    * Enhancements\n        * Add unit parameter to timesince primitives (:pr:`558`)\n        * Add ability to install optional add-on libraries (:pr:`551`)\n        * Load and save features from open files and strings (:pr:`566`)\n        * Support custom variable types (:pr:`571`)\n        * Support entitysets which have multiple paths between two entities (:pr:`572`, :pr:`544`)\n        * Added show_info function, more output information added to CLI `featuretools info` (:pr:`525`)\n    * Fixes\n        * Normalize_entity specifies error when 'make_time_index' is an invalid string (:pr:`550`)\n        * Schema version added for entityset serialization (:pr:`586`)\n        * Renamed features have names correctly serialized (:pr:`585`)\n        * Improved error message for index/time_index being the same column in normalize_entity and entity_from_dataframe (:pr:`583`)\n        * Removed all mentions of allow_where (:pr:`587`, :pr:`588`)\n        * Removed unused variable in normalize entity (:pr:`589`)\n        * Change time since return type to numeric (:pr:`606`)\n    * Changes\n        * Refactor get_pandas_data_slice to take single entity (:pr:`547`)\n        * Updates TimeSincePrevious and Diff Primitives (:pr:`561`)\n        * Remove unnecessary time_last variable (:pr:`546`)\n    * Documentation Changes\n        * Add Featuretools Enterprise to documentation (:pr:`563`)\n        * Miscellaneous changes (:pr:`552`, :pr:`573`, :pr:`577`, :pr:`599`)\n    * Testing Changes\n        * Miscellaneous changes (:pr:`559`, :pr:`569`, :pr:`570`, :pr:`574`, :pr:`584`, 
:pr:`590`)\n\n    Thanks to the following people for contributing to this release:\n    :user:`alexjwang`, :user:`allisonportis`, :user:`CJStadler`, :user:`ctduffy`, :user:`gsheni`, :user:`kmax12`, :user:`rwedge`\n\nv0.8.0 May 17, 2019\n===================\n    * Rename NUnique to NumUnique (:pr:`510`)\n    * Serialize features as JSON (:pr:`532`)\n    * Drop all variables at once in normalize_entity (:pr:`533`)\n    * Remove unnecessary sorting from normalize_entity (:pr:`535`)\n    * Features cache their names (:pr:`536`)\n    * Only calculate features for instances before cutoff (:pr:`523`)\n    * Remove all relative imports (:pr:`530`)\n    * Added FullName Variable Type (:pr:`506`)\n    * Add error message when target entity does not exist (:pr:`520`)\n    * New demo links (:pr:`542`)\n    * Remove duplicate features check in DFS (:pr:`538`)\n    * featuretools_primitives entry point expects list of primitive classes (:pr:`529`)\n    * Update ALL_VARIABLE_TYPES list (:pr:`526`)\n    * More Informative N Jobs Prints and Warnings (:pr:`511`)\n    * Update sklearn version requirements (:pr:`541`)\n    * Update Makefile (:pr:`519`)\n    * Remove unused parameter in Entity._handle_time (:pr:`524`)\n    * Remove build_ext code from setup.py (:pr:`513`)\n    * Documentation updates (:pr:`512`, :pr:`514`, :pr:`515`, :pr:`521`, :pr:`522`, :pr:`527`, :pr:`545`)\n    * Testing updates (:pr:`509`, :pr:`516`, :pr:`517`, :pr:`539`)\n\n    Thanks to the following people for contributing to this release: :user:`bphi`, :user:`CharlesBradshaw`, :user:`CJStadler`, :user:`glentennis`, :user:`gsheni`, :user:`kmax12`, :user:`rwedge`\n\nBreaking Changes\n++++++++++++++++\n\n* ``NUnique`` has been renamed to ``NumUnique``.\n\n    Previous behavior\n\n    .. code-block:: python\n\n        from featuretools.primitives import NUnique\n\n    New behavior\n\n    .. 
code-block:: python\n\n        from featuretools.primitives import NumUnique\n\nv0.7.1 Apr 24, 2019\n===================\n    * Automatically generate feature name for controllable primitives (:pr:`481`)\n    * Primitive docstring updates (:pr:`489`, :pr:`492`, :pr:`494`, :pr:`495`)\n    * Change primitive functions that returned strings to return functions (:pr:`499`)\n    * CLI customizable via entrypoints (:pr:`493`)\n    * Improve calculation of aggregation features on grandchildren (:pr:`479`)\n    * Refactor entrypoints to use decorator (:pr:`483`)\n    * Include doctests in testing suite (:pr:`491`)\n    * Documentation updates (:pr:`490`)\n    * Update how standard primitives are imported internally (:pr:`482`)\n\n    Thanks to the following people for contributing to this release: :user:`bukosabino`, :user:`CharlesBradshaw`, :user:`glentennis`, :user:`gsheni`, :user:`jeff-hernandez`, :user:`kmax12`, :user:`minkvsky`, :user:`rwedge`, :user:`thehomebrewnerd`\n\nv0.7.0 Mar 29, 2019\n===================\n    * Improve Entity Set Serialization (:pr:`361`)\n    * Support calling a primitive instance's function directly (:pr:`461`, :pr:`468`)\n    * Support other libraries extending featuretools functionality via entrypoints (:pr:`452`)\n    * Remove featuretools install command (:pr:`475`)\n    * Add GroupByTransformFeature (:pr:`455`, :pr:`472`, :pr:`476`)\n    * Update Haversine Primitive (:pr:`435`, :pr:`462`)\n    * Add commutative argument to SubtractNumeric and DivideNumeric primitives (:pr:`457`)\n    * Add FilePath variable_type (:pr:`470`)\n    * Add PhoneNumber, DateOfBirth, URL variable types (:pr:`447`)\n    * Generalize infer_variable_type, convert_variable_data and convert_all_variable_data methods (:pr:`423`)\n    * Documentation updates (:pr:`438`, :pr:`446`, :pr:`458`, :pr:`469`)\n    * Testing updates (:pr:`440`, :pr:`444`, :pr:`445`, :pr:`459`)\n\n    Thanks to the following people for contributing to this release: :user:`bukosabino`, 
:user:`CharlesBradshaw`, :user:`ColCarroll`, :user:`glentennis`, :user:`grayskripko`, :user:`gsheni`, :user:`jeff-hernandez`, :user:`jrkinley`, :user:`kmax12`, :user:`RogerTangos`, :user:`rwedge`\n\nBreaking Changes\n++++++++++++++++\n\n* ``ft.dfs`` now has a ``groupby_trans_primitives`` parameter that DFS uses to automatically construct features that group by an ID column and then apply a transform primitive to each group. This change applies to the following primitives: ``CumSum``, ``CumCount``, ``CumMean``, ``CumMin``, and ``CumMax``.\n\n    Previous behavior\n\n    .. code-block:: python\n\n        ft.dfs(entityset=es,\n               target_entity='customers',\n               trans_primitives=[\"cum_mean\"])\n\n    New behavior\n\n    .. code-block:: python\n\n        ft.dfs(entityset=es,\n               target_entity='customers',\n               groupby_trans_primitives=[\"cum_mean\"])\n\n* Related to the above change, cumulative transform features are now defined using a new feature class, ``GroupByTransformFeature``.\n\n    Previous behavior\n\n    .. code-block:: python\n\n        ft.Feature([base_feature, groupby_feature], primitive=CumulativePrimitive)\n\n\n    New behavior\n\n    .. 
code-block:: python\n\n        ft.Feature(base_feature, groupby=groupby_feature, primitive=CumulativePrimitive)\n\n\nv0.6.1 Feb 15, 2019\n===================\n    * Cumulative primitives (:pr:`410`)\n    * Entity.query_by_values now preserves row order of underlying data (:pr:`428`)\n    * Implementing Country Code and Sub Region Codes as variable types (:pr:`430`)\n    * Added IPAddress and EmailAddress variable types (:pr:`426`)\n    * Install data and dependencies (:pr:`403`)\n    * Add TimeSinceFirst, fix TimeSinceLast (:pr:`388`)\n    * Allow user to pass in desired feature return types (:pr:`372`)\n    * Add new configuration object (:pr:`401`)\n    * Replace NUnique get_function (:pr:`434`)\n    * _calculate_idenity_features now only returns the features asked for, instead of the entire entity (:pr:`429`)\n    * Primitive function name uniqueness (:pr:`424`)\n    * Update NumCharacters and NumWords primitives (:pr:`419`)\n    * Removed Variable.dtype (:pr:`416`, :pr:`433`)\n    * Change to zipcode rep, str for pandas (:pr:`418`)\n    * Remove pandas version upper bound (:pr:`408`)\n    * Make S3 dependencies optional (:pr:`404`)\n    * Check that agg_primitives and trans_primitives are right primitive type (:pr:`397`)\n    * Mean primitive changes (:pr:`395`)\n    * Fix transform stacking on multi-output aggregation (:pr:`394`)\n    * Fix list_primitives (:pr:`391`)\n    * Handle graphviz dependency (:pr:`389`, :pr:`396`, :pr:`398`)\n    * Testing updates (:pr:`402`, :pr:`417`, :pr:`433`)\n    * Documentation updates (:pr:`400`, :pr:`409`, :pr:`415`, :pr:`417`, :pr:`420`, :pr:`421`, :pr:`422`, :pr:`431`)\n\n\n    Thanks to the following people for contributing to this release: :user:`CharlesBradshaw`, :user:`csala`, :user:`floscha`, :user:`gsheni`, :user:`jxwolstenholme`, :user:`kmax12`, :user:`RogerTangos`, :user:`rwedge`\n\nv0.6.0 Jan 30, 2019\n===================\n    * Primitive refactor (:pr:`364`)\n    * Mean ignore NaNs (:pr:`379`)\n    * Plotting 
entitysets (:pr:`382`)\n    * Add seed features later in DFS process (:pr:`357`)\n    * Multiple output column features (:pr:`376`)\n    * Add ZipCode Variable Type (:pr:`367`)\n    * Add `primitive.get_filepath` and example of primitive loading data from external files (:pr:`380`)\n    * Transform primitives take series as input (:pr:`385`)\n    * Update dependency requirements (:pr:`378`, :pr:`383`, :pr:`386`)\n    * Add modulo to override tests (:pr:`384`)\n    * Update documentation (:pr:`368`, :pr:`377`)\n    * Update README.md (:pr:`366`, :pr:`373`)\n    * Update CI tests (:pr:`359`, :pr:`360`, :pr:`375`)\n\n    Thanks to the following people for contributing to this release: :user:`floscha`, :user:`gsheni`, :user:`kmax12`, :user:`RogerTangos`, :user:`rwedge`\n\nv0.5.1 Dec 17, 2018\n===================\n    * Add missing dependencies (:pr:`353`)\n    * Move comment to note in documentation (:pr:`352`)\n\nv0.5.0 Dec 17, 2018\n===================\n    * Add specific error for duplicate additional/copy_variables in normalize_entity (:pr:`348`)\n    * Removed EntitySet._import_from_dataframe (:pr:`346`)\n    * Removed time_index_reduce parameter (:pr:`344`)\n    * Allow installation of additional primitives (:pr:`326`)\n    * Fix DatetimeIndex variable conversion (:pr:`342`)\n    * Update Sklearn DFS Transformer (:pr:`343`)\n    * Clean up entity creation logic (:pr:`336`)\n    * remove casting to list in transform feature calculation (:pr:`330`)\n    * Fix sklearn wrapper (:pr:`335`)\n    * Add readme to pypi\n    * Update conda docs after move to conda-forge (:pr:`334`)\n    * Add wrapper for scikit-learn Pipelines (:pr:`323`)\n    * Remove parse_date_cols parameter from EntitySet._import_from_dataframe (:pr:`333`)\n\n    Thanks to the following people for contributing to this release: :user:`bukosabino`, :user:`georgewambold`, :user:`gsheni`, :user:`jeff-hernandez`, :user:`kmax12`, and :user:`rwedge`.\n\nv0.4.1 Nov 29, 2018\n===================\n    * Resolve 
bug preventing using first column as index by default (:pr:`308`)\n    * Handle return type when creating features from Id variables (:pr:`318`)\n    * Make id an optional parameter of EntitySet constructor (:pr:`324`)\n    * Handle primitives with same function being applied to same column (:pr:`321`)\n    * Update requirements (:pr:`328`)\n    * Clean up DFS arguments (:pr:`319`)\n    * Clean up Pandas Backend (:pr:`302`)\n    * Update properties of cumulative transform primitives (:pr:`320`)\n    * Feature stability between versions documentation (:pr:`316`)\n    * Add download count to GitHub readme (:pr:`310`)\n    * Fixed #297 update tests to check error strings (:pr:`303`)\n    * Remove usage of fixtures in agg primitive tests (:pr:`325`)\n\nv0.4.0 Oct 31, 2018\n===================\n    * Remove ft.utils.gen_utils.getsize and make pympler a test requirement (:pr:`299`)\n    * Update requirements.txt (:pr:`298`)\n    * Refactor EntitySet.find_path(...) (:pr:`295`)\n    * Clean up unused methods (:pr:`293`)\n    * Remove unused parents property of Entity (:pr:`283`)\n    * Removed relationships parameter (:pr:`284`)\n    * Improve time index validation (:pr:`285`)\n    * Encode features with \"unknown\" class in categorical (:pr:`287`)\n    * Allow where clauses on direct features in Deep Feature Synthesis (:pr:`279`)\n    * Change to fullargsspec (:pr:`288`)\n    * Parallel verbose fixes (:pr:`282`)\n    * Update tests for python 3.7 (:pr:`277`)\n    * Check duplicate rows cutoff times (:pr:`276`)\n    * Load retail demo data using compressed file (:pr:`271`)\n\nv0.3.1 Sep 28, 2018\n===================\n    * Handling time rewrite (:pr:`245`)\n    * Update deep_feature_synthesis.py (:pr:`249`)\n    * Handling return type when creating features from DatetimeTimeIndex (:pr:`266`)\n    * Update retail.py (:pr:`259`)\n    * Improve Consistency of Transform Primitives (:pr:`236`)\n    * Update demo docstrings (:pr:`268`)\n    * Handle non-string column names 
(:pr:`255`)\n    * Clean up merging of aggregation primitives (:pr:`250`)\n    * Add tests for Entity methods (:pr:`262`)\n    * Handle no child data when calculating aggregation features with multiple arguments (:pr:`264`)\n    * Add `is_string` utils function (:pr:`260`)\n    * Update python versions to match docker container (:pr:`261`)\n    * Handle where clause when no child data (:pr:`258`)\n    * No longer cache demo csvs, remove config file (:pr:`257`)\n    * Avoid stacking \"expanding\" primitives (:pr:`238`)\n    * Use randomly generated names in retail csv (:pr:`233`)\n    * Update README.md (:pr:`243`)\n\nv0.3.0 Aug 27, 2018\n===================\n    * Improve performance of all feature calculations (:pr:`224`)\n    * Update agg primitives to use more efficient functions (:pr:`215`)\n    * Optimize metadata calculation (:pr:`229`)\n    * More robust handling when no data at a cutoff time (:pr:`234`)\n    * Workaround categorical merge (:pr:`231`)\n    * Switch which CSV is associated with which variable (:pr:`228`)\n    * Remove unused kwargs from query_by_values, filter_and_sort (:pr:`225`)\n    * Remove convert_links_to_integers (:pr:`219`)\n    * Add conda install instructions (:pr:`223`, :pr:`227`)\n    * Add example of using Dask to parallelize to docs  (:pr:`221`)\n\nv0.2.2 Aug 20, 2018\n===================\n    * Remove unnecessary check no related instances call and refactor (:pr:`209`)\n    * Improve memory usage through support for pandas categorical types (:pr:`196`)\n    * Bump minimum pandas version from 0.20.3 to 0.23.0 (:pr:`216`)\n    * Better parallel memory warnings (:pr:`208`, :pr:`214`)\n    * Update demo datasets (:pr:`187`, :pr:`201`, :pr:`207`)\n    * Make primitive lookup case insensitive  (:pr:`213`)\n    * Use capital name (:pr:`211`)\n    * Set class name for Min (:pr:`206`)\n    * Remove ``variable_types`` from normalize entity (:pr:`205`)\n    * Handle parquet serialization with last time index (:pr:`204`)\n    * Reset index 
of cutoff times in calculate feature matrix (:pr:`198`)\n    * Check argument types for .normalize_entity (:pr:`195`)\n    * Type checking ignore entities.  (:pr:`193`)\n\nv0.2.1 Jul 2, 2018\n==================\n    * Cpu count fix (:pr:`176`)\n    * Update flight (:pr:`175`)\n    * Move feature matrix calculation helper functions to separate file (:pr:`177`)\n\nv0.2.0 Jun 22, 2018\n===================\n    * Multiprocessing (:pr:`170`)\n    * Handle unicode encoding in repr throughout Featuretools (:pr:`161`)\n    * Clean up EntitySet class (:pr:`145`)\n    * Add support for building and uploading conda package (:pr:`167`)\n    * Parquet serialization (:pr:`152`)\n    * Remove variable stats (:pr:`171`)\n    * Make sure index variable comes first (:pr:`168`)\n    * No last time index update on normalize (:pr:`169`)\n    * Remove list of times as an option for `cutoff_time` in `calculate_feature_matrix` (:pr:`165`)\n    * Config does error checking to see if it can write to disk (:pr:`162`)\n\n\nv0.1.21 May 30, 2018\n====================\n    * Support Pandas 0.23.0 (:pr:`153`, :pr:`154`, :pr:`155`, :pr:`159`)\n    * No EntitySet required in loading/saving features (:pr:`141`)\n    * Use s3 demo csv with better column names (:pr:`139`)\n    * more reasonable start parameter (:pr:`149`)\n    * add issue template (:pr:`133`)\n    * Improve tests (:pr:`136`, :pr:`137`, :pr:`144`, :pr:`147`)\n    * Remove unused functions (:pr:`140`, :pr:`143`, :pr:`146`)\n    * Update documentation after recent changes / removals (:pr:`157`)\n    * Rename demo retail csv file (:pr:`148`)\n    * Add names for binary (:pr:`142`)\n    * EntitySet repr to use get_name rather than id (:pr:`134`)\n    * Ensure config dir is writable (:pr:`135`)\n\nv0.1.20 Apr 13, 2018\n====================\n    * Primitives as strings in DFS parameters (:pr:`129`)\n    * Integer time index bugfixes (:pr:`128`)\n    * Add make_temporal_cutoffs utility function (:pr:`126`)\n    * Show all entities, switch 
shape display to row/col (:pr:`124`)\n    * Improved chunking when calculating feature matrices  (:pr:`121`)\n    * fixed num characters nan fix (:pr:`118`)\n    * modify ignore_variables docstring (:pr:`117`)\n\nv0.1.19 Mar 21, 2018\n====================\n    * More descriptive DFS progress bar (:pr:`69`)\n    * Convert text variable to string before NumWords (:pr:`106`)\n    * EntitySet.concat() reindexes relationships (:pr:`96`)\n    * Keep non-feature columns when encoding feature matrix (:pr:`111`)\n    * Uses full entity update for dependencies of uses_full_entity features (:pr:`110`)\n    * Update column names in retail demo (:pr:`104`)\n    * Handle Transform features that need access to all values of entity (:pr:`91`)\n\nv0.1.18 Feb 27, 2018\n====================\n    * fixes related instances bug (:pr:`97`)\n    * Adding non-feature columns to calculated feature matrix (:pr:`78`)\n    * Relax numpy version req (:pr:`82`)\n    * Remove `entity_from_csv`, tests, and lint (:pr:`71`)\n\nv0.1.17 Jan 18, 2018\n====================\n    * LatLong type (:pr:`57`)\n    * Last time index fixes (:pr:`70`)\n    * Make median agg primitives ignore nans by default (:pr:`61`)\n    * Remove Python 3.4 support (:pr:`64`)\n    * Change `normalize_entity` to update `secondary_time_index` (:pr:`59`)\n    * Unpin requirements (:pr:`53`)\n    * associative -> commutative (:pr:`56`)\n    * Add Words and Chars primitives (:pr:`51`)\n\nv0.1.16 Dec 19, 2017\n====================\n    * fix EntitySet.combine_variables and standardize encode_features (:pr:`47`)\n    * Python 3 compatibility (:pr:`16`)\n\nv0.1.15 Dec 18, 2017\n====================\n    * Fix variable type in demo data (:pr:`37`)\n    * Custom primitive kwarg fix (:pr:`38`)\n    * Changed order and text of arguments in make_trans_primitive docstring (:pr:`42`)\n\nv0.1.14 Nov 20, 2017\n====================\n    * Last time index (:pr:`33`)\n    * Update Scipy version to 1.0.0 (:pr:`31`)\n\n\nv0.1.13 Nov 1, 
2017\n===================\n    * Add MANIFEST.in (:pr:`26`)\n\nv0.1.11 Oct 31, 2017\n====================\n    * Package linting (:pr:`7`)\n    * Custom primitive creation functions (:pr:`13`)\n    * Split requirements to separate files and pin to latest versions (:pr:`15`)\n    * Select low information features (:pr:`18`)\n    * Fix docs typos (:pr:`19`)\n    * Fixed Diff primitive for rare nan case (:pr:`21`)\n    * added some missing doc strings (:pr:`23`)\n    * Trend fix (:pr:`22`)\n    * Remove as_dir=False option from EntitySet.to_pickle() (:pr:`20`)\n    * Entity Normalization Preserves Types of Copy & Additional Variables (:pr:`25`)\n\nv0.1.10 Oct 12, 2017\n====================\n    * NumTrue primitive added and docstring of other primitives updated (:pr:`11`)\n    * fixed hash issue with same base features (:pr:`8`)\n    * Head fix (:pr:`9`)\n    * Fix training window (:pr:`10`)\n    * Add associative attribute to primitives (:pr:`3`)\n    * Add status badges, fix license in setup.py (:pr:`1`)\n    * fixed head printout and flight demo index (:pr:`2`)\n\nv0.1.9 Sep 8, 2017\n==================\n    * Documentation improvements\n    * New ``featuretools.demo.load_mock_customer`` function\n\nv0.1.8 Sep 1, 2017\n==================\n    * Bug fixes\n    * Added ``Percentile`` transform primitive\n\nv0.1.7 Aug 17, 2017\n===================\n    * Performance improvements for approximate in ``calculate_feature_matrix`` and ``dfs``\n    * Added ``Week`` transform primitive\n\nv0.1.6 Jul 26, 2017\n===================\n    * Added ``load_features`` and ``save_features`` to persist and reload features\n    * Added save_progress argument to ``calculate_feature_matrix``\n    * Added approximate parameter to ``calculate_feature_matrix`` and ``dfs``\n    * Added ``load_flight`` to ft.demo\n\nv0.1.5 Jul 11, 2017\n===================\n    * Windows support\n\nv0.1.3 Jul 10, 2017\n===================\n    * Renamed feature submodule to primitives\n    * Renamed 
prediction_entity arguments to target_entity\n    * Added training_window parameter to ``calculate_feature_matrix``\n\nv0.1.2 Jul 3rd, 2017\n====================\n    * Initial release\n\n.. command\n.. git log --pretty=oneline --abbrev-commit\n"
  },
  {
    "path": "docs/source/resources/ecosystem.rst",
    "content": ":description: A list of libraries, use cases / demos, and tutorials that leverage Featuretools\n\n===============================\nFeaturetools External Ecosystem\n===============================\n\nNew projects are regularly being built on top of Featuretools, highlighting the importance of automated feature engineering. On this page, we have a list of libraries, use cases / demos, and tutorials that leverage Featuretools. If you would like to add a project, please contact us or submit a pull request on `GitHub`_.\n\n.. _`GitHub`: https://github.com/alteryx/featuretools\n\n.. note::\n\n    We are proud and excited to share the work of people using Featuretools, but we cannot endorse or provide support for the tools on this page.\n\n---------\nLibraries\n---------\n\n\n`MLBlocks`_\n===========\n- MLBlocks is a simple framework for composing end-to-end tunable Machine Learning Pipelines by seamlessly combining tools from any python library with a simple, common and uniform interface. MLBlocks contains a primitive that uses Featuretools.\n\n.. _`MLBlocks`: https://github.com/HDI-Project/MLBlocks\n\n`Cardea`_\n=========\n- Cardea is a machine learning library built on top of the FHIR data schema. It uses a number of **automl** tools, including Featuretools.\n\n.. _`Cardea`: https://github.com/D3-AI/Cardea\n\n-----------------\nDemos & Use Cases\n-----------------\n`Predict customer lifetime value`_\n==================================\n- A common use case for machine learning is to predict customer lifetime value. This article walks through the importance of this prediction problem using Featuretools in the process.\n\n.. _`Predict customer lifetime value`: https://towardsdatascience.com/automating-interpretable-feature-engineering-for-predicting-clv-87ece7da9b36\n\n`Predict NHL playoff matches`_\n==============================\n- Many users of `Kaggle`_ are eager to use Featuretools to improve their model performance. 
In this blog post, a Kaggle user takes a dataset of plays from National Hockey League games and creates a model to predict if a game is a playoff match.\n\n.. _`Predict NHL playoff matches`: https://towardsdatascience.com/automated-feature-engineering-for-predictive-modeling-d8c9fa4e478b\n.. _`Kaggle`: https://www.kaggle.com/\n\n`Predict poverty of households in Costa Rica`_\n==============================================\n- Social programs have a difficult time determining the right people to give aid to. Using a dataset of Costa Rican household characteristics, this Kaggle kernel predicts the poverty of households.\n\n.. _`Predict poverty of households in Costa Rica`: https://www.kaggle.com/willkoehrsen/featuretools-for-good\n\n`Predicting Functional Threshold Power (FTP)`_\n==============================================\n- This notebook and accompanying report evaluate the use of machine learning for predicting a cyclist’s FTP using data collected from previous training sessions. Featuretools is used to generate a set of independent variables that capture changes in performance over time.\n\n.. _`Predicting Functional Threshold Power (FTP)`: https://github.com/jrkinley/ftp_proba\n\n.. note::\n\n    For more demos written by `Feature Labs <https://www.featurelabs.com>`_, see `featuretools.com/demos <https://www.featuretools.com/demos/>`_\n\n---------\nTutorials\n---------\n`Automated Feature Engineering in Python`_\n==========================================\n- This article provides a walk-through of how to use a retail dataset with DFS.\n\n.. _`Automated Feature Engineering in Python`: https://towardsdatascience.com/automated-feature-engineering-in-python-99baf11cc219\n\n`A Hands-On Guide to Automated Feature Engineering`_\n====================================================\n- An **in-depth** tutorial that works through using Featuretools to predict future product sales at \"BigMart\".\n\n.. 
_`A Hands-On Guide to Automated Feature Engineering`: https://www.analyticsvidhya.com/blog/2018/08/guide-automated-feature-engineering-featuretools-python/\n\n`Introduction to Automated Feature Engineering Using DFS`_\n==========================================================\n- This article demonstrates how using Featuretools helps automate the manual process of feature engineering on a dataset of home loans.\n\n.. _`Introduction to Automated Feature Engineering Using DFS`: https://heartbeat.fritz.ai/introduction-to-automated-feature-engineering-using-deep-feature-synthesis-dfs-3feb69a7c00b\n\n`Automated Feature Engineering Workshop`_\n=========================================\n- An automated feature engineering workshop using Featuretools hosted at the 2017 Data Summer Conference.\n\n.. _`Automated Feature Engineering Workshop`: https://github.com/fred-navruzov/featuretools-workshop\n\n`Tutorial in Japanese`_\n=======================\n- A tutorial of Featuretools that demonstrates integrating with the feature selection library `Boruta`_ and the hyperparameter tuning library `Optuna`_.\n\n.. _`Tutorial in Japanese`: https://dev.classmethod.jp/machine-learning/yoshim-featuretools-boruta-optuna/\n.. _`Optuna`: https://github.com/pfnet/optuna\n.. _`Boruta`: https://github.com/scikit-learn-contrib/boruta_py\n\n`Building a Churn Prediction Model using Featuretools`_\n=======================================================\n- A video tutorial that shows how to build a churn prediction model using Featuretools along with `Spark`_, `XGBoost`_, and `Google Cloud Platform`_.\n\n.. _`Building a Churn Prediction Model using Featuretools`: https://youtu.be/ZwwneZ6iU3Y\n.. _`Spark`: https://spark.apache.org/\n.. _`XGBoost`: https://github.com/dmlc/xgboost\n.. 
_`Google Cloud Platform`: https://cloud.google.com/\n\n`Automated Feature Engineering Workshop in Russian`_\n====================================================\n- A video tutorial that shows how to predict if an applicant is capable of repaying a loan using Featuretools.\n\n.. _`Automated Feature Engineering Workshop in Russian`: https://youtu.be/R0-mnamKxqY\n"
  },
  {
    "path": "docs/source/resources/frequently_asked_questions.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Frequently Asked Questions\\n\",\n    \"\\n\",\n    \"Here we are attempting to answer some commonly asked questions that appear on Github, and Stack Overflow.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import pandas as pd\\n\",\n    \"import woodwork as ww\\n\",\n    \"\\n\",\n    \"import featuretools as ft\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## EntitySet\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### How do I get a list of column names and types in an `EntitySet`?\\n\",\n    \"\\n\",\n    \"After you create your `EntitySet`, you may wish to view the column names. An `EntitySet` contains multiple DataFrames, one for each table in the `EntitySet`.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es = ft.demo.load_mock_customer(return_entityset=True)\\n\",\n    \"es\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"If you want to view the underlying Dataframe, you can do the following:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es[\\\"transactions\\\"].head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"If you want view the columns and types for the \\\"transactions\\\" DataFrame, you can do the following:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es[\\\"transactions\\\"].ww\"\n   ]\n  },\n  {\n   \"cell_type\": 
\"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### What is the difference between `copy_columns` and `additional_columns`?\\n\",\n    \"The function `normalize_dataframe` creates a new DataFrame and a relationship from unique values of an existing DataFrame. It takes 2 similar arguments:\\n\",\n    \"\\n\",\n    \"- `additional_columns` removes columns from the base DataFrame and moves them to the new DataFrame. \\n\",\n    \"- `copy_columns` keeps the given columns in the base DataFrame, but also copies them to the new DataFrame.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"data = ft.demo.load_mock_customer()\\n\",\n    \"transactions_df = data[\\\"transactions\\\"].merge(data[\\\"sessions\\\"]).merge(data[\\\"customers\\\"])\\n\",\n    \"products_df = data[\\\"products\\\"]\\n\",\n    \"\\n\",\n    \"es = ft.EntitySet(id=\\\"customer_data\\\")\\n\",\n    \"es = es.add_dataframe(\\n\",\n    \"    dataframe_name=\\\"transactions\\\",\\n\",\n    \"    dataframe=transactions_df,\\n\",\n    \"    index=\\\"transaction_id\\\",\\n\",\n    \"    time_index=\\\"transaction_time\\\",\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"es = es.add_dataframe(\\n\",\n    \"    dataframe_name=\\\"products\\\", dataframe=products_df, index=\\\"product_id\\\"\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"es = es.add_relationship(\\\"products\\\", \\\"product_id\\\", \\\"transactions\\\", \\\"product_id\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Before we normalize to create a new DataFrame, let's look at the base DataFrame\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es[\\\"transactions\\\"].head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Notice the columns 
`session_id`, `session_start`, `join_date`, `device`, `customer_id`, and `zip_code`.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es = es.normalize_dataframe(\\n\",\n    \"    base_dataframe_name=\\\"transactions\\\",\\n\",\n    \"    new_dataframe_name=\\\"sessions\\\",\\n\",\n    \"    index=\\\"session_id\\\",\\n\",\n    \"    make_time_index=\\\"session_start\\\",\\n\",\n    \"    additional_columns=[\\\"join_date\\\"],\\n\",\n    \"    copy_columns=[\\\"device\\\", \\\"customer_id\\\", \\\"zip_code\\\", \\\"session_start\\\"],\\n\",\n    \")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Above, we normalized the columns to create a new DataFrame. \\n\",\n    \"\\n\",\n    \"- For `additional_columns`, the column `['join_date']` is removed from the `transactions` DataFrame and moved to the new `sessions` DataFrame. \\n\",\n    \"\\n\",\n    \"- For `copy_columns`, the columns `['device', 'customer_id', 'zip_code', 'session_start']` are copied from the `transactions` DataFrame to the new `sessions` DataFrame. \\n\",\n    \"\\n\",\n    \"Let's see this in the actual `EntitySet`.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es[\\\"transactions\\\"].head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Notice above how `['device', 'customer_id', 'zip_code', 'session_start']` are still in the `transactions` DataFrame, while `['join_date']` is not. 
However, all of these columns are now present in the `sessions` DataFrame, as seen below.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es[\\\"sessions\\\"].head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Why did my columns get new semantic tags?\\n\",\n    \"\\n\",\n    \"While creating your `EntitySet`, you might wonder why the semantic tags on your columns change.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"data = ft.demo.load_mock_customer()\\n\",\n    \"transactions_df = data[\\\"transactions\\\"].merge(data[\\\"sessions\\\"]).merge(data[\\\"customers\\\"])\\n\",\n    \"products_df = data[\\\"products\\\"]\\n\",\n    \"\\n\",\n    \"es = ft.EntitySet(id=\\\"customer_data\\\")\\n\",\n    \"es = es.add_dataframe(\\n\",\n    \"    dataframe_name=\\\"transactions\\\",\\n\",\n    \"    dataframe=transactions_df,\\n\",\n    \"    index=\\\"transaction_id\\\",\\n\",\n    \"    time_index=\\\"transaction_time\\\",\\n\",\n    \")\\n\",\n    \"es.plot()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"If a column contains semantic tags, they will appear on the right side of a semicolon in the plot above. 
Notice how `session_id` and `session_start` do not have any semantic tags currently associated with them.\\n\",\n    \"\\n\",\n    \"Now, let's normalize the transactions DataFrame to create a new DataFrame.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es = es.normalize_dataframe(\\n\",\n    \"    base_dataframe_name=\\\"transactions\\\",\\n\",\n    \"    new_dataframe_name=\\\"sessions\\\",\\n\",\n    \"    index=\\\"session_id\\\",\\n\",\n    \"    make_time_index=\\\"session_start\\\",\\n\",\n    \"    additional_columns=[\\\"session_start\\\"],\\n\",\n    \")\\n\",\n    \"es.plot()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The `session_id` now has the semantic tag `foreign_key` in the `transactions` DataFrame, and `index` in the new DataFrame, `sessions`. This is the case because when we normalize the DataFrame, we create a new relationship between `transactions` and `sessions`. There is a one-to-many relationship between the parent DataFrame, `sessions`, and child DataFrame, `transactions`.\\n\",\n    \"\\n\",\n    \"Therefore, `session_id` has the semantic tag `foreign_key` in `transactions` because it represents an `index` in another DataFrame. There would be a similar effect if we added another DataFrame using `add_dataframe` and `add_relationship`. \\n\",\n    \"\\n\",\n    \"In addition, when we created the new DataFrame, we set `session_start` as the `time_index`. This added the semantic tag `time_index` to the `session_start` column in the new `sessions` DataFrame because it now represents a `time_index`.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### How do I update a column's description or metadata?\\n\",\n    \"\\n\",\n    \"You can directly update the description or metadata attributes of the column schema. 
However, you must specifically use the column schema returned by `DataFrame.ww.columns['col_name']`, **not** `DataFrame.ww['col_name'].ww.schema`. The column schema from `DataFrame.ww.columns['col_name']` is still associated with the EntitySet and propagates any attribute updates, whereas the other does not. As an example, this is how you can update a column's description or metadata:\\n\",\n    \"\\n\",\n    \"```python\\n\",\n    \"column_schema = df.ww.columns['col_name']\\n\",\n    \"column_schema.description = 'my description'\\n\",\n    \"column_schema.metadata.update(key='value')\\n\",\n    \"```\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### How do I combine two or more interesting values?\\n\",\n    \"\\n\",\n    \"You might want to create features that are conditioned on multiple values before they are calculated. This would require the use of `interesting_values`. However, since we are trying to create the feature with multiple conditions, we will need to modify the Dataframe before we create the `EntitySet`.\\n\",\n    \"\\n\",\n    \"Let's look at how you might accomplish this. 
\\n\",\n    \"\\n\",\n    \"First, let's create our Dataframes.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"data = ft.demo.load_mock_customer()\\n\",\n    \"transactions_df = data[\\\"transactions\\\"].merge(data[\\\"sessions\\\"]).merge(data[\\\"customers\\\"])\\n\",\n    \"products_df = data[\\\"products\\\"]\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"transactions_df.head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"products_df.head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, let's modify our `transactions` Dataframe to create the additional column that represents multiple conditions for our feature.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"transactions_df[\\\"product_id_device\\\"] = (\\n\",\n    \"    transactions_df[\\\"product_id\\\"].astype(str) + \\\" and \\\" + transactions_df[\\\"device\\\"]\\n\",\n    \")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Here, we created a new column called `product_id_device`, which just combines the `product_id` column, and the `device` column.\\n\",\n    \"\\n\",\n    \"Now let's create our `EntitySet`.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es = ft.EntitySet(id=\\\"customer_data\\\")\\n\",\n    \"es = es.add_dataframe(\\n\",\n    \"    dataframe_name=\\\"transactions\\\",\\n\",\n    \"    dataframe=transactions_df,\\n\",\n    \"    index=\\\"transaction_id\\\",\\n\",\n    \"    
time_index=\\\"transaction_time\\\",\\n\",\n    \"    logical_types={\\n\",\n    \"        \\\"product_id\\\": ww.logical_types.Categorical,\\n\",\n    \"        \\\"product_id_device\\\": ww.logical_types.Categorical,\\n\",\n    \"        \\\"zip_code\\\": ww.logical_types.PostalCode,\\n\",\n    \"    },\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"es = es.add_dataframe(\\n\",\n    \"    dataframe_name=\\\"products\\\", dataframe=products_df, index=\\\"product_id\\\"\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"es = es.normalize_dataframe(\\n\",\n    \"    base_dataframe_name=\\\"transactions\\\",\\n\",\n    \"    new_dataframe_name=\\\"sessions\\\",\\n\",\n    \"    index=\\\"session_id\\\",\\n\",\n    \"    additional_columns=[\\\"device\\\", \\\"product_id_device\\\", \\\"customer_id\\\"],\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"es = es.normalize_dataframe(\\n\",\n    \"    base_dataframe_name=\\\"sessions\\\", new_dataframe_name=\\\"customers\\\", index=\\\"customer_id\\\"\\n\",\n    \")\\n\",\n    \"es\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we are ready to add our interesting values. \\n\",\n    \"\\n\",\n    \"First, let's view our options for what the interesting values could be.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"interesting_values = transactions_df[\\\"product_id_device\\\"].unique().tolist()\\n\",\n    \"interesting_values\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"If you wanted to, you could pick a subset of these, and the `where` features created would only use those conditions. In our example, we will use all the possible interesting values.\\n\",\n    \"\\n\",\n    \"Here, we set all of these values as our interesting values for this specific DataFrame and column. 
If we wanted to, we could make interesting values in the same way for more than one column, but we will just stick with this one for this example.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"values = {\\\"product_id_device\\\": interesting_values}\\n\",\n    \"es.add_interesting_values(dataframe_name=\\\"sessions\\\", values=values)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now we can run DFS.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_matrix, feature_defs = ft.dfs(\\n\",\n    \"    entityset=es,\\n\",\n    \"    target_dataframe_name=\\\"customers\\\",\\n\",\n    \"    agg_primitives=[\\\"count\\\"],\\n\",\n    \"    where_primitives=[\\\"count\\\"],\\n\",\n    \"    trans_primitives=[],\\n\",\n    \")\\n\",\n    \"feature_matrix.head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"To better understand the `where` clause features, let's examine one of those features. \\n\",\n    \"The feature `COUNT(sessions WHERE product_id_device = 5 and tablet)`, tells us how many sessions the customer purchased `product_id` 5 while on a tablet. 
Notice how the feature depends on multiple conditions **(product_id = 5 & device = tablet)**.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_matrix[[\\\"COUNT(sessions WHERE product_id_device = 5 and tablet)\\\"]]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## DFS\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Why is DFS not creating aggregation features?\\n\",\n    \"You may have created your `EntitySet`, and then applied DFS to create features. However, you may be puzzled as to why no aggregation features were created. \\n\",\n    \"\\n\",\n    \"- **This is most likely because you have a single DataFrame in your EntitySet, and DFS is not capable of creating aggregation features with fewer than 2 DataFrames. Featuretools looks for a relationship, and aggregates based on that relationship.**\\n\",\n    \"\\n\",\n    \"Let's look at a simple example.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"data = ft.demo.load_mock_customer()\\n\",\n    \"transactions_df = data[\\\"transactions\\\"].merge(data[\\\"sessions\\\"]).merge(data[\\\"customers\\\"])\\n\",\n    \"\\n\",\n    \"es = ft.EntitySet(id=\\\"customer_data\\\")\\n\",\n    \"es = es.add_dataframe(\\n\",\n    \"    dataframe_name=\\\"transactions\\\", dataframe=transactions_df, index=\\\"transaction_id\\\"\\n\",\n    \")\\n\",\n    \"es\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Notice how we only have 1 DataFrame in our `EntitySet`. If we try to create aggregation features on this `EntitySet`, it will not be possible because DFS needs 2 DataFrames to generate aggregation features. 
\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_matrix, feature_defs = ft.dfs(\\n\",\n    \"    entityset=es, target_dataframe_name=\\\"transactions\\\"\\n\",\n    \")\\n\",\n    \"feature_defs\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"None of the above features are aggregation features. To fix this issue, you can add another DataFrame to your `EntitySet`.\\n\",\n    \"\\n\",\n    \"**Solution #1 - You can add new DataFrame if you have additional data.**\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"products_df = data[\\\"products\\\"]\\n\",\n    \"es = es.add_dataframe(\\n\",\n    \"    dataframe_name=\\\"products\\\", dataframe=products_df, index=\\\"product_id\\\"\\n\",\n    \")\\n\",\n    \"es\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Notice how we now have an additional DataFrame in our `EntitySet`, called `products`.\\n\",\n    \"\\n\",\n    \"**Solution #2 - You can normalize an existing DataFrame.**\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es = es.normalize_dataframe(\\n\",\n    \"    base_dataframe_name=\\\"transactions\\\",\\n\",\n    \"    new_dataframe_name=\\\"sessions\\\",\\n\",\n    \"    index=\\\"session_id\\\",\\n\",\n    \"    make_time_index=\\\"session_start\\\",\\n\",\n    \"    additional_columns=[\\\"device\\\", \\\"customer_id\\\", \\\"zip_code\\\", \\\"join_date\\\"],\\n\",\n    \"    copy_columns=[\\\"session_start\\\"],\\n\",\n    \")\\n\",\n    \"es\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Notice how we now have an additional DataFrame in our `EntitySet`, 
called `sessions`. Here, the normalization created a relationship between `transactions` and `sessions`. However, we could have specified a relationship between `transactions` and `products` if we had only used Solution \\\\#1.\\n\",\n    \"\\n\",\n    \"Now, we can generate aggregation features.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_matrix, feature_defs = ft.dfs(\\n\",\n    \"    entityset=es, target_dataframe_name=\\\"transactions\\\"\\n\",\n    \")\\n\",\n    \"feature_defs[:-10]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"A few of the aggregation features are:\\n\",\n    \"\\n\",\n    \"- `<Feature: sessions.MAX(transactions.amount)>`\\n\",\n    \"- `<Feature: sessions.SKEW(transactions.amount)>`\\n\",\n    \"- `<Feature: sessions.MIN(transactions.amount)>`\\n\",\n    \"- `<Feature: sessions.MEAN(transactions.amount)>`\\n\",\n    \"- `<Feature: sessions.COUNT(transactions)>`\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### How do I speed up the runtime of DFS?\\n\",\n    \"\\n\",\n    \"One issue you may encounter while running `ft.dfs` is slow performance. While Featuretools has generally optimal default settings for calculating features, you may want to speed up performance when you are calculating on a large number of features. 
\\n\",\n    \"\\n\",\n    \"One quick way to speed up performance is by adjusting the `n_jobs` setting of `ft.dfs` or `ft.calculate_feature_matrix`.\\n\",\n    \"\\n\",\n    \"```python\\n\",\n    \"# setting n_jobs to -1 will use all cores\\n\",\n    \"\\n\",\n    \"feature_matrix, feature_defs = ft.dfs(entityset=es,\\n\",\n    \"                                      target_dataframe_name=\\\"customers\\\",\\n\",\n    \"                                      n_jobs=-1)\\n\",\n    \"\\n\",\n    \"# calculate_feature_matrix returns only the feature matrix\\n\",\n    \"feature_matrix = ft.calculate_feature_matrix(entityset=es,\\n\",\n    \"                                             features=feature_defs,\\n\",\n    \"                                             n_jobs=-1)\\n\",\n    \"```\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"**For more ways to speed up performance, please visit:**\\n\",\n    \"\\n\",\n    \"- [Improving Computational Performance](../guides/performance.ipynb#improving-computational-performance)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### How do I include only certain features when running DFS?\\n\",\n    \"\\n\",\n    \"When using DFS to generate features, you may wish to include only certain features. There are multiple ways that you can do this:\\n\",\n    \"\\n\",\n    \"- Use `ignore_columns` to specify columns in a DataFrame that should not be used to create features. 
It is a dictionary mapping DataFrame names to a list of column names to ignore.\\n\",\n    \"\\n\",\n    \"- Use `drop_contains` to drop features that contain any of the strings listed in this parameter.\\n\",\n    \"\\n\",\n    \"- Use `drop_exact` to drop features that exactly match any of the strings listed in this parameter.\\n\",\n    \"\\n\",\n    \"Here is an example of using all three parameters:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es = ft.demo.load_mock_customer(return_entityset=True)\\n\",\n    \"\\n\",\n    \"feature_matrix, feature_defs = ft.dfs(\\n\",\n    \"    entityset=es,\\n\",\n    \"    target_dataframe_name=\\\"customers\\\",\\n\",\n    \"    ignore_columns={\\n\",\n    \"        \\\"transactions\\\": [\\\"amount\\\"],\\n\",\n    \"        \\\"customers\\\": [\\\"age\\\", \\\"gender\\\", \\\"birthday\\\"],\\n\",\n    \"    },  # ignore these columns\\n\",\n    \"    drop_contains=[\\\"customers.SUM(\\\"],  # drop features that contain these strings\\n\",\n    \"    drop_exact=[\\\"STD(transactions.quantity)\\\"],  # drop features that exactly match\\n\",\n    \")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### How do I specify primitives on a per column or per DataFrame basis?\\n\",\n    \"\\n\",\n    \"When using DFS to generate features, you may wish to use only certain columns or DataFrames for specific primitives. This can be done through the `primitive_options` parameter. The `primitive_options` parameter is a dictionary that maps a primitive or a tuple of primitives to a dictionary containing options for the primitive(s). A primitive or tuple of primitives can also be mapped to a list of option dictionaries if the primitive(s) \\n\",\n    \"takes multiple inputs. 
The primitive keys can be the string names of the primitive, the primitive class, or specific instances of the primitive. Each dictionary supplies options for their respective input column. There are multiple ways to control how primitives get applied through these options:\\n\",\n    \"\\n\",\n    \"- Use `ignore_dataframes` to specify DataFrames that should not be used to create features for that primitive. It is a list of DataFrame names to ignore.\\n\",\n    \"\\n\",\n    \"- Use `include_dataframes` to specify the only DataFrames to be included to create features for that primitive. It is a list of DataFrame names to include.\\n\",\n    \"\\n\",\n    \"- Use `ignore_columns` to specify columns in a DataFrame that should not be used to create features for that primitive. It is a dictionary mapping a DataFrame name to a list of column names to ignore.\\n\",\n    \"\\n\",\n    \"- Use `include_columns` to specify the only columns in a DataFrame that should be used to create features for that primitive. It is a dictionary mapping a DataFrame name to a list of column names to include.\\n\",\n    \"\\n\",\n    \"You can also use `primitive_options` to specify which DataFrames or columns you wish to use as groupbys for groupby transformation primitives:\\n\",\n    \"\\n\",\n    \"- Use `ignore_groupby_dataframes` to specify DataFrames that should not be used to get groupbys for that primitive. It is a list of DataFrame names to ignore.\\n\",\n    \"\\n\",\n    \"- Use `include_groupby_dataframes` to specify the only DataFrames that should be used to get groupbys for that primitive. It is a list of DataFrame names to include.\\n\",\n    \"\\n\",\n    \"- Use `ignore_groupby_columns` to specify columns in a DataFrame that should not be used as groupbys for that primitive. 
It is a dictionary mapping a DataFrame name to a list of column names to ignore.\\n\",\n    \"\\n\",\n    \"- Use `include_groupby_columns` to specify the only columns in a DataFrame that should be used as groupbys for that primitive. It is a dictionary mapping a DataFrame name to a list of column names to include.\\n\",\n    \"\\n\",\n    \"Here is an example of using some of these options:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es = ft.demo.load_mock_customer(return_entityset=True)\\n\",\n    \"\\n\",\n    \"feature_matrix, feature_defs = ft.dfs(\\n\",\n    \"    entityset=es,\\n\",\n    \"    target_dataframe_name=\\\"customers\\\",\\n\",\n    \"    primitive_options={\\n\",\n    \"        \\\"mode\\\": {\\n\",\n    \"            \\\"ignore_dataframes\\\": [\\\"sessions\\\"],\\n\",\n    \"            \\\"ignore_columns\\\": {\\\"products\\\": [\\\"brand\\\"], \\\"transactions\\\": [\\\"product_id\\\"]},\\n\",\n    \"        },\\n\",\n    \"        # For mode, ignore the \\\"sessions\\\" DataFrame, the \\\"brand\\\" column in the\\n\",\n    \"        # \\\"products\\\" DataFrame, and the \\\"product_id\\\" column in the \\\"transactions\\\" DataFrame\\n\",\n    \"        (\\\"count\\\", \\\"mean\\\"): {\\\"include_dataframes\\\": [\\\"sessions\\\", \\\"transactions\\\"]},\\n\",\n    \"        # For count and mean, only include the DataFrames \\\"sessions\\\" and \\\"transactions\\\"\\n\",\n    \"    },\\n\",\n    \")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Note that if options are given for a specific instance of a primitive and for the primitive generally (either by string name or class), the instances with their own options will not use the generic options. 
For example, in this case:\\n\",\n    \"```python\\n\",\n    \"from featuretools.primitives import Mean\\n\",\n    \"\\n\",\n    \"special_mean = Mean()\\n\",\n    \"options = {\\n\",\n    \"    special_mean: {'include_dataframes': ['customers']},\\n\",\n    \"    'mean': {'include_dataframes': ['sessions']}\\n\",\n    \"}\\n\",\n    \"```\\n\",\n    \"the primitive `special_mean` will not use the DataFrame `sessions` because its options have it only include `customers`. Every other instance of the `Mean` primitive will use the `'mean'` options.  \\n\",\n    \"\\n\",\n    \"**For more examples of specifying options for DFS, please visit:**\\n\",\n    \"\\n\",\n    \"- [Specifying Primitive Options](../guides/specifying_primitive_options.rst)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### If I didn't specify the **cutoff_time**, what date will be used for the feature calculations?\\n\",\n    \"\\n\",\n    \"The cutoff time will be set to the current time using `cutoff_time = datetime.now()`.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### How do I select a certain amount of past data when calculating features?\\n\",\n    \"\\n\",\n    \"You may encounter a situation when you wish to make predictions using only a certain amount of historical data. You can accomplish this using the `training_window` parameter in `ft.dfs`. 
When you use the `training_window`, Featuretools will use the historical data between the `cutoff_time` and `cutoff_time - training_window`.\\n\",\n    \"\\n\",\n    \"In order to make the calculation, Featuretools will check the time in the `time_index` column of the `target_dataframe`.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es = ft.demo.load_mock_customer(return_entityset=True)\\n\",\n    \"es[\\\"customers\\\"].ww.time_index\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Our target_dataframe has a `time_index`, which is needed for the `training_window` calculation. Here, we are creating a cutoff time DataFrame so that we can have a unique training window for each customer.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"cutoff_times = pd.DataFrame()\\n\",\n    \"cutoff_times[\\\"customer_id\\\"] = [1, 2, 3, 1]\\n\",\n    \"cutoff_times[\\\"time\\\"] = pd.to_datetime(\\n\",\n    \"    [\\\"2014-1-1 04:00\\\", \\\"2014-1-1 05:00\\\", \\\"2014-1-1 06:00\\\", \\\"2014-1-1 08:00\\\"]\\n\",\n    \")\\n\",\n    \"cutoff_times[\\\"label\\\"] = [True, True, False, True]\\n\",\n    \"\\n\",\n    \"feature_matrix, feature_defs = ft.dfs(\\n\",\n    \"    entityset=es,\\n\",\n    \"    target_dataframe_name=\\\"customers\\\",\\n\",\n    \"    cutoff_time=cutoff_times,\\n\",\n    \"    cutoff_time_in_index=True,\\n\",\n    \"    training_window=\\\"1 hour\\\",\\n\",\n    \")\\n\",\n    \"feature_matrix.head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Above, we ran DFS with `training_window` argument of `1 hour` to create features that only used customer data collected in the last hour (from the cutoff time we provided).\"\n   ]\n  },\n  {\n   \"cell_type\": 
\"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Can I run DFS on a single table? \\n\",\n    \"\\n\",\n    \"Although possible, running DFS on a single table doesn't make full use of DFS's capabilities. For one, DFS will not be able to use any aggregation primitives, which require at least two tables. You will only be able to use transform primitives. This limits the complexity of the features that DFS can generate through feature stacking. Additionally, in certain situations, running single table DFS on data with time columns could risk label leakage. With data split in multiple tables, featuretools can filter data based on the cutoff time instead of assuming data was flattened appropriately, but it can not do this with only a single table. \\n\",\n    \"\\n\",\n    \"If you only have a single table of data, DFS can certainly still be of use. There are two main ways to pass in a single table to DFS. \\n\",\n    \"\\n\",\n    \"The first is to simply create an EntitySet with one table. 
\\n\",\n    \"\\n\",\n    \"For example:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"transactions_df = ft.demo.load_mock_customer(return_single_table=True)\\n\",\n    \"\\n\",\n    \"es = ft.EntitySet(id=\\\"customer_data\\\")\\n\",\n    \"es = es.add_dataframe(\\n\",\n    \"    dataframe_name=\\\"transactions\\\",\\n\",\n    \"    dataframe=transactions_df,\\n\",\n    \"    index=\\\"transaction_id\\\",\\n\",\n    \"    time_index=\\\"transaction_time\\\",\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"feature_matrix, feature_defs = ft.dfs(\\n\",\n    \"    entityset=es,\\n\",\n    \"    target_dataframe_name=\\\"transactions\\\",\\n\",\n    \"    trans_primitives=[\\n\",\n    \"        \\\"time_since\\\",\\n\",\n    \"        \\\"day\\\",\\n\",\n    \"        \\\"is_weekend\\\",\\n\",\n    \"        \\\"cum_min\\\",\\n\",\n    \"        \\\"minute\\\",\\n\",\n    \"        \\\"weekday\\\",\\n\",\n    \"        \\\"percentile\\\",\\n\",\n    \"        \\\"year\\\",\\n\",\n    \"        \\\"week\\\",\\n\",\n    \"        \\\"cum_mean\\\",\\n\",\n    \"    ],\\n\",\n    \")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The second way is to insert the dataframe into a dictionary mapping its name to a tuple containing specific dataframe information. We then pass in that dictionary to the `dataframes` argument in DFS.\\n\",\n    \"\\n\",\n    \"In this scenario, for the value in our dictionary, we pass in a tuple containing the dataframe, its index column, and its time index. 
More information about the possible parameters can be found in the [DFS documentation](https://featuretools.alteryx.com/en/stable/generated/featuretools.dfs.html#featuretools.dfs).\\n\",\n    \"\\n\",\n    \"For example: \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"transactions_df = ft.demo.load_mock_customer(return_single_table=True)\\n\",\n    \"\\n\",\n    \"dataframes = {\\\"transactions\\\": (transactions_df, \\\"transaction_id\\\", \\\"transaction_time\\\")}\\n\",\n    \"\\n\",\n    \"feature_matrix, feature_defs = ft.dfs(\\n\",\n    \"    dataframes=dataframes,\\n\",\n    \"    target_dataframe_name=\\\"transactions\\\",\\n\",\n    \"    trans_primitives=[\\n\",\n    \"        \\\"time_since\\\",\\n\",\n    \"        \\\"day\\\",\\n\",\n    \"        \\\"is_weekend\\\",\\n\",\n    \"        \\\"cum_min\\\",\\n\",\n    \"        \\\"minute\\\",\\n\",\n    \"        \\\"weekday\\\",\\n\",\n    \"        \\\"percentile\\\",\\n\",\n    \"        \\\"year\\\",\\n\",\n    \"        \\\"week\\\",\\n\",\n    \"        \\\"cum_mean\\\",\\n\",\n    \"    ],\\n\",\n    \")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Before we examine the output, let's look at our original single table.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"transactions_df.head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now we can look at the transformations that Featuretools was able to apply to this single DataFrame to create the feature matrix.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_matrix.head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n  
 \"source\": [\n    \"### How do I prevent label leakage with DFS?\\n\",\n    \"\\n\",\n    \"One concern you might have with using DFS is about label leakage. You want to make sure that labels in your data aren't used incorrectly to create features and the feature matrix.\\n\",\n    \"\\n\",\n    \"**Featuretools is particularly focused on helping users avoid label leakage.**\\n\",\n    \"\\n\",\n    \"There are two ways to prevent label leakage depending on if your data has timestamps or not.\\n\",\n    \"\\n\",\n    \"#### 1. Data without timestamps\\n\",\n    \"In the case where you do not have timestamps, you can create one `EntitySet` using only the training data and then run `ft.dfs`. This will create a feature matrix using only the training data, but also return a list of feature definitions. Next, you can create an `EntitySet` using the test data and recalculate the same features by calling `ft.calculate_feature_matrix` with the list of feature definitions from before. \\n\",\n    \"\\n\",\n    \"Here is what that flow would look like:\\n\",\n    \"\\n\",\n    \"First, let's create our training data.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"train_data = pd.DataFrame(\\n\",\n    \"    {\\n\",\n    \"        \\\"customer_id\\\": [1, 2, 3, 4, 5],\\n\",\n    \"        \\\"age\\\": [40, 50, 10, 20, 30],\\n\",\n    \"        \\\"gender\\\": [\\\"m\\\", \\\"f\\\", \\\"m\\\", \\\"f\\\", \\\"f\\\"],\\n\",\n    \"        \\\"signup_date\\\": pd.date_range(\\\"2014-01-01 01:41:50\\\", periods=5, freq=\\\"25min\\\"),\\n\",\n    \"        \\\"labels\\\": [True, False, True, False, True],\\n\",\n    \"    }\\n\",\n    \")\\n\",\n    \"train_data.head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we can create an entityset for our training data.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   
\"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es_train_data = ft.EntitySet(id=\\\"customer_train_data\\\")\\n\",\n    \"es_train_data = es_train_data.add_dataframe(\\n\",\n    \"    dataframe_name=\\\"customers\\\", dataframe=train_data, index=\\\"customer_id\\\"\\n\",\n    \")\\n\",\n    \"es_train_data\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Next, we are ready to create our features, and feature matrix for the training data.  We don't want Featuretools to use the labels column to build new features, so we will use the ``ignore_columns`` option to exclude it.  This would also remove the labels column from the feature matrix, so we will tell DFS to include it as a seed feature.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"labels_feature = ft.Feature(es_train_data[\\\"customers\\\"].ww[\\\"labels\\\"])\\n\",\n    \"feature_matrix_train, feature_defs = ft.dfs(\\n\",\n    \"    entityset=es_train_data,\\n\",\n    \"    target_dataframe_name=\\\"customers\\\",\\n\",\n    \"    ignore_columns={\\\"customers\\\": [\\\"labels\\\"]},\\n\",\n    \"    seed_features=[labels_feature],\\n\",\n    \")\\n\",\n    \"feature_matrix_train\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We will also encode our feature matrix to make machine learning compatible features. 
\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_matrix_train_enc, features_enc = ft.encode_features(\\n\",\n    \"    feature_matrix_train, feature_defs\\n\",\n    \")\\n\",\n    \"feature_matrix_train_enc.head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Notice how the whole feature matrix only includes numeric and boolean values now.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now we can use the feature definitions to calculate our feature matrix for the test data, and avoid label leakage.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"test_train = pd.DataFrame(\\n\",\n    \"    {\\n\",\n    \"        \\\"customer_id\\\": [6, 7, 8, 9, 10],\\n\",\n    \"        \\\"age\\\": [20, 25, 55, 22, 35],\\n\",\n    \"        \\\"gender\\\": [\\\"f\\\", \\\"m\\\", \\\"m\\\", \\\"m\\\", \\\"m\\\"],\\n\",\n    \"        \\\"signup_date\\\": pd.date_range(\\\"2014-01-01 01:41:50\\\", periods=5, freq=\\\"25min\\\"),\\n\",\n    \"        \\\"labels\\\": [True, False, False, True, True],\\n\",\n    \"    }\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"es_test_data = ft.EntitySet(id=\\\"customer_test_data\\\")\\n\",\n    \"es_test_data = es_test_data.add_dataframe(\\n\",\n    \"    dataframe_name=\\\"customers\\\",\\n\",\n    \"    dataframe=test_train,\\n\",\n    \"    index=\\\"customer_id\\\",\\n\",\n    \"    time_index=\\\"signup_date\\\",\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"# Use the feature definitions from earlier\\n\",\n    \"feature_matrix_enc_test = ft.calculate_feature_matrix(\\n\",\n    \"    features=features_enc, entityset=es_test_data\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"feature_matrix_enc_test.head()\"\n   ]\n  },\n  {\n   \"cell_type\": 
\"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Check out the [Modeling](frequently_asked_questions.ipynb#Modeling) section for an example of using the encoded matrix with sklearn.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### 2. Data with timestamps\\n\",\n    \"\\n\",\n    \"If your data has timestamps, the best way to prevent label leakage is to use a list of **cutoff times**, which specify the last point in time data is allowed to be used for each row in the resulting feature matrix. To use **cutoff times**, you need to set a time index for each time sensitive DataFrame in your entity set.\\n\",\n    \"\\n\",\n    \"> **Tip: Even if your data doesn’t have time stamps, you could add a column with dummy timestamps that can be used by Featuretools as time index.**\\n\",\n    \"\\n\",\n    \"When you call `ft.dfs`, you can provide a DataFrame of cutoff times like this:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"cutoff_times = pd.DataFrame(\\n\",\n    \"    {\\n\",\n    \"        \\\"customer_id\\\": [1, 2, 3, 4, 5],\\n\",\n    \"        \\\"time\\\": pd.date_range(\\\"2014-01-01 01:41:50\\\", periods=5, freq=\\\"25min\\\"),\\n\",\n    \"    }\\n\",\n    \")\\n\",\n    \"cutoff_times.head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"train_test_data = pd.DataFrame(\\n\",\n    \"    {\\n\",\n    \"        \\\"customer_id\\\": [1, 2, 3, 4, 5],\\n\",\n    \"        \\\"age\\\": [20, 25, 55, 22, 35],\\n\",\n    \"        \\\"gender\\\": [\\\"f\\\", \\\"m\\\", \\\"m\\\", \\\"m\\\", \\\"m\\\"],\\n\",\n    \"        \\\"signup_date\\\": pd.date_range(\\\"2010-01-01 01:41:50\\\", periods=5, freq=\\\"25min\\\"),\\n\",\n    \"    }\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"es_train_test_data = 
ft.EntitySet(id=\\\"customer_train_test_data\\\")\\n\",\n    \"es_train_test_data = es_train_test_data.add_dataframe(\\n\",\n    \"    dataframe_name=\\\"customers\\\",\\n\",\n    \"    dataframe=train_test_data,\\n\",\n    \"    index=\\\"customer_id\\\",\\n\",\n    \"    time_index=\\\"signup_date\\\",\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"feature_matrix_train_test, features = ft.dfs(\\n\",\n    \"    entityset=es_train_test_data,\\n\",\n    \"    target_dataframe_name=\\\"customers\\\",\\n\",\n    \"    cutoff_time=cutoff_times,\\n\",\n    \"    cutoff_time_in_index=True,\\n\",\n    \")\\n\",\n    \"feature_matrix_train_test.head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Above, we have created a feature matrix that uses cutoff times to avoid label leakage. We could also encode this feature matrix using `ft.encode_features`.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### What is the difference between passing a primitive object versus a string to DFS?  \\n\",\n    \"\\n\",\n    \"There are 2 ways to pass primitives to DFS: the primitive object, or a string of the primitive name. 
\\n\",\n    \"\\n\",\n    \"We will use the Transform primitive called `TimeSincePrevious` to illustrate the differences.\\n\",\n    \"\\n\",\n    \"First, let's use the string of primitive name.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es = ft.demo.load_mock_customer(return_entityset=True)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_matrix, feature_defs = ft.dfs(\\n\",\n    \"    entityset=es,\\n\",\n    \"    target_dataframe_name=\\\"customers\\\",\\n\",\n    \"    agg_primitives=[],\\n\",\n    \"    trans_primitives=[\\\"time_since_previous\\\"],\\n\",\n    \")\\n\",\n    \"feature_matrix\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, let's use the primitive object.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from featuretools.primitives import TimeSincePrevious\\n\",\n    \"\\n\",\n    \"feature_matrix, feature_defs = ft.dfs(\\n\",\n    \"    entityset=es,\\n\",\n    \"    target_dataframe_name=\\\"customers\\\",\\n\",\n    \"    agg_primitives=[],\\n\",\n    \"    trans_primitives=[TimeSincePrevious],\\n\",\n    \")\\n\",\n    \"feature_matrix\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As we can see above, the feature matrix is the same.\\n\",\n    \"\\n\",\n    \"However, if we need to modify controllable parameters in the primitive, we should use the primitive object. 
\\n\",\n    \"For instance, let's make TimeSincePrevious return units of hours (the default is in seconds).\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from featuretools.primitives import TimeSincePrevious\\n\",\n    \"\\n\",\n    \"time_since_previous_in_hours = TimeSincePrevious(unit=\\\"hours\\\")\\n\",\n    \"\\n\",\n    \"feature_matrix, feature_defs = ft.dfs(\\n\",\n    \"    entityset=es,\\n\",\n    \"    target_dataframe_name=\\\"customers\\\",\\n\",\n    \"    agg_primitives=[],\\n\",\n    \"    trans_primitives=[time_since_previous_in_hours],\\n\",\n    \")\\n\",\n    \"feature_matrix\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Features\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### How can I select features based on some attributes (a specific string, an explicit primitive type, a return type, a given depth)?\\n\",\n    \"\\n\",\n    \"You may wish to select a subset of your features based on some attributes. \\n\",\n    \"\\n\",\n    \"Let's say you wanted to select features that had the string `amount` in its name. 
You can check for this by using the `get_name` method on the feature definitions.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es = ft.demo.load_mock_customer(return_entityset=True)\\n\",\n    \"\\n\",\n    \"feature_defs = ft.dfs(\\n\",\n    \"    entityset=es, target_dataframe_name=\\\"customers\\\", features_only=True\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"features_with_amount = []\\n\",\n    \"for x in feature_defs:\\n\",\n    \"    if \\\"amount\\\" in x.get_name():\\n\",\n    \"        features_with_amount.append(x)\\n\",\n    \"features_with_amount[0:5]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"You might also want to only select features that are aggregation features.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from featuretools import AggregationFeature\\n\",\n    \"\\n\",\n    \"features_only_aggregations = []\\n\",\n    \"for x in feature_defs:\\n\",\n    \"    if isinstance(x, AggregationFeature):\\n\",\n    \"        features_only_aggregations.append(x)\\n\",\n    \"features_only_aggregations[0:5]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Also, you might only want to select features that are calculated at a certain depth. You can do this by using the `get_depth` method. 
\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"features_only_depth_2 = []\\n\",\n    \"for x in feature_defs:\\n\",\n    \"    if x.get_depth() == 2:\\n\",\n    \"        features_only_depth_2.append(x)\\n\",\n    \"features_only_depth_2[0:5]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Finally, you might only want features that return a certain type. You can do this by using the `column_schema` attribute. For more information on working with column schemas, take a look at [Transitioning from Variables to Woodwork](transition_to_ft_v1.0.ipynb).\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"features_only_numeric = []\\n\",\n    \"for x in feature_defs:\\n\",\n    \"    if \\\"numeric\\\" in x.column_schema.semantic_tags:\\n\",\n    \"        features_only_numeric.append(x)\\n\",\n    \"features_only_numeric[0:5]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Once you have your specific feature list, you can use `ft.calculate_feature_matrix` to generate a feature matrix for only those features.\\n\",\n    \"\\n\",\n    \"For our example, let's use only the features that have the string `amount` in their names.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_matrix = ft.calculate_feature_matrix(\\n\",\n    \"    entityset=es, features=features_with_amount\\n\",\n    \")  # change to your specific feature list\\n\",\n    \"feature_matrix.head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Above, notice how all the column names for our feature matrix contain the string `amount`.\"\n   ]\n  },\n  {\n   \"cell_type\": 
\"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### How do I create **where** features?\\n\",\n    \"\\n\",\n    \"Sometimes, you might want to create features that are conditioned on a second value before they are calculated. This extra filter is called a “where clause”. You can create these features using the `interesting_values` of a column.\\n\",\n    \"\\n\",\n    \"If you have categorical columns in your `EntitySet`, you can use `add_interesting_values`. This function will find interesting values for your categorical columns, which can then be used to generate “where” clauses.\\n\",\n    \"\\n\",\n    \"First, let's create our `EntitySet`.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es = ft.demo.load_mock_customer(return_entityset=True)\\n\",\n    \"es\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now we can add the interesting values for the categorical column.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es.add_interesting_values()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now we can run DFS with the `where_primitives` argument to define which primitives to apply with where clauses. In this case, let's use the primitive `count`. 
For this to work, the primitive `count` must be present in both `agg_primitives` and `where_primitives`.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_matrix, feature_defs = ft.dfs(\\n\",\n    \"    entityset=es,\\n\",\n    \"    target_dataframe_name=\\\"customers\\\",\\n\",\n    \"    agg_primitives=[\\\"count\\\"],\\n\",\n    \"    where_primitives=[\\\"count\\\"],\\n\",\n    \"    trans_primitives=[],\\n\",\n    \")\\n\",\n    \"feature_matrix.head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We have now created some useful features. One example of a useful feature is `COUNT(sessions WHERE device = tablet)`. This feature tells us how many sessions a customer completed on a tablet.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_matrix[[\\\"COUNT(sessions WHERE device = tablet)\\\"]]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Primitives\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### What is the difference between the primitive types (Transform, GroupBy Transform, & Aggregation)?\\n\",\n    \"\\n\",\n    \"You might be curious to know the difference between the primitive groups.\\n\",\n    \"Let's review the differences between transform, groupby transform, and aggregation primitives.\\n\",\n    \"\\n\",\n    \"First, let's create a simple `EntitySet`.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import pandas as pd\\n\",\n    \"\\n\",\n    \"import featuretools as ft\\n\",\n    \"\\n\",\n    \"df = pd.DataFrame(\\n\",\n    \"    {\\n\",\n    \"        \\\"id\\\": [1, 2, 3, 4, 5, 
6],\\n\",\n    \"        \\\"time_index\\\": pd.date_range(\\\"1/1/2019\\\", periods=6, freq=\\\"D\\\"),\\n\",\n    \"        \\\"group\\\": [\\\"a\\\", \\\"a\\\", \\\"a\\\", \\\"a\\\", \\\"a\\\", \\\"a\\\"],\\n\",\n    \"        \\\"val\\\": [5, 1, 10, 20, 6, 23],\\n\",\n    \"    }\\n\",\n    \")\\n\",\n    \"es = ft.EntitySet()\\n\",\n    \"es = es.add_dataframe(\\n\",\n    \"    dataframe_name=\\\"observations\\\", dataframe=df, index=\\\"id\\\", time_index=\\\"time_index\\\"\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"es = es.normalize_dataframe(\\n\",\n    \"    base_dataframe_name=\\\"observations\\\", new_dataframe_name=\\\"groups\\\", index=\\\"group\\\"\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"es.plot()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"After calling `normalize_dataframe`, the column \\\"group\\\" has the semantic tag \\\"foreign_key\\\" because it identifies another DataFrame. Alternatively, it could be set using the `semantic_tags` parameter when we first call `es.add_dataframe()`.\\n\",\n    \"\\n\",\n    \"#### Transform Primitive\\n\",\n    \"\\n\",\n    \"The `cum_sum` primitive calculates the running sum of a list of numbers.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from featuretools.primitives import CumSum\\n\",\n    \"\\n\",\n    \"cum_sum = CumSum()\\n\",\n    \"cum_sum([1, 2, 3, 4, 5]).tolist()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"If we apply it using the `trans_primitives` argument, it will be calculated over the entire observations DataFrame like this:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_matrix, feature_defs = ft.dfs(\\n\",\n    \"    target_dataframe_name=\\\"observations\\\",\\n\",\n    \"    
entityset=es,\\n\",\n    \"    agg_primitives=[],\\n\",\n    \"    trans_primitives=[\\\"cum_sum\\\"],\\n\",\n    \"    groupby_trans_primitives=[],\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"feature_matrix\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Groupby Transform Primitive\\n\",\n    \"\\n\",\n    \"If we apply it using `groupby_trans_primitives`, then DFS will first group by any foreign key columns before applying the transform primitive. As a result, we get the cumulative sum by group.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_matrix, feature_defs = ft.dfs(\\n\",\n    \"    target_dataframe_name=\\\"observations\\\",\\n\",\n    \"    entityset=es,\\n\",\n    \"    agg_primitives=[],\\n\",\n    \"    trans_primitives=[],\\n\",\n    \"    groupby_trans_primitives=[\\\"cum_sum\\\"],\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"feature_matrix\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Aggregation Primitive\\n\",\n    \"\\n\",\n    \"Finally, there is also the aggregation primitive \\\"sum\\\". If we use sum, it will calculate the sum for the group at the cutoff time for each row. 
Because we didn't specify a cutoff time, it will use all the data for each group for each row.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_matrix, feature_defs = ft.dfs(\\n\",\n    \"    target_dataframe_name=\\\"observations\\\",\\n\",\n    \"    entityset=es,\\n\",\n    \"    agg_primitives=[\\\"sum\\\"],\\n\",\n    \"    trans_primitives=[],\\n\",\n    \"    cutoff_time_in_index=True,\\n\",\n    \"    groupby_trans_primitives=[],\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"feature_matrix\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"If we set the cutoff time of each row to be the time index, then use sum as an aggregation primitive, the result is the same as cum_sum (though the order is different in the displayed DataFrame).\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"cutoff_time = df[[\\\"id\\\", \\\"time_index\\\"]]\\n\",\n    \"cutoff_time\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_matrix, feature_defs = ft.dfs(\\n\",\n    \"    target_dataframe_name=\\\"observations\\\",\\n\",\n    \"    entityset=es,\\n\",\n    \"    agg_primitives=[\\\"sum\\\"],\\n\",\n    \"    trans_primitives=[],\\n\",\n    \"    groupby_trans_primitives=[],\\n\",\n    \"    cutoff_time_in_index=True,\\n\",\n    \"    cutoff_time=cutoff_time,\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"feature_matrix\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### How do I get a list of all Aggregation and Transform primitives?\\n\",\n    \"\\n\",\n    \"You can call `featuretools.list_primitives()` to get all the primitives in Featuretools. 
It will return a DataFrame with the names, types, and descriptions of the primitives.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df_primitives = ft.list_primitives()\\n\",\n    \"df_primitives.head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df_primitives.tail()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### How do I change the units for a TimeSince primitive?\\n\",\n    \"There are a few primitives in Featuretools that perform time-based calculations. These include `TimeSince`, `TimeSincePrevious`, `TimeSinceLast`, and `TimeSinceFirst`. \\n\",\n    \"\\n\",\n    \"You can change the units from the default seconds to any valid time unit by doing the following:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from featuretools.primitives import (\\n\",\n    \"    TimeSince,\\n\",\n    \"    TimeSinceFirst,\\n\",\n    \"    TimeSinceLast,\\n\",\n    \"    TimeSincePrevious,\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"time_since = TimeSince(unit=\\\"minutes\\\")\\n\",\n    \"time_since_previous = TimeSincePrevious(unit=\\\"hours\\\")\\n\",\n    \"time_since_last = TimeSinceLast(unit=\\\"days\\\")\\n\",\n    \"time_since_first = TimeSinceFirst(unit=\\\"years\\\")\\n\",\n    \"\\n\",\n    \"es = ft.demo.load_mock_customer(return_entityset=True)\\n\",\n    \"\\n\",\n    \"feature_matrix, feature_defs = ft.dfs(\\n\",\n    \"    entityset=es,\\n\",\n    \"    target_dataframe_name=\\\"customers\\\",\\n\",\n    \"    agg_primitives=[time_since_last, time_since_first],\\n\",\n    \"    trans_primitives=[time_since, time_since_previous],\\n\",\n    \")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   
\"source\": [\n    \"Above, we changed the units to the following:\\n\",\n    \"- minutes for `TimeSince`\\n\",\n    \"- hours for `TimeSincePrevious`\\n\",\n    \"- days for `TimeSinceLast`\\n\",\n    \"- years for `TimeSinceFirst`.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Now we can see that our feature matrix contains multiple features where the units for the TimeSince primitives are changed.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_matrix.head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"There are now features where the time unit is different from the default of seconds, such as `TIME_SINCE_LAST(sessions.session_start, unit=days)` and `TIME_SINCE_FIRST(sessions.session_start, unit=years)`.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Modeling\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### How does my train & test data work with Featuretools and sklearn's **train_test_split**?\\n\",\n    \"\\n\",\n    \"You might be wondering how to properly use your train & test data with Featuretools and sklearn's **train_test_split**. 
There are a few things you must do to ensure accuracy with this workflow.\\n\",\n    \"\\n\",\n    \"Let's imagine we have a DataFrame for our train data, with the labels.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"train_data = pd.DataFrame(\\n\",\n    \"    {\\n\",\n    \"        \\\"customer_id\\\": [1, 2, 3, 4, 5],\\n\",\n    \"        \\\"age\\\": [20, 25, 55, 22, 35],\\n\",\n    \"        \\\"gender\\\": [\\\"f\\\", \\\"m\\\", \\\"m\\\", \\\"m\\\", \\\"m\\\"],\\n\",\n    \"        \\\"signup_date\\\": pd.date_range(\\\"2010-01-01 01:41:50\\\", periods=5, freq=\\\"25min\\\"),\\n\",\n    \"        \\\"labels\\\": [False, True, True, False, False],\\n\",\n    \"    }\\n\",\n    \")\\n\",\n    \"train_data.head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now we can create our `EntitySet` for the train data, and create our features. 
To prevent label leakage, we will use cutoff times (see [earlier question](#How-do-I-prevent-label-leakage-with-DFS?)).\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es_train_data = ft.EntitySet(id=\\\"customer_data\\\")\\n\",\n    \"es_train_data = es_train_data.add_dataframe(\\n\",\n    \"    dataframe_name=\\\"customers\\\", dataframe=train_data, index=\\\"customer_id\\\"\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"cutoff_times = pd.DataFrame(\\n\",\n    \"    {\\n\",\n    \"        \\\"customer_id\\\": [1, 2, 3, 4, 5],\\n\",\n    \"        \\\"time\\\": pd.date_range(\\\"2014-01-01 01:41:50\\\", periods=5, freq=\\\"25min\\\"),\\n\",\n    \"    }\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"feature_matrix_train, features = ft.dfs(\\n\",\n    \"    entityset=es_train_data,\\n\",\n    \"    target_dataframe_name=\\\"customers\\\",\\n\",\n    \"    cutoff_time=cutoff_times,\\n\",\n    \"    cutoff_time_in_index=True,\\n\",\n    \")\\n\",\n    \"feature_matrix_train.head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We will also encode our feature matrix to make it compatible with machine learning algorithms.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_matrix_train_enc, feature_enc = ft.encode_features(\\n\",\n    \"    feature_matrix_train, features\\n\",\n    \")\\n\",\n    \"feature_matrix_train_enc.head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.model_selection import train_test_split\\n\",\n    \"\\n\",\n    \"X = feature_matrix_train_enc.drop([\\\"labels\\\"], axis=1)\\n\",\n    \"y = feature_matrix_train_enc[\\\"labels\\\"]\\n\",\n    \"\\n\",\n    \"X_train, X_test, y_train, y_test = 
train_test_split(X, y, test_size=0.25)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now you can use the encoded feature matrix with sklearn's **train_test_split**. This will allow you to train your model, and tune your parameters.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### How are categorical columns encoded when splitting training and testing data?\\n\",\n    \"\\n\",\n    \"You might be wondering what happens when categorical columns are encoded with your training and testing data, for example when the train data has a categorical value that is not present in the testing data. \\n\",\n    \"\\n\",\n    \"Let's explore a simple example to see what happens during the encoding process.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"train_data = pd.DataFrame(\\n\",\n    \"    {\\n\",\n    \"        \\\"customer_id\\\": [1, 2, 3, 4, 5],\\n\",\n    \"        \\\"product_purchased\\\": [\\\"coke zero\\\", \\\"car\\\", \\\"toothpaste\\\", \\\"coke zero\\\", \\\"car\\\"],\\n\",\n    \"    }\\n\",\n    \")\\n\",\n    \"es_train = ft.EntitySet(id=\\\"customer_data\\\")\\n\",\n    \"es_train = es_train.add_dataframe(\\n\",\n    \"    dataframe_name=\\\"customers\\\",\\n\",\n    \"    dataframe=train_data,\\n\",\n    \"    index=\\\"customer_id\\\",\\n\",\n    \"    logical_types={\\\"product_purchased\\\": ww.logical_types.Categorical},\\n\",\n    \")\\n\",\n    \"feature_matrix_train, features = ft.dfs(\\n\",\n    \"    entityset=es_train, target_dataframe_name=\\\"customers\\\"\\n\",\n    \")\\n\",\n    \"feature_matrix_train\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We will use `ft.encode_features` to properly encode the `product_purchased` column.\"\n   ]\n  },\n  {\n   
\"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_matrix_train_encoded, features_encoded = ft.encode_features(\\n\",\n    \"    feature_matrix_train, features\\n\",\n    \")\\n\",\n    \"feature_matrix_train_encoded.head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now let's imagine we have some test data that doesn't have one of the categorical values (**toothpaste**). Also, the test data has a value that wasn't present in the train data (**water**).\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"test_data = pd.DataFrame(\\n\",\n    \"    {\\n\",\n    \"        \\\"customer_id\\\": [6, 7, 8, 9, 10],\\n\",\n    \"        \\\"product_purchased\\\": [\\\"coke zero\\\", \\\"car\\\", \\\"coke zero\\\", \\\"coke zero\\\", \\\"water\\\"],\\n\",\n    \"    }\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"es_test = ft.EntitySet(id=\\\"customer_data\\\")\\n\",\n    \"es_test = es_test.add_dataframe(\\n\",\n    \"    dataframe_name=\\\"customers\\\", dataframe=test_data, index=\\\"customer_id\\\"\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"feature_matrix_test = ft.calculate_feature_matrix(\\n\",\n    \"    entityset=es_test, features=features_encoded\\n\",\n    \")\\n\",\n    \"feature_matrix_test.head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As seen above, we were able to successfully handle the encoding, and deal with the following complications: \\n\",\n    \"- **toothpaste** was present in the training data but not present in the testing data \\n\",\n    \"- **water** was present in the test data but not present in the training data. 
\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Errors & Warnings\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Why am I getting this error 'Index is not unique on dataframe'?\\n\",\n    \"You may be trying to create your `EntitySet`, and run into this error. \\n\",\n    \"```python\\n\",\n    \"IndexError: Index column must be unique\\n\",\n    \"```\\n\",\n    \"**This is because each dataframe in your EntitySet needs a unique index.**\\n\",\n    \"\\n\",\n    \"Let's look at a simple example.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"product_df = pd.DataFrame({\\\"id\\\": [1, 2, 3, 4, 4], \\\"rating\\\": [3.5, 4.0, 4.5, 1.5, 5.0]})\\n\",\n    \"product_df\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Notice how the `id` column has a duplicate index of `4`. 
If you try to add this dataframe to the EntitySet, you will run into the following error.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"```python\\n\",\n    \"es = ft.EntitySet(id=\\\"product_data\\\")\\n\",\n    \"es = es.add_dataframe(dataframe_name=\\\"products\\\",\\n\",\n    \"                      dataframe=product_df,\\n\",\n    \"                      index=\\\"id\\\")\\n\",\n    \"```\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"```\\n\",\n    \"---------------------------------------------------------------------------\\n\",\n    \"IndexError                                Traceback (most recent call last)\\n\",\n    \"<ipython-input-78-854fbaf207f8> in <module>\\n\",\n    \"      1 es = ft.EntitySet(id=\\\"product_data\\\")\\n\",\n    \"----> 2 es = es.add_dataframe(dataframe_name=\\\"products\\\",\\n\",\n    \"      3                       dataframe=product_df,\\n\",\n    \"      4                       index=\\\"id\\\")\\n\",\n    \"\\n\",\n    \"~/Code/featuretools/featuretools/entityset/entityset.py in add_dataframe(self, dataframe, dataframe_name, index, logical_types, semantic_tags, make_index, time_index, secondary_time_index, already_sorted)\\n\",\n    \"    625             index_was_created, index, dataframe = _get_or_create_index(index, make_index, dataframe)\\n\",\n    \"    626 \\n\",\n    \"--> 627             dataframe.ww.init(name=dataframe_name,\\n\",\n    \"    628                               index=index,\\n\",\n    \"    629                               time_index=time_index,\\n\",\n    \"\\n\",\n    \"/usr/local/Caskroom/miniconda/base/envs/featuretools/lib/python3.8/site-packages/woodwork/table_accessor.py in init(self, index, time_index, logical_types, already_sorted, schema, validate, use_standard_tags, **kwargs)\\n\",\n    \"     94         \\\"\\\"\\\"\\n\",\n    \"     95         if validate:\\n\",\n    \"---> 96          
   _validate_accessor_params(self._dataframe, index, time_index, logical_types, schema, use_standard_tags)\\n\",\n    \"     97         if schema is not None:\\n\",\n    \"     98             self._schema = schema\\n\",\n    \"\\n\",\n    \"/usr/local/Caskroom/miniconda/base/envs/featuretools/lib/python3.8/site-packages/woodwork/table_accessor.py in _validate_accessor_params(dataframe, index, time_index, logical_types, schema, use_standard_tags)\\n\",\n    \"    877         # We ignore these parameters if a schema is passed\\n\",\n    \"    878         if index is not None:\\n\",\n    \"--> 879             _check_index(dataframe, index)\\n\",\n    \"    880         if logical_types:\\n\",\n    \"    881             _check_logical_types(dataframe.columns, logical_types)\\n\",\n    \"\\n\",\n    \"/usr/local/Caskroom/miniconda/base/envs/featuretools/lib/python3.8/site-packages/woodwork/table_accessor.py in _check_index(dataframe, index)\\n\",\n    \"    903         # User specifies an index that is in the dataframe but not unique\\n\",\n    \"--> 904         raise IndexError('Index column must be unique')\\n\",\n    \"    905 \\n\",\n    \"    906 \\n\",\n    \"\\n\",\n    \"IndexError: Index column must be unique\\n\",\n    \"```\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"To fix the above error, you can do one of the following solutions:\\n\",\n    \"\\n\",\n    \"**Solution #1 - You can create a unique index on your Dataframe.**\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"product_df = pd.DataFrame({\\\"id\\\": [1, 2, 3, 4, 5], \\\"rating\\\": [3.5, 4.0, 4.5, 1.5, 5.0]})\\n\",\n    \"product_df\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Notice how we now have a unique index column called `id`.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   
\"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es = es.add_dataframe(dataframe_name=\\\"products\\\", dataframe=product_df, index=\\\"id\\\")\\n\",\n    \"es\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As seen above, we can now create our DataFrame for our `EntitySet` without an error by creating a unique index in our Dataframe.\\n\",\n    \"\\n\",\n    \"**Solution #2 - Set make_index to True in your call to add_dataframe to create a new index on that data**\\n\",\n    \"- `make_index` creates a unique index for each row by just looking at what number the row is, in relation to all the other rows.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"product_df = pd.DataFrame({\\\"id\\\": [1, 2, 3, 4, 4], \\\"rating\\\": [3.5, 4.0, 4.5, 1.5, 5.0]})\\n\",\n    \"\\n\",\n    \"es = ft.EntitySet(id=\\\"product_data\\\")\\n\",\n    \"es = es.add_dataframe(\\n\",\n    \"    dataframe_name=\\\"products\\\", dataframe=product_df, index=\\\"product_id\\\", make_index=True\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"es[\\\"products\\\"]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As seen above, we created our dataframe for our `EntitySet` without an error using the `make_index` argument.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Why am I getting the following warning 'Using training_window but last_time_index is not set'?\\n\",\n    \"\\n\",\n    \"If you are using a training window, and you haven't set a `last_time_index` for your dataframe, you will get this warning.\\n\",\n    \"The training window attribute in Featuretools limits the amount of past data that can be used while calculating a particular feature vector.\\n\",\n    \"\\n\",\n    \"You can add the 
`last_time_index` to all dataframes automatically by calling `your_entityset.add_last_time_indexes()` after you create your `EntitySet`. This will remove the warning.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es = ft.demo.load_mock_customer(return_entityset=True)\\n\",\n    \"es.add_last_time_indexes()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now we can run DFS without getting the warning.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"cutoff_times = pd.DataFrame()\\n\",\n    \"cutoff_times[\\\"customer_id\\\"] = [1, 2, 3, 1]\\n\",\n    \"cutoff_times[\\\"time\\\"] = pd.to_datetime(\\n\",\n    \"    [\\\"2014-1-1 04:00\\\", \\\"2014-1-1 05:00\\\", \\\"2014-1-1 06:00\\\", \\\"2014-1-1 08:00\\\"]\\n\",\n    \")\\n\",\n    \"cutoff_times[\\\"label\\\"] = [True, True, False, True]\\n\",\n    \"\\n\",\n    \"feature_matrix, feature_defs = ft.dfs(\\n\",\n    \"    entityset=es,\\n\",\n    \"    target_dataframe_name=\\\"customers\\\",\\n\",\n    \"    cutoff_time=cutoff_times,\\n\",\n    \"    cutoff_time_in_index=True,\\n\",\n    \"    training_window=\\\"1 hour\\\",\\n\",\n    \")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### last_time_index vs. time_index\\n\",\n    \"\\n\",\n    \"- The `time_index` is when the instance was first known.\\n\",\n    \"- The `last_time_index` is when the instance appears for the last time.\\n\",\n    \"- For example, a customer’s session has multiple transactions which can happen at different points in time. If we are trying to count the number of sessions a user has in a given time period, we often want to count all the sessions that had any transaction during the training window. 
To accomplish this, we need to not only know when a session starts (**time_index**), but also when it ends (**last_time_index**). The last time that an instance appears in the data is stored as the `last_time_index` of a dataframe. \\n\",\n    \"- Once the last_time_index has been set, Featuretools will check to see if the last_time_index is after the start of the training window. That, combined with the cutoff time, allows DFS to discover which data is relevant for a given training window.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Why am I getting errors with Featuretools on [Google Colab](https://colab.research.google.com/)?\\n\",\n    \"\\n\",\n    \"[Google Colab](https://colab.research.google.com/), by default, has Featuretools `0.4.1` installed. You may run into issues following our newest guides or latest documentation while using an older version of Featuretools. Therefore, we suggest you upgrade to the latest Featuretools version by running the following in your notebook in Google Colab:\\n\",\n    \"```shell\\n\",\n    \"!pip install -U featuretools\\n\",\n    \"```\\n\",\n    \"\\n\",\n    \"You may need to restart the runtime by selecting **Runtime** -> **Restart Runtime**.\\n\",\n    \"You can check the installed Featuretools version by running the following:\\n\",\n    \"```python\\n\",\n    \"import featuretools as ft\\n\",\n    \"print(ft.__version__)\\n\",\n    \"```\\n\",\n    \"You should see a version greater than `0.4.1`.\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"file_extension\": \".py\",\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": 
\"3.9.2\"\n  },\n  \"mimetype\": \"text/x-python\",\n  \"name\": \"python\",\n  \"npconvert_exporter\": \"python\",\n  \"pygments_lexer\": \"ipython3\",\n  \"version\": 3,\n  \"vscode\": {\n   \"interpreter\": {\n    \"hash\": \"3f6b062a214ec48d1657976024d6bc68979519d14a33afb6ad033fc2e4189514\"\n   }\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "docs/source/resources/help.rst",
    "content": "Help\n====\n\nCouldn't find what you were looking for?\nThe Featuretools community is happy to provide support to users of Featuretools.\n\n\nDiscussion\n----------\n\nConversation happens in the following places:\n\n1.  **General usage questions** are directed to `StackOverflow`_ with the #featuretools tag.\n2.  **Bug reports** are managed on the `GitHub issue\n    tracker`_.\n3.  **Chat** and collaboration within the community occur on `Slack`_. For general usage questions, please post on\n    Stack Overflow where answers are more searchable by other users.\n\n.. _`StackOverflow`: http://stackoverflow.com/questions/tagged/featuretools\n.. _`Github issue tracker`: https://github.com/alteryx/featuretools/issues\n.. _`Slack`: https://join.slack.com/t/alteryx-oss/shared_invite/zt-182tyvuxv-NzIn6eiCEf8TBziuKp0bNA\n\n\nAsking for help\n---------------\nUsers of all levels, including beginners, should feel free to ask questions and\nreport bugs when using Featuretools. You can get better answers if you follow a\nfew simple guidelines:\n\n1.  **Use the right resource**: We suggest using GitHub or StackOverflow.\n    Questions asked at these locations will be more searchable for other users.\n\n    - Slack should be used for community discussion and collaboration.\n    - For general questions on how something should work or tips, use StackOverflow.\n    - Bugs should be reported on GitHub.\n\n2.  **Ask in one place only**: Please post your question in one place\n    (StackOverflow or GitHub).\n\n3.  **Use examples**: Make `minimal, complete, verifiable examples\n    <https://stackoverflow.com/help/mcve>`_. You will get\n    much better answers if you provide code that people can use to reproduce\n    your problem.\n"
  },
  {
    "path": "docs/source/resources/resources_index.rst",
    "content": "Resources\n---------\n\nFrequently asked questions and additional resources\n\n.. toctree::\n   :maxdepth: 1\n\n   transition_to_ft_v1.0\n   frequently_asked_questions\n   help\n   usage_tips/limitations\n   usage_tips/glossary\n   ecosystem\n"
  },
  {
    "path": "docs/source/resources/transition_to_ft_v1.0.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"6004844f\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Transitioning to Featuretools Version 1.0\\n\",\n    \"\\n\",\n    \"Featuretools version 1.0 incorporates many significant changes that impact the way EntitySets are created, how primitives are defined, and in some cases the resulting feature matrix that is created. This document will provide an overview of the significant changes, helping existing Featuretools users transition to version 1.0.\\n\",\n    \"\\n\",\n    \"## Background and Introduction\\n\",\n    \"\\n\",\n    \"### Why make these changes?\\n\",\n    \"The lack of a unified type system across libraries makes sharing information between libraries more difficult. This problem led to the development of [Woodwork](https://woodwork.alteryx.com/en/stable/). Updating Featuretools to use Woodwork for managing column typing information enables easy sharing of feature matrix column types with other libraries without costly conversions between custom type systems. As an example, [EvalML](https://evalml.alteryx.com/en/stable/), which has also adopted Woodwork, can now use Woodwork typing information on a feature matrix directly to create machine learning models, without first inferring or redefining column types.\\n\",\n    \"\\n\",\n    \"Other benefits of using Woodwork for managing typing in Featuretools include:\\n\",\n    \"\\n\",\n    \"- Simplified code - custom type management code has been removed\\n\",\n    \"- Seamless integration of new types and improvements to type integration as Woodwork improves\\n\",\n    \"- Easy and flexible storage of additional information about columns. 
For example, we can now store whether a feature was engineered by Featuretools or present in the original data.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"4a9bfede\",\n   \"metadata\": {},\n   \"source\": [\n    \"### What has changed?\\n\",\n    \"- The legacy Featuretools custom typing system has been replaced with Woodwork for managing column types\\n\",\n    \"- Both the `Entity` and `Variable` classes have been removed from Featuretools\\n\",\n    \"- Several key Featuretools methods have been moved or updated\\n\",\n    \"\\n\",\n    \"#### Comparison between legacy typing system and Woodwork typing systems\\n\",\n    \"| Featuretools < 1.0 | Featuretools 1.0 | Description |\\n\",\n    \"| ---- | ---- | ---- |\\n\",\n    \"| Entity | Woodwork DataFrame | stores typing information for all columns |\\n\",\n    \"| Variable | ColumnSchema | stores typing information for a single column |\\n\",\n    \"| Variable subclass | LogicalType and semantic_tags | elements used to define a column type |\\n\",\n    \"\\n\",\n    \"#### Summary of significant method changes\\n\",\n    \"\\n\",\n    \"The table below outlines the most significant changes that have occurred. 
In some cases, the method arguments have also changed, and those changes are outlined in more detail throughout this document.\\n\",\n    \"\\n\",\n    \"| Older Versions | Featuretools 1.0 |\\n\",\n    \"| ---- | ---- |\\n\",\n    \"| EntitySet.entity_from_dataframe | EntitySet.add_dataframe |\\n\",\n    \"| EntitySet.normalize_entity | EntitySet.normalize_dataframe |\\n\",\n    \"| EntitySet.update_data | EntitySet.replace_dataframe |\\n\",\n    \"| Entity.variable_types | es['dataframe_name'].ww |\\n\",\n    \"| es['entity_id']['variable_name'] | es['dataframe_name'].ww.columns['column_name'] |\\n\",\n    \"| Entity.convert_variable_type | es['dataframe_name'].ww.set_types |\\n\",\n    \"| Entity.add_interesting_values | es.add_interesting_values(dataframe_name='df_name', ...) |\\n\",\n    \"| Entity.set_secondary_time_index | es.set_secondary_time_index(dataframe_name='df_name', ...) |\\n\",\n    \"| Feature(es['entity_id']['variable_name']) | Feature(es['dataframe_name'].ww['column_name']) |\\n\",\n    \"| dfs(target_entity='entity_id', ...) | dfs(target_dataframe_name='dataframe_name', ...) |\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"c3b1e217\",\n   \"metadata\": {},\n   \"source\": [\n    \"For more information on how Woodwork manages typing information, refer to the [Woodwork Understanding Types and Tags](https://woodwork.alteryx.com/en/stable/guides/logical_types_and_semantic_tags.html) guide.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"a8453248\",\n   \"metadata\": {},\n   \"source\": [\n    \"### What do these changes mean for users?\\n\",\n    \"Removing these classes required moving several methods from the `Entity` to the `EntitySet` object. This change also impacts the way relationships, features and primitives are defined, requiring different parameters than were previously required. 
Also, because the Woodwork typing system is not identical to the old Featuretools typing system, in some cases the feature matrix that is returned can be slightly different as a result of columns being identified as different types.\\n\",\n    \"\\n\",\n    \"All of these changes, and more, will be reviewed in detail throughout this document, providing examples of both the old and new API where possible.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"de402e3b\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Removal of `Entity` Class and Updates to `EntitySet`\\n\",\n    \"\\n\",\n    \"In previous versions of Featuretools an EntitySet was created by adding multiple entities and then defining relationships between variables (columns) in different entities. Starting in Featuretools version 1.0, EntitySets are now created by adding multiple dataframes and defining relationships between columns in the dataframes. While conceptually similar, there are some minor differences in the process.\\n\",\n    \"\\n\",\n    \"### Adding dataframes to an EntitySet\\n\",\n    \"\\n\",\n    \"When adding dataframes to an EntitySet, users can pass in a Woodwork dataframe or a regular dataframe without Woodwork typing information. If users supply a dataframe that has Woodwork typing information initialized, Featuretools will simply use this typing information directly. If users supply a dataframe without Woodwork initialized, Featuretools will initialize Woodwork on the dataframe, performing type inference for any column that does not have typing information specified.\\n\",\n    \"\\n\",\n    \"Below are some examples to illustrate this process. 
First we will create two small dataframes to use for the example.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"5bea1bd4\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import pandas as pd\\n\",\n    \"\\n\",\n    \"import featuretools as ft\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"b094ca23\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"orders_df = pd.DataFrame(\\n\",\n    \"    {\\\"order_id\\\": [0, 1, 2], \\\"order_date\\\": [\\\"2021-01-02\\\", \\\"2021-01-03\\\", \\\"2021-01-04\\\"]}\\n\",\n    \")\\n\",\n    \"items_df = pd.DataFrame(\\n\",\n    \"    {\\n\",\n    \"        \\\"id\\\": [0, 1, 2, 3, 4],\\n\",\n    \"        \\\"order_id\\\": [0, 1, 1, 2, 2],\\n\",\n    \"        \\\"item_price\\\": [29.95, 4.99, 10.25, 20.50, 15.99],\\n\",\n    \"        \\\"on_sale\\\": [False, True, False, True, False],\\n\",\n    \"    }\\n\",\n    \")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"db705814\",\n   \"metadata\": {},\n   \"source\": [\n    \"With older versions of Featuretools, users would first create an EntitySet object, and then add dataframes to the EntitySet, by calling `entity_from_dataframe` as shown below.\\n\",\n    \"\\n\",\n    \"```python\\n\",\n    \"es = ft.EntitySet('old_es')\\n\",\n    \"\\n\",\n    \"es.entity_from_dataframe(dataframe=orders_df,\\n\",\n    \"                         entity_id='orders',\\n\",\n    \"                         index='order_id',\\n\",\n    \"                         time_index='order_date')\\n\",\n    \"es.entity_from_dataframe(dataframe=items_df,\\n\",\n    \"                         entity_id='items',\\n\",\n    \"                         index='id')\\n\",\n    \"```\\n\",\n    \"\\n\",\n    \"```\\n\",\n    \"Entityset: old_es\\n\",\n    \"  Entities:\\n\",\n    \"    orders [Rows: 3, Columns: 2]\\n\",\n    \"    items [Rows: 5, Columns: 3]\\n\",\n  
  \"  Relationships:\\n\",\n    \"    No relationships\\n\",\n    \"```\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"f6f95f35\",\n   \"metadata\": {},\n   \"source\": [\n    \"With Featuretools 1.0, the steps for adding a dataframe to an EntitySet are the same, but some of the details have changed. First, create an EntitySet as before. To add the dataframe call `EntitySet.add_dataframe` in place of the previous `EntitySet.entity_from_dataframe` call. Note that the name of the dataframe is specified in the `dataframe_name` argument, which was previously called `entity_id`.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"b1fdffe4\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es = ft.EntitySet(\\\"new_es\\\")\\n\",\n    \"\\n\",\n    \"es.add_dataframe(\\n\",\n    \"    dataframe=orders_df,\\n\",\n    \"    dataframe_name=\\\"orders\\\",\\n\",\n    \"    index=\\\"order_id\\\",\\n\",\n    \"    time_index=\\\"order_date\\\",\\n\",\n    \")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"1c983744\",\n   \"metadata\": {},\n   \"source\": [\n    \"You can also define the name, index, and time index by first [initializing Woodwork](https://woodwork.alteryx.com/en/stable/generated/woodwork.table_accessor.WoodworkTableAccessor.init.html#woodwork.table_accessor.WoodworkTableAccessor.init) on the dataframe and then passing the Woodwork initialized dataframe directly to the `add_dataframe` call. 
For this example we will initialize Woodwork on `items_df`, setting the dataframe name as `items` and specifying that the index should be the `id` column.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"0d5ad8e5\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"items_df.ww.init(name=\\\"items\\\", index=\\\"id\\\")\\n\",\n    \"items_df.ww\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"07f5f27c\",\n   \"metadata\": {},\n   \"source\": [\n    \"With Woodwork initialized, we no longer need to specify values for the `dataframe_name` or `index` arguments when calling `add_dataframe` as Featuretools will simply use the values that were already specified when Woodwork was initialized.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"5f4ab39a\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es.add_dataframe(dataframe=items_df)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"93814387\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Accessing column typing information\\n\",\n    \"\\n\",\n    \"Previously, column variable type information could be accessed for an entire Entity through `Entity.variable_types` or for an individual column by selecting the individual column first through `es['entity_id']['col_id']`.\\n\",\n    \"\\n\",\n    \"```python\\n\",\n    \"es['items'].variable_types\\n\",\n    \"```\\n\",\n    \"```\\n\",\n    \"{'id': featuretools.variable_types.variable.Index,\\n\",\n    \" 'order_id': featuretools.variable_types.variable.Numeric,\\n\",\n    \" 'item_price': featuretools.variable_types.variable.Numeric}\\n\",\n    \"```\\n\",\n    \"```python\\n\",\n    \"es['items']['item_price']\\n\",\n    \"```\\n\",\n    \"```\\n\",\n    \"<Variable: item_price (dtype = numeric)>\\n\",\n    \"```\\n\",\n    \"\\n\",\n    \"With the updated version of Featuretools, the logical types and 
semantic tags for all of the columns in a single dataframe can be viewed through the `.ww` namespace on the dataframe. First, select the dataframe from the EntitySet with `es['dataframe_name']` and then access the typing information by chaining a `.ww` call on the end as shown below.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"6abb9b10\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es[\\\"items\\\"].ww\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"72775903\",\n   \"metadata\": {},\n   \"source\": [\n    \"The logical type and semantic tags for a single column can be obtained from the Woodwork columns dictionary stored on the dataframe, returning a `Woodwork.ColumnSchema` object that stores the typing information:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"da516642\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es[\\\"items\\\"].ww.columns[\\\"item_price\\\"]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"50f9f70a\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Type inference and updating column types\\n\",\n    \"\\n\",\n    \"Featuretools will attempt to infer types for any columns that do not have types defined by the user. Prior to version 1.0, Featuretools implemented custom type inference code to determine what variable type should be assigned to each column. You could see the inferred variable types by viewing the contents of the `Entity.variable_types` dictionary.\\n\",\n    \"\\n\",\n    \"Starting in Featuretools 1.0, column type inference is being handled by Woodwork. Any columns that do not have a logical type assigned by the user when adding a dataframe to an EntitySet will have their logical types inferred by Woodwork. 
As before, type inference can be skipped for any columns in a dataframe by passing the appropriate logical types in a dictionary when calling `EntitySet.add_dataframe`.\\n\",\n    \"\\n\",\n    \"As an example, we can create a new dataframe and add it to an EntitySet, specifying the logical type for the user's full name as the Woodwork `PersonFullName` logical type.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"a34016b5\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"users_df = pd.DataFrame(\\n\",\n    \"    {\\\"id\\\": [0, 1, 2], \\\"name\\\": [\\\"John Doe\\\", \\\"Rita Book\\\", \\\"Teri Dactyl\\\"]}\\n\",\n    \")\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"d999e022\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es.add_dataframe(\\n\",\n    \"    dataframe=users_df,\\n\",\n    \"    dataframe_name=\\\"users\\\",\\n\",\n    \"    index=\\\"id\\\",\\n\",\n    \"    logical_types={\\\"name\\\": \\\"PersonFullName\\\"},\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"es[\\\"users\\\"].ww\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"d2eff5e1\",\n   \"metadata\": {},\n   \"source\": [\n    \"Looking at the typing information above, we can see that the logical type for the `name` column was set to `PersonFullName` as we specified.\\n\",\n    \"\\n\",\n    \"Situations will occur where type inference identifies a column as having the incorrect logical type. In these situations, the logical type can be updated using the Woodwork `set_types` method. Let's say we want the `order_id` column of the `items` dataframe to have a `Categorical` logical type instead of the `Integer` type that was inferred. 
Previously, this would have been accomplished through the `Entity.convert_variable_type` method.\\n\",\n    \"\\n\",\n    \"```python\\n\",\n    \"from featuretools.variable_types import Categorical\\n\",\n    \"\\n\",\n    \"es['items'].convert_variable_type(variable_id='order_id', new_type=Categorical)\\n\",\n    \"```\\n\",\n    \"\\n\",\n    \"Now, we can perform this same update using Woodwork:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"a6c095b5\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es[\\\"items\\\"].ww.set_types(logical_types={\\\"order_id\\\": \\\"Categorical\\\"})\\n\",\n    \"es[\\\"items\\\"].ww\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"d9d84e08\",\n   \"metadata\": {},\n   \"source\": [\n    \"For additional information on Woodwork typing and how it is used in Featuretools, refer to [Woodwork Typing in Featuretools](../getting_started/woodwork_types.ipynb).\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"bf3dfea2\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Adding interesting values\\n\",\n    \"\\n\",\n    \"Interesting values can be added to all dataframes in an EntitySet, a single dataframe in an EntitySet, or to a single column of a dataframe in an EntitySet.\\n\",\n    \"\\n\",\n    \"To add interesting values for all of the dataframes in an EntitySet, simply call `EntitySet.add_interesting_values`, optionally specifying the maximum number of values to add for each column. This remains unchanged from older versions of Featuretools to the 1.0 release.\\n\",\n    \"\\n\",\n    \"Adding values for a single dataframe or for a single column has changed. 
Previously to add interesting values for an Entity, users would call `Entity.add_interesting_values()`:\\n\",\n    \"```python\\n\",\n    \"es['items'].add_interesting_values()\\n\",\n    \"```\\n\",\n    \"\\n\",\n    \"Now, in order to specify interesting values for a single dataframe, you call `add_interesting_values` on the EntitySet, and pass the name of the dataframe for which you want interesting values added:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"c058d2ed\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es.add_interesting_values(dataframe_name=\\\"items\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"c3e0a247\",\n   \"metadata\": {},\n   \"source\": [\n    \"Previously, to manually add interesting values for a column, you would simply assign them to the attribute of the variable:\\n\",\n    \"\\n\",\n    \"```python\\n\",\n    \"es['items']['order_id'].interesting_values = [1, 2]\\n\",\n    \"```\\n\",\n    \"\\n\",\n    \"Now, this is done through `EntitySet.add_interesting_values`, passing in the name of the dataframe and a dictionary mapping column names to the interesting values to assign for that column. For example, to assign the interesting values of `[1, 2]` to the `order_id` column of the `items` dataframe, use the following approach:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"8276114b\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es.add_interesting_values(dataframe_name=\\\"items\\\", values={\\\"order_id\\\": [1, 2]})\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"22e70b84\",\n   \"metadata\": {},\n   \"source\": [\n    \"Interesting values for multiple columns in the same dataframe can be assigned by adding more entries to the dictionary passed to the `values` parameter.\\n\",\n    \"\\n\",\n    \"Accessing interesting values has changed as well. 
Previously interesting values could be viewed from the variable:\\n\",\n    \"```python\\n\",\n    \"es['items']['order_id'].interesting_values\\n\",\n    \"```\\n\",\n    \"\\n\",\n    \"Interesting values are now stored in the Woodwork metadata for the columns in a dataframe:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"8461c4f7\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es[\\\"items\\\"].ww.columns[\\\"order_id\\\"].metadata[\\\"interesting_values\\\"]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"cb23501f\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Setting a secondary time index\\n\",\n    \"\\n\",\n    \"In earlier versions of Featuretools, a secondary time index could be set on an Entity by calling `Entity.set_secondary_time_index`. \\n\",\n    \"```python\\n\",\n    \"es_flight = ft.demo.load_flight(nrows=100)\\n\",\n    \"\\n\",\n    \"arr_time_columns = ['arr_delay', 'dep_delay', 'carrier_delay', 'weather_delay',\\n\",\n    \"                    'national_airspace_delay', 'security_delay',\\n\",\n    \"                    'late_aircraft_delay', 'canceled', 'diverted',\\n\",\n    \"                    'taxi_in', 'taxi_out', 'air_time', 'dep_time']\\n\",\n    \"es_flight['trip_logs'].set_secondary_time_index({'arr_time': arr_time_columns})\\n\",\n    \"```\\n\",\n    \"\\n\",\n    \"Since the `Entity` class has been removed in Featuretools 1.0, this now needs to be done through the `EntitySet` instead:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"b80b1f6a\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es_flight = ft.demo.load_flight(nrows=100)\\n\",\n    \"\\n\",\n    \"arr_time_columns = [\\n\",\n    \"    \\\"arr_delay\\\",\\n\",\n    \"    \\\"dep_delay\\\",\\n\",\n    \"    \\\"carrier_delay\\\",\\n\",\n    \"    \\\"weather_delay\\\",\\n\",\n    \"    
\\\"national_airspace_delay\\\",\\n\",\n    \"    \\\"security_delay\\\",\\n\",\n    \"    \\\"late_aircraft_delay\\\",\\n\",\n    \"    \\\"canceled\\\",\\n\",\n    \"    \\\"diverted\\\",\\n\",\n    \"    \\\"taxi_in\\\",\\n\",\n    \"    \\\"taxi_out\\\",\\n\",\n    \"    \\\"air_time\\\",\\n\",\n    \"    \\\"dep_time\\\",\\n\",\n    \"]\\n\",\n    \"es_flight.set_secondary_time_index(\\n\",\n    \"    dataframe_name=\\\"trip_logs\\\", secondary_time_index={\\\"arr_time\\\": arr_time_columns}\\n\",\n    \")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"2ebee2e6\",\n   \"metadata\": {},\n   \"source\": [\n    \"Previously, the secondary time index could be accessed directly from the Entity with `es_flight['trip_logs'].secondary_time_index`. Starting in Featuretools 1.0 the secondary time index and the associated columns are stored in the Woodwork dataframe metadata and can be accessed as shown below.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"3ea95fdb\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es_flight[\\\"trip_logs\\\"].ww.metadata[\\\"secondary_time_index\\\"]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"f2f9b64c\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Normalizing Entities/DataFrames\\n\",\n    \"\\n\",\n    \"`EntitySet.normalize_entity` has been renamed to `EntitySet.normalize_dataframe` in Featuretools 1.0. The new method works in the same way as the old method, but some of the parameters have been renamed. The table below shows the old and new names for reference. 
When calling this method, the new parameter names need to be used.\\n\",\n    \"\\n\",\n    \"| Old Parameter Name | New Parameter Name |\\n\",\n    \"| --- | --- |\\n\",\n    \"| base_entity_id | base_dataframe_name |\\n\",\n    \"| new_entity_id | new_dataframe_name |\\n\",\n    \"| additional_variables | additional_columns |\\n\",\n    \"| copy_variables | copy_columns |\\n\",\n    \"| new_entity_time_index | new_dataframe_time_index |\\n\",\n    \"| new_entity_secondary_time_index | new_dataframe_secondary_time_index |\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"ca81708b\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Defining and adding relationships\\n\",\n    \"\\n\",\n    \"In earlier versions of Featuretools, relationships were defined by creating a `Relationship` object, which took two `Variables` as inputs. To define a relationship between the orders Entity and the items Entity, we would first create a `Relationship` and then add it to the EntitySet:\\n\",\n    \"\\n\",\n    \"```python\\n\",\n    \"relationship = ft.Relationship(es['orders']['order_id'], es['items']['order_id'])\\n\",\n    \"es.add_relationship(relationship)\\n\",\n    \"```\\n\",\n    \"\\n\",\n    \"With Featuretools 1.0, the process is similar, but there are two different ways to add the relationship to the EntitySet. One way is to pass the dataframe and column names to `EntitySet.add_relationship`, and another is to pass a previously created `Relationship` object to the `relationship` keyword argument. 
Both approaches are demonstrated below.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"7d738807\",\n   \"metadata\": {\n    \"nbsphinx\": \"hidden\"\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Undo change from above and change child column logical type to match parent and prevent warning\\n\",\n    \"# NOTE: This cell is hidden in the docs build\\n\",\n    \"es[\\\"items\\\"].ww.set_types(logical_types={\\\"order_id\\\": \\\"Integer\\\"})\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"97c04dd4\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"es.add_relationship(\\n\",\n    \"    parent_dataframe_name=\\\"orders\\\",\\n\",\n    \"    parent_column_name=\\\"order_id\\\",\\n\",\n    \"    child_dataframe_name=\\\"items\\\",\\n\",\n    \"    child_column_name=\\\"order_id\\\",\\n\",\n    \")\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"26643d04\",\n   \"metadata\": {\n    \"nbsphinx\": \"hidden\"\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Reset the relationship so we can add it again\\n\",\n    \"# NOTE: This cell is hidden in the docs build\\n\",\n    \"es.relationships = []\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"317e5657\",\n   \"metadata\": {},\n   \"source\": [\n    \"Alternatively, we can first create a `Relationship` and pass that to `EntitySet.add_relationship`. 
When defining a `Relationship` we need to pass in the EntitySet to which it belongs along with the names for the parent dataframe and parent column and the name of the child dataframe and child column.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"47e54c72\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"relationship = ft.Relationship(\\n\",\n    \"    entityset=es,\\n\",\n    \"    parent_dataframe_name=\\\"orders\\\",\\n\",\n    \"    parent_column_name=\\\"order_id\\\",\\n\",\n    \"    child_dataframe_name=\\\"items\\\",\\n\",\n    \"    child_column_name=\\\"order_id\\\",\\n\",\n    \")\\n\",\n    \"es.add_relationship(relationship=relationship)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"7a49ba91\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Updating data for a dataframe in an EntitySet\\n\",\n    \"\\n\",\n    \"Previously to update (replace) the data associated with an Entity, users could call `Entity.update_data` and pass in the new dataframe. 
As an example, let's update the data in our `users` Entity:\\n\",\n    \"```python\\n\",\n    \"new_users_df = pd.DataFrame({\\n\",\n    \"    'id': [3, 4],\\n\",\n    \"    'name': ['Anne Teak', 'Art Decco']\\n\",\n    \"})\\n\",\n    \"\\n\",\n    \"es['users'].update_data(df=new_users_df)\\n\",\n    \"```\\n\",\n    \"\\n\",\n    \"To accomplish this task with Featuretools 1.0, we will use the `EntitySet.replace_dataframe` method instead:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"b45a81d5\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"new_users_df = pd.DataFrame({\\\"id\\\": [0, 1], \\\"name\\\": [\\\"Anne Teak\\\", \\\"Art Decco\\\"]})\\n\",\n    \"\\n\",\n    \"es.replace_dataframe(dataframe_name=\\\"users\\\", df=new_users_df)\\n\",\n    \"es[\\\"users\\\"]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"679af861\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining features\\n\",\n    \"\\n\",\n    \"The syntax for defining features has changed slightly in Featuretools 1.0. Previously, identity features could be defined simply by passing in the variable that should be used to build the feature.\\n\",\n    \"\\n\",\n    \"```python\\n\",\n    \"feature = ft.Feature(es['items']['item_price'])\\n\",\n    \"```\\n\",\n    \"\\n\",\n    \"Starting with Featuretools 1.0, a similar syntax can be used, but because `es['items']` will now return a Woodwork dataframe instead of an `Entity`, we need to update the syntax slightly to access the Woodwork column. 
To update, simply add `.ww` between the dataframe name selector and the column selector as shown below.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"88902f6b\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature = ft.Feature(es[\\\"items\\\"].ww[\\\"item_price\\\"])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"0faf41e4\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Defining primitives\\n\",\n    \"\\n\",\n    \"In earlier versions of Featuretools, primitive input and return types were defined by specifying the appropriate `Variable` class. Starting in version 1.0, the input and return types are defined by Woodwork `ColumnSchema` objects. \\n\",\n    \"\\n\",\n    \"To illustrate this change, let's look closer at the `Age` transform primitive. This primitive takes a datetime representing a date of birth and returns a numeric value corresponding to a person's age. In previous versions of Featuretools, the input type was defined by specifying the `DateOfBirth` variable type and the return type was specified by the `Numeric` variable type:\\n\",\n    \"\\n\",\n    \"```python\\n\",\n    \"input_types = [DateOfBirth]\\n\",\n    \"return_type = Numeric\\n\",\n    \"```\\n\",\n    \"\\n\",\n    \"Woodwork does not have a specific `DateOfBirth` logical type, but rather identifies a column as a date of birth column by specifying the logical type as `Datetime` with a semantic tag of `date_of_birth`. There is also no `Numeric` logical type in Woodwork, but rather Woodwork identifies all columns that can be used for numeric operations with the semantic tag of `numeric`. Furthermore, we know the `Age` primitive will return a floating point number, which would correspond to a Woodwork logical type of `Double`. 
With these items in mind, we can redefine the `Age` input types and return types with `ColumnSchema` objects as follows:\\n\",\n    \"\\n\",\n    \"```python\\n\",\n    \"input_types = [ColumnSchema(logical_type=Datetime, semantic_tags={'date_of_birth'})]\\n\",\n    \"return_type = ColumnSchema(logical_type=Double, semantic_tags={'numeric'})\\n\",\n    \"```\\n\",\n    \"\\n\",\n    \"Aside from changing the way input and return types are defined, the rest of the process for defining primitives remains unchanged.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"ebcd6d9e\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Mapping from old Featuretools variable types to Woodwork ColumnSchemas\\n\",\n    \"\\n\",\n    \"Types defined by Woodwork differ from the old variable types that were defined by Featuretools prior to version 1.0. While there is not a direct mapping from the old variable types to the new Woodwork types defined by `ColumnSchema` objects, the approximate mapping is shown below.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"| Featuretools Variable | Woodwork Column Schema |\\n\",\n    \"| --- | --- |\\n\",\n    \"| Boolean | ColumnSchema(logical_type=Boolean) or ColumnSchema(logical_type=BooleanNullable) |\\n\",\n    \"| Categorical | ColumnSchema(logical_type=Categorical) |\\n\",\n    \"| CountryCode | ColumnSchema(logical_type=CountryCode) |\\n\",\n    \"| Datetime | ColumnSchema(logical_type=Datetime) |\\n\",\n    \"| DateOfBirth | ColumnSchema(logical_type=Datetime, semantic_tags={'date_of_birth'}) |\\n\",\n    \"| DatetimeTimeIndex | ColumnSchema(logical_type=Datetime, semantic_tags={'time_index'}) |\\n\",\n    \"| Discrete | ColumnSchema(semantic_tags={'category'}) |\\n\",\n    \"| EmailAddress | ColumnSchema(logical_type=EmailAddress) |\\n\",\n    \"| FilePath | ColumnSchema(logical_type=Filepath) |\\n\",\n    \"| FullName | ColumnSchema(logical_type=PersonFullName) |\\n\",\n    \"| Id | 
ColumnSchema(semantic_tags={'foreign_key'}) |\\n\",\n    \"| Index | ColumnSchema(semantic_tags={'index'}) |\\n\",\n    \"| IPAddress | ColumnSchema(logical_type=IPAddress) |\\n\",\n    \"| LatLong | ColumnSchema(logical_type=LatLong) |\\n\",\n    \"| NaturalLanguage | ColumnSchema(logical_type=NaturalLanguage) |\\n\",\n    \"| Numeric | ColumnSchema(semantic_tags={'numeric'}) |\\n\",\n    \"| NumericTimeIndex | ColumnSchema(semantic_tags={'numeric', 'time_index'}) |\\n\",\n    \"| Ordinal | ColumnSchema(logical_type=Ordinal) |\\n\",\n    \"| PhoneNumber | ColumnSchema(logical_type=PhoneNumber) |\\n\",\n    \"| SubRegionCode | ColumnSchema(logical_type=SubRegionCode) |\\n\",\n    \"| Timedelta | ColumnSchema(logical_type=Timedelta) |\\n\",\n    \"| TimeIndex | ColumnSchema(semantic_tags={'time_index'}) |\\n\",\n    \"| URL | ColumnSchema(logical_type=URL) |\\n\",\n    \"| Unknown | ColumnSchema(logical_type=Unknown) |\\n\",\n    \"| ZIPCode | ColumnSchema(logical_type=PostalCode) |\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"fec87370\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Changes to Deep Feature Synthesis and Calculate Feature Matrix\\n\",\n    \"\\n\",\n    \"The argument names for both `featuretools.dfs` and `featuretools.calculate_feature_matrix` have changed slightly in Featuretools 1.0. 
In prior versions, users could generate a list of features using the default primitives and options like this:\\n\",\n    \"\\n\",\n    \"```python\\n\",\n    \"features = ft.dfs(entityset=es,\\n\",\n    \"                  target_entity='items',\\n\",\n    \"                  features_only=True)\\n\",\n    \"```\\n\",\n    \"\\n\",\n    \"In Featuretools 1.0, the `target_entity` argument has been renamed to `target_dataframe_name`, but otherwise this basic call remains the same.\\n\",\n    \"\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"5428949c\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"features = ft.dfs(entityset=es, target_dataframe_name=\\\"items\\\", features_only=True)\\n\",\n    \"features\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"3154734d\",\n   \"metadata\": {},\n   \"source\": [\n    \"In addition, the `dfs` argument `ignore_entities` was renamed to `ignore_dataframes` and `ignore_variables` was renamed to `ignore_columns`. Similarly, if specifying primitive options, all references to `entities` should be replaced with `dataframes` and references to `variables` should be replaced with `columns`. For example, the primitive option of `include_groupby_entities` is now `include_groupby_dataframes` and `include_variables` is now `include_columns`.\\n\",\n    \"\\n\",\n    \"The basic call to `featuretools.calculate_feature_matrix` remains unchanged if passing in an EntitySet along with a list of features to calculate. 
However, users calling `calculate_feature_matrix` by passing in a dictionary of `entities` and a list of `relationships` should note that the `entities` argument has been renamed to `dataframes` and the dictionary values should now include Woodwork logical types instead of Featuretools `Variable` classes.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"456da22e\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_matrix = ft.calculate_feature_matrix(features=features, entityset=es)\\n\",\n    \"feature_matrix\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"b87489cf\",\n   \"metadata\": {},\n   \"source\": [\n    \"In addition to the changes in argument names, there are a couple of other changes to the returned feature matrix that users should be aware of. First, because of slight differences in the way Woodwork defines column types compared to how the prior Featuretools implementation did, there can be some differences in the features that are generated between old and new versions. The most notable impact is in the way foreign key columns are handled. Previously, Featuretools treated all foreign key (previously `Id`) columns as categorical columns, and would generate appropriate features from these columns. Starting in version 1.0, foreign key columns are not constrained to be categorical, and if they are another type such as `Integer`, features will not be generated from these columns. Manually converting foreign key columns to `Categorical` as shown above will result in features much closer to those achieved with previous versions.\\n\",\n    \"\\n\",\n    \"Also, because Woodwork's type inference process differs from the previous Featuretools type inference process, an EntitySet may have column types identified differently. This difference in column types could impact the features that are generated. 
If it is important to have the same set of features, check all of the logical types in the EntitySet dataframes and update them to the expected types if there are columns that have been inferred as unexpected types.\\n\",\n    \"\\n\",\n    \"Finally, the feature matrix calculated by Featuretools will now have Woodwork initialized. This means that users can view feature matrix column typing information through the Woodwork namespace as follows.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"cdb45cc9\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_matrix.ww\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"68910d73\",\n   \"metadata\": {},\n   \"source\": [\n    \"Featuretools now labels features by whether they were originally in the dataframes, or whether they were created by Featuretools. This information is stored in the Woodwork `origin` attribute for the column. Columns that were in the original data will be labeled with `base` and features that were created by Featuretools will be labeled with `engineered`.\\n\",\n    \"\\n\",\n    \"As a demonstration of how to access this information, let's compare two features in the feature matrix: `item_price` and `orders.MEAN(items.item_price)`. 
`item_price` was present in the original data, and `orders.MEAN(items.item_price)` was created by Featuretools.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"f3e143fe\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_matrix.ww[\\\"item_price\\\"].ww.origin\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"12cf8260\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"feature_matrix.ww[\\\"orders.MEAN(items.item_price)\\\"].ww.origin\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"4c429c75\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Other changes\\n\",\n    \"\\n\",\n    \"In addition to the changes outlined above, there are several other smaller changes in Featuretools 1.0 of which existing users should be aware.\\n\",\n    \"\\n\",\n    \"- Column ordering of a dataframe in an EntitySet might be different than it was before. Previously, Featuretools would reorder the columns such that the index column would always be the first column in the dataframe. This behavior has been removed, and the index column is no longer guaranteed to be the first column in the dataframe. Now the index column will remain in the position it was in when the dataframe was added to the EntitySet.\\n\",\n    \"\\n\",\n    \"- For `LatLong` columns, older versions of Featuretools would replace single `nan` values in the columns with a tuple `(nan, nan)`. This is no longer the case, and single `nan` values will now remain in the `LatLong` column. 
Based on the behavior in Woodwork, any values of `(nan, nan)` in a `LatLong` column will be replaced with a single `nan` value.\\n\",\n    \"\\n\",\n    \"- Since Featuretools no longer defines `Variable` objects with relationships between them, the `featuretools.variable_types.graph_variable_types` function has been removed.\\n\",\n    \"\\n\",\n    \"- The `featuretools.variable_types.list_variable_types` utility function has been removed and replaced with two corresponding Woodwork functions: `woodwork.list_logical_types` and `woodwork.list_semantic_tags`. Starting in Featuretools 1.0, the Woodwork utility functions should be used to obtain information on the logical types and semantic tags that can be applied to dataframe columns.\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.9.2\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 5\n}\n"
  },
  {
    "path": "docs/source/resources/usage_tips/glossary.rst",
    "content": ".. _glossary:\n.. currentmodule:: featuretools\n\nGlossary\n========\n\n.. glossary::\n    :sorted:\n\n    feature\n        A transformation of data used for machine learning.  Featuretools has a custom language for defining features as described :ref:`here <primitives>`. All features are represented by subclasses of :class:`FeatureBase`.\n\n    feature engineering\n        The process of transforming data into representations that are better for machine learning.\n\n    cutoff time\n        The last point in time data is allowed to be used when calculating a feature\n\n    EntitySet\n        A collection of dataframes and the relationships between them. Represented by the :class:`.EntitySet` class.\n\n    instance\n        Equivalent to a row in a relational database. Each dataframe has many instances, and each instance has a value for each column and feature defined on the dataframe.\n\n    target dataframe\n        The dataframe for which we will be making features\n\n    parent dataframe\n        A dataframe that is referenced by another dataframe via relationship. The \"one\" in a one-to-many relationship.\n\n    child dataframe\n        A dataframe that references another dataframe via relationship. The \"many\" in a one-to-many relationship.\n\n    relationship\n        A mapping between a parent dataframe and a child dataframe. The child dataframe must contain a column referencing the index column on the parent dataframe. Represented by the :class:`.Relationship` class.\n\n    logical type\n        Additional information about how a column should be interpreted or parsed beyond how the data is stored on disk or in memory. Used to determine which primitives can be applied to a column to generate features.\n\n    semantic tag\n        Optional additional information on the column about the meaning or potential uses of data. 
Used to determine which primitives can be applied to a column to generate features.\n\n    ColumnSchema\n        All of a Woodwork column's type information including the logical type and any semantic tags.\n"
  },
  {
    "path": "docs/source/resources/usage_tips/limitations.rst",
    "content": "Limitations\n-----------\nIn-memory\n*********\n\nFeaturetools is intended to be run on datasets that can fit in memory on one machine. For advice on handing large dataset refer to :ref:`Improving Computational Performance <performance>`.\n\nBring your own labels\n*********************\n\nIf you are doing supervised machine learning, you must supply your own labels and cutoff times. To structure this process, you can use `Compose <https://compose.featurelabs.com>`_, which is an open source project for automatically generating labels with cutoff times.\n"
  },
  {
    "path": "docs/source/set-headers.py",
    "content": "import urllib.request\n\nopener = urllib.request.build_opener()\nopener.addheaders = [(\"Testing\", \"True\")]\nurllib.request.install_opener(opener)\n"
  },
  {
    "path": "docs/source/setup.py",
    "content": "import os\n\nimport featuretools as ft\n\n\ndef load_feature_plots():\n    es = ft.demo.load_mock_customer(return_entityset=True)\n    path = os.path.join(\n        os.path.dirname(os.path.abspath(__file__)),\n        \"getting_started/graphs/\",\n    )\n    agg_feat = ft.AggregationFeature(\n        ft.IdentityFeature(es[\"sessions\"].ww[\"session_id\"]),\n        \"customers\",\n        ft.primitives.Count,\n    )\n    trans_feat = ft.TransformFeature(\n        ft.IdentityFeature(es[\"customers\"].ww[\"join_date\"]),\n        ft.primitives.TimeSincePrevious,\n    )\n    demo_feat = ft.AggregationFeature(\n        ft.TransformFeature(\n            ft.IdentityFeature(es[\"transactions\"].ww[\"transaction_time\"]),\n            ft.primitives.Weekday,\n        ),\n        \"sessions\",\n        ft.primitives.Mode,\n    )\n    ft.graph_feature(agg_feat, to_file=os.path.join(path, \"agg_feat.dot\"))\n    ft.graph_feature(trans_feat, to_file=os.path.join(path, \"trans_feat.dot\"))\n    ft.graph_feature(demo_feat, to_file=os.path.join(path, \"demo_feat.dot\"))\n\n\nif __name__ == \"__main__\":\n    load_feature_plots()\n"
  },
  {
    "path": "docs/source/templates/layout.html",
    "content": "{% extends \"!layout.html\" %}\n\n{%- block extrahead %}\n\n\n{% set image = 'https://alteryx-oss-web-images.s3.amazonaws.com/OpenSource_OpenGraph_1200x630px-featuretools.png' %}\n{% set description = 'Automated feature engineering in Python' %}\n{% if meta is defined %}\n    {% if meta.description is defined %}\n        {% set description = meta.description %}\n    {% endif %}\n{% endif %}\n\n<meta property=\"og:title\" content=\"{{ title|striptags|e }}{{ titlesuffix }}\">\n<meta content=\"{{description}}\" />\n<meta property=\"og:description\" content=\"{{description}}\">\n<meta property=\"og:image\" content=\"{{image}}\">\n<meta property=\"twitter:image\" content=\"{{image}}\">\n<meta name=\"twitter:card\" content=\"summary_large_image\">\n\n\n{% endblock %}\n\n{%- block footer %}\n\n<footer class=\"footer\">\n  <div class=\"footer-container\">\n    <div class=\"footer-cell-1\">\n      <img class=\"footer-image-alteryx\" src=\"{{ pathto('_static/images/alteryx_open_source.svg', 1) }}\" alt=\"Alteryx Open Source\">\n    </div>\n    <div class=\"footer-cell-2\">\n      <a href=\"https://github.com/alteryx/featuretools#readme\" target=\"_blank\">\n        <img  class=\"footer-image-github\" src=\"{{ pathto('_static/images/github.svg', 1) }}\" alt=\"GitHub\">\n      </a>\n      <a href=\"https://twitter.com/AlteryxOSS\" target=\"_blank\">\n        <img  class=\"footer-image-twitter\" src=\"{{ pathto('_static/images/twitter.svg', 1) }}\" alt=\"Twitter\">\n      </a>\n      <a href=\"https://join.slack.com/t/alteryx-oss/shared_invite/zt-182tyvuxv-NzIn6eiCEf8TBziuKp0bNA\" target=\"_blank\">\n        <img  class=\"footer-image-github\" src=\"{{ pathto('_static/images/slack.svg', 1) }}\" alt=\"Slack\">\n      </a>\n      <a href=\"https://stackoverflow.com/questions/tagged/featuretools\" target=\"_blank\">\n        <img  class=\"footer-image-github\" src=\"{{ pathto('_static/images/stackoverflow.svg', 1) }}\" alt=\"Stack Overflow\">\n      </a>\n    
</div>\n    <div class=\"footer-cell-3\">\n      <hr class=\"footer-line\">\n    </div>\n    <div class=\"footer-cell-4\">\n      <img class=\"footer-image-copyright\" src=\"{{ pathto('_static/images/copyright.svg', 1) }}\" alt=\"Copyright\">\n    </div>\n  </div>\n</footer>\n\n{% endblock %}\n"
  },
  {
    "path": "featuretools/__init__.py",
    "content": "# flake8: noqa\nfrom featuretools.version import __version__\nfrom featuretools.config_init import config\nfrom featuretools.entityset.api import *\nfrom featuretools import primitives\nfrom featuretools.synthesis.api import *\nfrom featuretools.primitives import list_primitives, summarize_primitives\nfrom featuretools.computational_backends.api import *\nfrom featuretools import tests\nfrom featuretools.utils.recommend_primitives import get_recommended_primitives\nfrom featuretools.utils.time_utils import *\nfrom featuretools.utils.utils_info import show_info\nimport featuretools.demo\nfrom featuretools import feature_base\nfrom featuretools import selection\nfrom featuretools.feature_base import (\n    AggregationFeature,\n    DirectFeature,\n    Feature,\n    FeatureBase,\n    GroupByTransformFeature,\n    IdentityFeature,\n    TransformFeature,\n    graph_feature,\n    describe_feature,\n    save_features,\n    load_features,\n)\n\nimport logging\nimport pkg_resources\nimport sys\nimport traceback\nimport warnings\nfrom woodwork import list_logical_types, list_semantic_tags\n\nlogger = logging.getLogger(\"featuretools\")\n\n# Call functions registered by other libraries when featuretools is imported\nfor entry_point in pkg_resources.iter_entry_points(\"featuretools_initialize\"):\n    try:\n        method = entry_point.load()\n        if callable(method):\n            method()\n    except Exception:\n        pass\nfor entry_point in pkg_resources.iter_entry_points(\"alteryx_open_src_initialize\"):\n    try:\n        method = entry_point.load()\n        if callable(method):\n            method(\"featuretools\")\n    except Exception:\n        pass\n\n# Load in submodules registered by other libraries into Featuretools namespace\nfor entry_point in pkg_resources.iter_entry_points(\"featuretools_plugin\"):\n    try:\n        sys.modules[\"featuretools.\" + entry_point.name] = entry_point.load()\n    except Exception:\n        message = 
\"Featuretools failed to load plugin {} from library {}. \"\n        message += \"For a full stack trace, set logging to debug.\"\n        logger.warning(message.format(entry_point.name, entry_point.module_name))\n        logger.debug(traceback.format_exc())\n"
  },
  {
    "path": "featuretools/__main__.py",
    "content": ""
  },
  {
    "path": "featuretools/computational_backends/__init__.py",
    "content": "# flake8: noqa\nfrom featuretools.computational_backends.api import *\n"
  },
  {
    "path": "featuretools/computational_backends/api.py",
    "content": "# flake8: noqa\nfrom featuretools.computational_backends.calculate_feature_matrix import (\n    approximate_features,\n    calculate_feature_matrix,\n)\nfrom featuretools.computational_backends.utils import (\n    bin_cutoff_times,\n    create_client_and_cluster,\n    replace_inf_values,\n)\n"
  },
  {
    "path": "featuretools/computational_backends/calculate_feature_matrix.py",
    "content": "import logging\nimport math\nimport os\nimport shutil\nimport time\nimport warnings\nfrom datetime import datetime\n\nimport cloudpickle\nimport numpy as np\nimport pandas as pd\nfrom woodwork.logical_types import (\n    Age,\n    AgeNullable,\n    Boolean,\n    BooleanNullable,\n    Integer,\n    IntegerNullable,\n)\n\nfrom featuretools.computational_backends.feature_set import FeatureSet\nfrom featuretools.computational_backends.feature_set_calculator import (\n    FeatureSetCalculator,\n)\nfrom featuretools.computational_backends.utils import (\n    _check_cutoff_time_type,\n    _validate_cutoff_time,\n    bin_cutoff_times,\n    create_client_and_cluster,\n    gather_approximate_features,\n    gen_empty_approx_features_df,\n    get_ww_types_from_features,\n    save_csv_decorator,\n)\nfrom featuretools.entityset.relationship import RelationshipPath\nfrom featuretools.feature_base import AggregationFeature, FeatureBase\nfrom featuretools.utils import Trie\nfrom featuretools.utils.gen_utils import (\n    import_or_raise,\n    make_tqdm_iterator,\n)\n\nlogger = logging.getLogger(\"featuretools.computational_backend\")\n\nPBAR_FORMAT = \"Elapsed: {elapsed} | Progress: {l_bar}{bar}\"\nFEATURE_CALCULATION_PERCENTAGE = (\n    0.95  # make total 5% higher to allot time for wrapping up at end\n)\n\n\ndef calculate_feature_matrix(\n    features,\n    entityset=None,\n    cutoff_time=None,\n    instance_ids=None,\n    dataframes=None,\n    relationships=None,\n    cutoff_time_in_index=False,\n    training_window=None,\n    approximate=None,\n    save_progress=None,\n    verbose=False,\n    chunk_size=None,\n    n_jobs=1,\n    dask_kwargs=None,\n    progress_callback=None,\n    include_cutoff_time=True,\n):\n    \"\"\"Calculates a matrix for a given set of instance ids and calculation times.\n\n    Args:\n        features (list[:class:`.FeatureBase`]): Feature definitions to be calculated.\n\n        entityset (EntitySet): An already initialized entityset. 
Required if `dataframes` and `relationships`\n            are not provided.\n\n        cutoff_time (pd.DataFrame or Datetime): Specifies times at which to calculate\n            the features for each instance. The resulting feature matrix will use data\n            up to and including the cutoff_time. Can either be a DataFrame or a single\n            value. If a DataFrame is passed, the instance ids for which to calculate features\n            must be in a column with the same name as the target dataframe index or a column\n            named `instance_id`. The cutoff time values in the DataFrame must be in a column with\n            the same name as the target dataframe time index or a column named `time`. If the\n            DataFrame has more than two columns, any additional columns will be added to the\n            resulting feature matrix. If a single value is passed, this value will be used for\n            all instances.\n\n        instance_ids (list): List of instances to calculate features on. Only\n            used if cutoff_time is a single value or None.\n\n        dataframes (dict[str -> tuple(DataFrame, str, str, dict[str -> str/Woodwork.LogicalType], dict[str->str/set], boolean)]):\n            Dictionary of DataFrames. Entries take the format\n            {dataframe name -> (dataframe, index column, time_index, logical_types, semantic_tags, make_index)}.\n            Note that only the dataframe is required. If a Woodwork DataFrame is supplied, any other parameters\n            will be ignored.\n\n        relationships (list[(str, str, str, str)]): list of relationships\n            between dataframes. 
Each list item is a tuple with the format\n            (parent dataframe name, parent column, child dataframe name, child column).\n\n        cutoff_time_in_index (bool): If True, return a DataFrame with a MultiIndex\n            where the second index is the cutoff time (first is instance id).\n            DataFrame will be sorted by (time, instance_id).\n\n        training_window (Timedelta or str, optional):\n            Window defining how much time before the cutoff time data\n            can be used when calculating features. If ``None``, all data before cutoff time is used.\n            Defaults to ``None``.\n\n        approximate (Timedelta or str): Frequency to group instances with similar\n            cutoff times by for features with costly calculations. For example,\n            if the bucket is 24 hours, all instances with cutoff times on the same\n            day will use the same calculation for expensive features.\n\n        verbose (bool, optional): Print progress info. The time granularity is\n            per chunk.\n\n        chunk_size (int or float or None): maximum number of rows of\n            output feature matrix to calculate at a time. If passed an integer\n            greater than 0, will try to use that many rows per chunk. If passed\n            a float value between 0 and 1, sets the chunk size to that\n            percentage of all rows. If None and n_jobs > 1, it will be set to 1/n_jobs.\n\n        n_jobs (int, optional): number of parallel processes to use when\n            calculating feature matrix. Requires Dask if not equal to 1.\n\n        dask_kwargs (dict, optional): Dictionary of keyword arguments to be\n            passed when creating the dask client and scheduler. Even if n_jobs\n            is not set, using `dask_kwargs` will enable multiprocessing.\n            Main parameters:\n\n            cluster (str or dask.distributed.LocalCluster):\n                cluster or address of cluster to send tasks to. 
If unspecified,\n                a cluster will be created.\n            diagnostics port (int):\n                port number to use for web dashboard.  If left unspecified, web\n                interface will not be enabled.\n\n            Valid keyword arguments for LocalCluster will also be accepted.\n\n        save_progress (str, optional): path to save intermediate computational results.\n\n        progress_callback (callable): function to be called with incremental progress updates.\n            Has the following parameters:\n\n                update: percentage change (float between 0 and 100) in progress since last call\n                progress_percent: percentage (float between 0 and 100) of total computation completed\n                time_elapsed: total time in seconds that has elapsed since start of call\n\n        include_cutoff_time (bool): Include data at cutoff times in feature calculations. Defaults to ``True``.\n\n    Returns:\n        pd.DataFrame: The feature matrix.\n    \"\"\"\n    assert (\n        isinstance(features, list)\n        and features != []\n        and all([isinstance(feature, FeatureBase) for feature in features])\n    ), \"features must be a non-empty list of features\"\n\n    # handle loading entityset\n    from featuretools.entityset.entityset import EntitySet\n\n    if not isinstance(entityset, EntitySet):\n        if dataframes is not None:\n            entityset = EntitySet(\"entityset\", dataframes, relationships)\n        else:\n            raise TypeError(\"No dataframes or valid EntitySet provided\")\n\n    target_dataframe = entityset[features[0].dataframe_name]\n\n    cutoff_time = _validate_cutoff_time(cutoff_time, target_dataframe)\n    entityset._check_time_indexes()\n\n    if isinstance(cutoff_time, pd.DataFrame):\n        if instance_ids:\n            msg = \"Passing 'instance_ids' is valid only if 'cutoff_time' is a single value or None - ignoring\"\n            warnings.warn(msg)\n        pass_columns = [\n   
         col for col in cutoff_time.columns if col not in [\"instance_id\", \"time\"]\n        ]\n        # make sure dtype of instance_id in cutoff time\n        # is same as column it references\n        target_dataframe = features[0].dataframe\n        ltype = target_dataframe.ww.logical_types[target_dataframe.ww.index]\n        cutoff_time.ww.init(logical_types={\"instance_id\": ltype})\n    else:\n        pass_columns = []\n        if cutoff_time is None:\n            if entityset.time_type == \"numeric\":\n                cutoff_time = np.inf\n            else:\n                cutoff_time = datetime.now()\n\n        if instance_ids is None:\n            index_col = target_dataframe.ww.index\n            df = entityset._handle_time(\n                dataframe_name=target_dataframe.ww.name,\n                df=target_dataframe,\n                time_last=cutoff_time,\n                training_window=training_window,\n                include_cutoff_time=include_cutoff_time,\n            )\n            instance_ids = df[index_col]\n\n        # convert list or range object into series\n        if not isinstance(instance_ids, pd.Series):\n            instance_ids = pd.Series(instance_ids)\n\n        cutoff_time = (cutoff_time, instance_ids)\n\n    _check_cutoff_time_type(cutoff_time, entityset.time_type)\n\n    # Approximate provides no benefit with a single cutoff time, so ignore it\n    if isinstance(cutoff_time, tuple) and approximate is not None:\n        msg = (\n            \"Using approximate with a single cutoff_time value or no cutoff_time \"\n            \"provides no computational efficiency benefit\"\n        )\n        warnings.warn(msg)\n        cutoff_time = pd.DataFrame(\n            {\n                \"instance_id\": cutoff_time[1],\n                \"time\": [cutoff_time[0]] * len(cutoff_time[1]),\n            },\n        )\n        target_dataframe = features[0].dataframe\n        ltype = 
target_dataframe.ww.logical_types[target_dataframe.ww.index]\n        cutoff_time.ww.init(logical_types={\"instance_id\": ltype})\n\n    feature_set = FeatureSet(features)\n\n    # Get features to approximate\n    if approximate is not None:\n        approximate_feature_trie = gather_approximate_features(feature_set)\n        # Make a new FeatureSet that ignores approximated features\n        feature_set = FeatureSet(\n            features,\n            approximate_feature_trie=approximate_feature_trie,\n        )\n\n    # Check if there are any non-approximated aggregation features\n    no_unapproximated_aggs = True\n    for feature in features:\n        if isinstance(feature, AggregationFeature):\n            # do not need to check if feature is in to_approximate since\n            # only base features of direct features can be in to_approximate\n            no_unapproximated_aggs = False\n            break\n\n        if approximate is not None:\n            all_approx_features = {\n                f for _, feats in feature_set.approximate_feature_trie for f in feats\n            }\n        else:\n            all_approx_features = set()\n        deps = feature.get_dependencies(deep=True, ignored=all_approx_features)\n        for dependency in deps:\n            if isinstance(dependency, AggregationFeature):\n                no_unapproximated_aggs = False\n                break\n\n    cutoff_df_time_col = \"time\"\n    target_time = \"_original_time\"\n\n    if approximate is not None:\n        # If there are approximated aggs, bin times\n        binned_cutoff_time = bin_cutoff_times(cutoff_time, approximate)\n\n        # Think about collisions: what if original time is a feature\n        binned_cutoff_time.ww[target_time] = cutoff_time[cutoff_df_time_col]\n\n        cutoff_time_to_pass = binned_cutoff_time\n\n    else:\n        cutoff_time_to_pass = cutoff_time\n\n    if isinstance(cutoff_time, pd.DataFrame):\n        cutoff_time_len = cutoff_time.shape[0]\n    
else:\n        cutoff_time_len = len(cutoff_time[1])\n\n    chunk_size = _handle_chunk_size(chunk_size, cutoff_time_len)\n    tqdm_options = {\n        \"total\": (cutoff_time_len / FEATURE_CALCULATION_PERCENTAGE),\n        \"bar_format\": PBAR_FORMAT,\n        \"disable\": True,\n    }\n\n    if verbose:\n        tqdm_options.update({\"disable\": False})\n    elif progress_callback is not None:\n        # allows us to utilize progress_bar updates without printing to anywhere\n        tqdm_options.update({\"file\": open(os.devnull, \"w\"), \"disable\": False})\n\n    with make_tqdm_iterator(**tqdm_options) as progress_bar:\n        if n_jobs != 1 or dask_kwargs is not None:\n            feature_matrix = parallel_calculate_chunks(\n                cutoff_time=cutoff_time_to_pass,\n                chunk_size=chunk_size,\n                feature_set=feature_set,\n                approximate=approximate,\n                training_window=training_window,\n                save_progress=save_progress,\n                entityset=entityset,\n                n_jobs=n_jobs,\n                no_unapproximated_aggs=no_unapproximated_aggs,\n                cutoff_df_time_col=cutoff_df_time_col,\n                target_time=target_time,\n                pass_columns=pass_columns,\n                progress_bar=progress_bar,\n                dask_kwargs=dask_kwargs or {},\n                progress_callback=progress_callback,\n                include_cutoff_time=include_cutoff_time,\n            )\n        else:\n            feature_matrix = calculate_chunk(\n                cutoff_time=cutoff_time_to_pass,\n                chunk_size=chunk_size,\n                feature_set=feature_set,\n                approximate=approximate,\n                training_window=training_window,\n                save_progress=save_progress,\n                entityset=entityset,\n                no_unapproximated_aggs=no_unapproximated_aggs,\n                cutoff_df_time_col=cutoff_df_time_col,\n    
            target_time=target_time,\n                pass_columns=pass_columns,\n                progress_bar=progress_bar,\n                progress_callback=progress_callback,\n                include_cutoff_time=include_cutoff_time,\n            )\n\n        # ensure rows are sorted by input order\n        if isinstance(cutoff_time, pd.DataFrame):\n            feature_matrix = feature_matrix.ww.reindex(\n                pd.MultiIndex.from_frame(\n                    cutoff_time[[\"instance_id\", \"time\"]],\n                    names=feature_matrix.index.names,\n                ),\n            )\n        else:\n            # Maintain index dtype\n            index_dtype = feature_matrix.index.get_level_values(0).dtype\n            feature_matrix = feature_matrix.ww.reindex(\n                cutoff_time[1].astype(index_dtype),\n                level=0,\n            )\n        if not cutoff_time_in_index:\n            feature_matrix.ww.reset_index(level=\"time\", drop=True, inplace=True)\n\n        if save_progress and os.path.exists(os.path.join(save_progress, \"temp\")):\n            shutil.rmtree(os.path.join(save_progress, \"temp\"))\n\n        # force to 100% since we saved last 5 percent\n        previous_progress = progress_bar.n\n        progress_bar.update(progress_bar.total - progress_bar.n)\n\n        if progress_callback is not None:\n            (\n                update,\n                progress_percent,\n                time_elapsed,\n            ) = update_progress_callback_parameters(progress_bar, previous_progress)\n            progress_callback(update, progress_percent, time_elapsed)\n\n        progress_bar.refresh()\n\n    return feature_matrix\n\n\ndef calculate_chunk(\n    cutoff_time,\n    chunk_size,\n    feature_set,\n    entityset,\n    approximate,\n    training_window,\n    save_progress,\n    no_unapproximated_aggs,\n    cutoff_df_time_col,\n    target_time,\n    pass_columns,\n    progress_bar=None,\n    progress_callback=None,\n    
include_cutoff_time=True,\n    schema=None,\n):\n    if not isinstance(feature_set, FeatureSet):\n        feature_set = cloudpickle.loads(feature_set)  # pragma: no cover\n\n    feature_matrix = []\n    if no_unapproximated_aggs and approximate is not None:\n        if entityset.time_type == \"numeric\":\n            group_time = np.inf\n        else:\n            group_time = datetime.now()\n\n    if isinstance(cutoff_time, tuple):\n        update_progress_callback = None\n        if progress_bar is not None:\n\n            def update_progress_callback(done):\n                previous_progress = progress_bar.n\n                progress_bar.update(done * len(cutoff_time[1]))\n                if progress_callback is not None:\n                    (\n                        update,\n                        progress_percent,\n                        time_elapsed,\n                    ) = update_progress_callback_parameters(\n                        progress_bar,\n                        previous_progress,\n                    )\n                    progress_callback(update, progress_percent, time_elapsed)\n\n        time_last = cutoff_time[0]\n        ids = cutoff_time[1]\n        calculator = FeatureSetCalculator(\n            entityset,\n            feature_set,\n            time_last,\n            training_window=training_window,\n        )\n        _feature_matrix = calculator.run(\n            ids,\n            progress_callback=update_progress_callback,\n            include_cutoff_time=include_cutoff_time,\n        )\n        time_index = pd.Index([time_last] * len(ids), name=\"time\")\n        _feature_matrix = _feature_matrix.set_index(time_index, append=True)\n        feature_matrix.append(_feature_matrix)\n\n    else:\n        if schema:\n            cutoff_time.ww.init_with_full_schema(schema=schema)  # pragma: no cover\n        for _, group in cutoff_time.groupby(cutoff_df_time_col):\n            # if approximating, calculate the approximate features\n     
       if approximate is not None:\n                group.ww.init(schema=cutoff_time.ww.schema, validate=False)\n                precalculated_features_trie = approximate_features(\n                    feature_set,\n                    group,\n                    window=approximate,\n                    entityset=entityset,\n                    training_window=training_window,\n                    include_cutoff_time=include_cutoff_time,\n                )\n            else:\n                precalculated_features_trie = None\n\n            @save_csv_decorator(save_progress)\n            def calc_results(\n                time_last,\n                ids,\n                precalculated_features=None,\n                training_window=None,\n                include_cutoff_time=True,\n            ):\n                update_progress_callback = None\n\n                if progress_bar is not None:\n\n                    def update_progress_callback(done):\n                        previous_progress = progress_bar.n\n                        progress_bar.update(done * group.shape[0])\n                        if progress_callback is not None:\n                            (\n                                update,\n                                progress_percent,\n                                time_elapsed,\n                            ) = update_progress_callback_parameters(\n                                progress_bar,\n                                previous_progress,\n                            )\n                            progress_callback(update, progress_percent, time_elapsed)\n\n                calculator = FeatureSetCalculator(\n                    entityset,\n                    feature_set,\n                    time_last,\n                    training_window=training_window,\n                    precalculated_features=precalculated_features,\n                )\n                matrix = calculator.run(\n                    ids,\n                    
progress_callback=update_progress_callback,\n                    include_cutoff_time=include_cutoff_time,\n                )\n\n                return matrix\n\n            # if all aggregations have been approximated, can calculate all together\n            if no_unapproximated_aggs and approximate is not None:\n                inner_grouped = [[group_time, group]]\n            else:\n                # if approximated features, set cutoff_time to unbinned time\n                if precalculated_features_trie is not None:\n                    group[cutoff_df_time_col] = group[target_time]\n\n                inner_grouped = group.groupby(cutoff_df_time_col, sort=True)\n\n            if chunk_size is not None:\n                inner_grouped = _chunk_dataframe_groups(inner_grouped, chunk_size)\n\n            for time_last, group in inner_grouped:\n                # sort group by instance id\n                ids = group[\"instance_id\"].sort_values().values\n                if no_unapproximated_aggs and approximate is not None:\n                    window = None\n                else:\n                    window = training_window\n\n                # calculate values for those instances at time time_last\n                _feature_matrix = calc_results(\n                    time_last,\n                    ids,\n                    precalculated_features=precalculated_features_trie,\n                    training_window=window,\n                    include_cutoff_time=include_cutoff_time,\n                )\n\n                id_name = _feature_matrix.index.name\n\n                # if approximate, merge feature matrix with group frame to get original\n                # cutoff times and passed columns\n                if approximate:\n                    cols = [c for c in _feature_matrix.columns if c not in pass_columns]\n                    indexer = group[[\"instance_id\", target_time] + pass_columns]\n                    _feature_matrix = _feature_matrix[cols].merge(\n 
                       indexer,\n                        right_on=[\"instance_id\"],\n                        left_index=True,\n                        how=\"right\",\n                    )\n                    _feature_matrix.set_index(\n                        [\"instance_id\", target_time],\n                        inplace=True,\n                    )\n                    _feature_matrix.index.set_names([id_name, \"time\"], inplace=True)\n                    _feature_matrix.sort_index(level=1, kind=\"mergesort\", inplace=True)\n                else:\n                    # all rows have same cutoff time. set time and add passed columns\n                    num_rows = len(ids)\n                    if len(pass_columns) > 0:\n                        pass_through = group[\n                            [\"instance_id\", cutoff_df_time_col] + pass_columns\n                        ]\n                        pass_through.rename(\n                            columns={\n                                \"instance_id\": id_name,\n                                cutoff_df_time_col: \"time\",\n                            },\n                            inplace=True,\n                        )\n\n                    time_index = pd.Index([time_last] * num_rows, name=\"time\")\n                    _feature_matrix = _feature_matrix.set_index(\n                        time_index,\n                        append=True,\n                    )\n                    if len(pass_columns) > 0:\n                        pass_through.set_index([id_name, \"time\"], inplace=True)\n                        for col in pass_columns:\n                            _feature_matrix[col] = pass_through[col]\n                feature_matrix.append(_feature_matrix)\n\n    ww_init_kwargs = get_ww_types_from_features(\n        feature_set.target_features,\n        entityset,\n        pass_columns,\n        cutoff_time,\n    )\n    feature_matrix = init_ww_and_concat_fm(feature_matrix, ww_init_kwargs)\n    
return feature_matrix\n\n\ndef approximate_features(\n    feature_set,\n    cutoff_time,\n    window,\n    entityset,\n    training_window=None,\n    include_cutoff_time=True,\n):\n    \"\"\"Given a set of features and cutoff_times to be passed to\n    calculate_feature_matrix, calculates approximate values of some features\n    to speed up calculations.  Cutoff times are sorted into\n    window-sized buckets and the approximate feature values are only calculated\n    at one cutoff time for each bucket.\n\n\n    .. note:: this only approximates DirectFeatures of AggregationFeatures, on\n        the target dataframe. In future versions, it may also be possible to\n        approximate these features on other top-level dataframes.\n\n    Args:\n        cutoff_time (pd.DataFrame): specifies what time to calculate\n            the features for each instance at. The resulting feature matrix will use data\n            up to and including the cutoff_time. A DataFrame with\n            'instance_id' and 'time' columns.\n\n        window (Timedelta or str): frequency to group instances with similar\n            cutoff times by for features with costly calculations. For example,\n            if the window is 24 hours, all instances with cutoff times on the same\n            day will use the same calculation for expensive features.\n\n        entityset (:class:`.EntitySet`): An already initialized entityset.\n\n        feature_set (:class:`.FeatureSet`): The features to be calculated.\n\n        training_window (`Timedelta`, optional):\n            Window defining how much older than the cutoff time data\n            can be to be included when calculating the feature. 
If None, all older data is used.\n\n        include_cutoff_time (bool):\n            If True, data at cutoff times are included in feature calculations.\n\n    \"\"\"\n    approx_fms_trie = Trie(path_constructor=RelationshipPath)\n\n    target_time_colname = \"target_time\"\n    cutoff_time.ww[target_time_colname] = cutoff_time[\"time\"]\n    approx_cutoffs = bin_cutoff_times(cutoff_time, window)\n    cutoff_df_time_col = \"time\"\n    cutoff_df_instance_col = \"instance_id\"\n    # should this order be by dependencies so that calculate_feature_matrix\n    # doesn't skip approximating something?\n    for relationship_path, approx_feature_names in feature_set.approximate_feature_trie:\n        if not approx_feature_names:\n            continue\n\n        (\n            cutoffs_with_approx_e_ids,\n            new_approx_dataframe_index_col,\n        ) = _add_approx_dataframe_index_col(\n            entityset,\n            feature_set.target_df_name,\n            approx_cutoffs.copy(),\n            relationship_path,\n        )\n\n        # Select only columns we care about\n        columns_we_want = [\n            new_approx_dataframe_index_col,\n            cutoff_df_time_col,\n            target_time_colname,\n        ]\n\n        cutoffs_with_approx_e_ids = cutoffs_with_approx_e_ids[columns_we_want]\n        cutoffs_with_approx_e_ids = cutoffs_with_approx_e_ids.drop_duplicates()\n        cutoffs_with_approx_e_ids.dropna(\n            subset=[new_approx_dataframe_index_col],\n            inplace=True,\n        )\n\n        approx_features = [\n            feature_set.features_by_name[name] for name in approx_feature_names\n        ]\n        if cutoffs_with_approx_e_ids.empty:\n            approx_fm = gen_empty_approx_features_df(approx_features)\n        else:\n            cutoffs_with_approx_e_ids.sort_values(\n                [cutoff_df_time_col, new_approx_dataframe_index_col],\n                inplace=True,\n            )\n            # CFM assumes specific 
column names for cutoff_time argument\n            rename = {new_approx_dataframe_index_col: cutoff_df_instance_col}\n            cutoff_time_to_pass = cutoffs_with_approx_e_ids.rename(columns=rename)\n            cutoff_time_to_pass = cutoff_time_to_pass[\n                [cutoff_df_instance_col, cutoff_df_time_col]\n            ]\n\n            cutoff_time_to_pass.drop_duplicates(inplace=True)\n            approx_fm = calculate_feature_matrix(\n                approx_features,\n                entityset,\n                cutoff_time=cutoff_time_to_pass,\n                training_window=training_window,\n                approximate=None,\n                cutoff_time_in_index=False,\n                chunk_size=cutoff_time_to_pass.shape[0],\n                include_cutoff_time=include_cutoff_time,\n            )\n\n        approx_fms_trie.get_node(relationship_path).value = approx_fm\n\n    return approx_fms_trie\n\n\ndef scatter_warning(num_scattered_workers, num_workers):\n    if num_scattered_workers != num_workers:\n        scatter_warning = \"EntitySet was only scattered to {} out of {} workers\"\n        logger.warning(scatter_warning.format(num_scattered_workers, num_workers))\n\n\ndef parallel_calculate_chunks(\n    cutoff_time,\n    chunk_size,\n    feature_set,\n    approximate,\n    training_window,\n    save_progress,\n    entityset,\n    n_jobs,\n    no_unapproximated_aggs,\n    cutoff_df_time_col,\n    target_time,\n    pass_columns,\n    progress_bar,\n    dask_kwargs=None,\n    progress_callback=None,\n    include_cutoff_time=True,\n):\n    import_or_raise(\n        \"distributed\",\n        \"Dask must be installed to calculate feature matrix with n_jobs set to anything but 1\",\n    )\n    from dask.base import tokenize\n    from distributed import Future, as_completed\n\n    client = None\n    cluster = None\n    try:\n        client, cluster = create_client_and_cluster(\n            n_jobs=n_jobs,\n            dask_kwargs=dask_kwargs,\n           
 entityset_size=entityset.__sizeof__(),\n        )\n        # scatter the entityset\n        # denote future with leading underscore\n        start = time.time()\n        es_token = \"EntitySet-{}\".format(tokenize(entityset))\n        if es_token in client.list_datasets():\n            msg = \"Using EntitySet persisted on the cluster as dataset {}\"\n            progress_bar.write(msg.format(es_token))\n            _es = client.get_dataset(es_token)\n        else:\n            _es = client.scatter([entityset])[0]\n            client.publish_dataset(**{_es.key: _es})\n\n        # save features to a tempfile and scatter it\n        pickled_feats = cloudpickle.dumps(feature_set)\n        _saved_features = client.scatter(pickled_feats)\n        client.replicate([_es, _saved_features])\n        num_scattered_workers = len(\n            client.who_has([Future(es_token)]).get(es_token, []),\n        )\n        num_workers = len(client.scheduler_info()[\"workers\"].values())\n\n        schema = None\n        if isinstance(cutoff_time, pd.DataFrame):\n            schema = cutoff_time.ww.schema\n            chunks = cutoff_time.groupby(cutoff_df_time_col)\n            cutoff_time_len = cutoff_time.shape[0]\n        else:\n            chunks = cutoff_time\n            cutoff_time_len = len(cutoff_time[1])\n\n        if not chunk_size:\n            chunk_size = _handle_chunk_size(1.0 / num_workers, cutoff_time_len)\n\n        chunks = _chunk_dataframe_groups(chunks, chunk_size)\n\n        chunks = [df for _, df in chunks]\n\n        if len(chunks) < num_workers:  # pragma: no cover\n            chunk_warning = (\n                \"Fewer chunks ({}), than workers ({}) consider reducing the chunk size\"\n            )\n            warning_string = chunk_warning.format(len(chunks), num_workers)\n            progress_bar.write(warning_string)\n\n        scatter_warning(num_scattered_workers, num_workers)\n        end = time.time()\n        scatter_time = round(end - start)\n\n    
    # if enabled, reset timer after scatter for better time remaining estimates\n        if not progress_bar.disable:\n            progress_bar.reset()\n\n        scatter_string = \"EntitySet scattered to {} workers in {} seconds\"\n        progress_bar.write(scatter_string.format(num_scattered_workers, scatter_time))\n        # map chunks\n        # TODO: consider handling task submission dask kwargs\n        _chunks = client.map(\n            calculate_chunk,\n            chunks,\n            feature_set=_saved_features,\n            chunk_size=None,\n            entityset=_es,\n            approximate=approximate,\n            training_window=training_window,\n            save_progress=save_progress,\n            no_unapproximated_aggs=no_unapproximated_aggs,\n            cutoff_df_time_col=cutoff_df_time_col,\n            target_time=target_time,\n            pass_columns=pass_columns,\n            progress_bar=None,\n            progress_callback=progress_callback,\n            include_cutoff_time=include_cutoff_time,\n            schema=schema,\n        )\n\n        feature_matrix = []\n        iterator = as_completed(_chunks).batches()\n        for batch in iterator:\n            results = client.gather(batch)\n            for result in results:\n                feature_matrix.append(result)\n                previous_progress = progress_bar.n\n                progress_bar.update(result.shape[0])\n                if progress_callback is not None:\n                    (\n                        update,\n                        progress_percent,\n                        time_elapsed,\n                    ) = update_progress_callback_parameters(\n                        progress_bar,\n                        previous_progress,\n                    )\n                    progress_callback(update, progress_percent, time_elapsed)\n\n    except Exception:\n        raise\n    finally:\n        if client is not None:\n            client.close()\n\n        if 
\"cluster\" not in dask_kwargs and cluster is not None:\n            cluster.close()  # pragma: no cover\n\n    ww_init_kwargs = get_ww_types_from_features(\n        feature_set.target_features,\n        entityset,\n        pass_columns,\n        cutoff_time,\n    )\n    feature_matrix = init_ww_and_concat_fm(feature_matrix, ww_init_kwargs)\n    return feature_matrix\n\n\ndef _add_approx_dataframe_index_col(es, target_dataframe_name, cutoffs, path):\n    \"\"\"\n    Add a column to the cutoff df linking it to the dataframe at the end of the\n    path.\n\n    Return the updated cutoff df and the name of this column. The name will\n    consist of the columns which were joined through.\n    \"\"\"\n    last_child_col = \"instance_id\"\n    last_parent_col = es[target_dataframe_name].ww.index\n\n    for _, relationship in path:\n        child_cols = [last_parent_col, relationship._child_column_name]\n        child_df = es[relationship.child_name][child_cols]\n\n        # Rename relationship.child_column to include the columns we have\n        # joined through.\n        new_col_name = \"%s.%s\" % (last_child_col, relationship._child_column_name)\n        to_rename = {relationship._child_column_name: new_col_name}\n        child_df = child_df.rename(columns=to_rename)\n        cutoffs = cutoffs.merge(\n            child_df,\n            left_on=last_child_col,\n            right_on=last_parent_col,\n        )\n\n        # These will be used in the next iteration.\n        last_child_col = new_col_name\n        last_parent_col = relationship._parent_column_name\n\n    return cutoffs, new_col_name\n\n\ndef _chunk_dataframe_groups(grouped, chunk_size):\n    \"\"\"chunks a grouped dataframe into groups no larger than chunk_size\"\"\"\n    if isinstance(grouped, tuple):\n        for i in range(0, len(grouped[1]), chunk_size):\n            yield None, (grouped[0], grouped[1].iloc[i : i + chunk_size])\n    else:\n        for group_key, group_df in grouped:\n            for i in 
range(0, len(group_df), chunk_size):\n                yield group_key, group_df.iloc[i : i + chunk_size]\n\n\ndef _handle_chunk_size(chunk_size, total_size):\n    if chunk_size is not None:\n        assert chunk_size > 0, \"Chunk size must be greater than 0\"\n\n        if chunk_size < 1:\n            chunk_size = math.ceil(chunk_size * total_size)\n\n        chunk_size = int(chunk_size)\n\n    return chunk_size\n\n\ndef update_progress_callback_parameters(progress_bar, previous_progress):\n    update = (progress_bar.n - previous_progress) / progress_bar.total * 100\n    progress_percent = (progress_bar.n / progress_bar.total) * 100\n    time_elapsed = progress_bar.format_dict[\"elapsed\"]\n    return (update, progress_percent, time_elapsed)\n\n\ndef init_ww_and_concat_fm(feature_matrix, ww_init_kwargs):\n    cols_to_check = {\n        col\n        for col, ltype in ww_init_kwargs[\"logical_types\"].items()\n        if isinstance(ltype, (Age, Boolean, Integer))\n    }\n    replacement_type = {\n        \"age\": AgeNullable(),\n        \"boolean\": BooleanNullable(),\n        \"integer\": IntegerNullable(),\n    }\n    for fm in feature_matrix:\n        updated_cols = set()\n        for col in cols_to_check:\n            # Only convert types if null values are present\n            if fm[col].isnull().any():\n                current_type = ww_init_kwargs[\"logical_types\"][col].type_string\n                ww_init_kwargs[\"logical_types\"][col] = replacement_type[current_type]\n                updated_cols.add(col)\n        cols_to_check = cols_to_check - updated_cols\n        fm.ww.init(**ww_init_kwargs)\n\n    feature_matrix = pd.concat(feature_matrix)\n\n    feature_matrix.ww.init(**ww_init_kwargs)\n    return feature_matrix\n"
  },
  {
    "path": "featuretools/computational_backends/feature_set.py",
    "content": "import itertools\nimport logging\nfrom collections import defaultdict\n\nfrom featuretools.entityset.relationship import RelationshipPath\nfrom featuretools.feature_base import (\n    AggregationFeature,\n    FeatureOutputSlice,\n    GroupByTransformFeature,\n    TransformFeature,\n)\nfrom featuretools.utils import Trie\n\nlogger = logging.getLogger(\"featuretools.computational_backend\")\n\n\nclass FeatureSet(object):\n    \"\"\"\n    Represents an immutable set of features to be calculated for a single dataframe, and their\n    dependencies.\n    \"\"\"\n\n    def __init__(self, features, approximate_feature_trie=None):\n        \"\"\"\n        Args:\n            features (list[Feature]): Features of the target dataframe.\n            approximate_feature_trie (Trie[RelationshipPath, set[str]], optional): Dependency\n                features to ignore because they have already been approximated. For example, if\n                one of the target features is a direct feature of a feature A and A is included in\n                approximate_feature_trie then neither A nor its dependencies will appear in\n                FeatureSet.feature_trie.\n        \"\"\"\n        self.target_df_name = features[0].dataframe_name\n        self.target_features = features\n        self.target_feature_names = {f.unique_name() for f in features}\n\n        if not approximate_feature_trie:\n            approximate_feature_trie = Trie(\n                default=list,\n                path_constructor=RelationshipPath,\n            )\n        self.approximate_feature_trie = approximate_feature_trie\n\n        # Maps the unique name of each feature to the actual feature. This is necessary\n        # because features do not support equality and so cannot be used as\n        # dictionary keys. 
The equality operator on features produces a new\n        # feature (which will always be truthy).\n        self.features_by_name = {f.unique_name(): f for f in features}\n\n        feature_dependents = defaultdict(set)\n        for f in features:\n            deps = f.get_dependencies(deep=True)\n            for dep in deps:\n                feature_dependents[dep.unique_name()].add(f.unique_name())\n                self.features_by_name[dep.unique_name()] = dep\n                subdeps = dep.get_dependencies(deep=True)\n                for sd in subdeps:\n                    feature_dependents[sd.unique_name()].add(dep.unique_name())\n\n        # feature names (keys) and the features that rely on them (values).\n        self.feature_dependents = {\n            fname: [self.features_by_name[dname] for dname in feature_dependents[fname]]\n            for fname, f in self.features_by_name.items()\n        }\n\n        self._feature_trie = None\n\n    @property\n    def feature_trie(self):\n        \"\"\"\n        The target features and their dependencies organized into a trie by relationship path.\n        This is built once when it is first called (to avoid building it if it is not needed) and\n        then used for all subsequent calls.\n\n        The edges of the trie are RelationshipPaths and the values are tuples of\n        (bool, set[str], set[str]). 
The bool represents whether the full dataframe is needed at\n        that node, the first set contains the names of features which are needed on the full\n        dataframe, and the second set contains the names of the rest of the features\n\n        Returns:\n            Trie[RelationshipPath, (bool, set[str], set[str])]\n        \"\"\"\n        if not self._feature_trie:\n            self._feature_trie = self._build_feature_trie()\n\n        return self._feature_trie\n\n    def _build_feature_trie(self):\n        \"\"\"\n        Build the feature trie by adding the target features and their dependencies recursively.\n        \"\"\"\n        feature_trie = Trie(\n            default=lambda: (False, set(), set()),\n            path_constructor=RelationshipPath,\n        )\n\n        for f in self.target_features:\n            self._add_feature_to_trie(feature_trie, f, self.approximate_feature_trie)\n\n        return feature_trie\n\n    def _add_feature_to_trie(\n        self,\n        trie,\n        feature,\n        approximate_feature_trie,\n        ancestor_needs_full_dataframe=False,\n    ):\n        \"\"\"\n        Add the given feature to the root of the trie, and recurse on its dependencies. 
If it is in\n        approximate_feature_trie then it will not be added and we will not recurse on its dependencies.\n        \"\"\"\n        node_needs_full_dataframe, full_features, not_full_features = trie.value\n        needs_full_dataframe = (\n            ancestor_needs_full_dataframe or self.uses_full_dataframe(feature)\n        )\n\n        name = feature.unique_name()\n\n        # If this feature is ignored then don't add it or any of its dependencies.\n        if name in approximate_feature_trie.value:\n            return\n\n        # Add the feature to one of the sets, depending on whether it needs the full dataframe.\n        if needs_full_dataframe:\n            full_features.add(name)\n            if name in not_full_features:\n                not_full_features.remove(name)\n\n            # Update needs_full_dataframe for this node.\n            trie.value = (True, full_features, not_full_features)\n\n            # Set every node in relationship path to needs_full_dataframe.\n            sub_trie = trie\n            for edge in feature.relationship_path:\n                sub_trie = sub_trie.get_node([edge])\n                (_, f1, f2) = sub_trie.value\n                sub_trie.value = (True, f1, f2)\n        else:\n            if name not in full_features:\n                not_full_features.add(name)\n\n            sub_trie = trie.get_node(feature.relationship_path)\n\n        sub_ignored_trie = approximate_feature_trie.get_node(feature.relationship_path)\n\n        for dep_feat in feature.get_dependencies():\n            if isinstance(dep_feat, FeatureOutputSlice):\n                dep_feat = dep_feat.base_feature\n            self._add_feature_to_trie(\n                sub_trie,\n                dep_feat,\n                sub_ignored_trie,\n                ancestor_needs_full_dataframe=needs_full_dataframe,\n            )\n\n    def group_features(self, feature_names):\n        \"\"\"\n        Topologically sort the given features, then group by 
path,\n        feature type, use_previous, and where.\n        \"\"\"\n        features = [self.features_by_name[name] for name in feature_names]\n        depths = self._get_feature_depths(features)\n\n        def key_func(f):\n            return (\n                depths[f.unique_name()],\n                f.relationship_path_name(),\n                str(f.__class__),\n                _get_use_previous(f),\n                _get_where(f),\n                self.uses_full_dataframe(f),\n                _get_groupby(f),\n            )\n\n        # Sort the list of features by the complex key function above, then\n        # group them by the same key\n        sort_feats = sorted(features, key=key_func)\n        feature_groups = [\n            list(g) for _, g in itertools.groupby(sort_feats, key=key_func)\n        ]\n\n        return feature_groups\n\n    def _get_feature_depths(self, features):\n        \"\"\"\n        Generate and return a mapping of {feature name -> depth} in the\n        feature DAG for the given dataframe.\n        \"\"\"\n        order = defaultdict(int)\n        depths = {}\n        queue = features[:]\n        while queue:\n            # Get the next feature.\n            f = queue.pop(0)\n\n            depths[f.unique_name()] = order[f.unique_name()]\n\n            # Only look at dependencies if they are on the same dataframe.\n            if not f.relationship_path:\n                dependencies = f.get_dependencies()\n                for dep in dependencies:\n                    order[dep.unique_name()] = min(\n                        order[f.unique_name()] - 1,\n                        order[dep.unique_name()],\n                    )\n                    queue.append(dep)\n\n        return depths\n\n    def uses_full_dataframe(self, feature, check_dependents=False):\n        if (\n            isinstance(feature, TransformFeature)\n            and feature.primitive.uses_full_dataframe\n        ):\n            return True\n        return 
check_dependents and self._dependent_uses_full_dataframe(feature)\n\n    def _dependent_uses_full_dataframe(self, feature):\n        for d in self.feature_dependents[feature.unique_name()]:\n            if isinstance(d, TransformFeature) and d.primitive.uses_full_dataframe:\n                return True\n        return False\n\n\n# These functions are used for sorting and grouping features\n\n\ndef _get_use_previous(\n    f,\n):  # TODO Sort and group features for DateOffset with two different temporal values\n    if isinstance(f, AggregationFeature) and f.use_previous is not None:\n        if len(f.use_previous.times.keys()) > 1:\n            return (\"\", -1)\n        else:\n            unit = list(f.use_previous.times.keys())[0]\n            value = f.use_previous.times[unit]\n            return (unit, value)\n    else:\n        return (\"\", -1)\n\n\ndef _get_where(f):\n    if isinstance(f, AggregationFeature) and f.where is not None:\n        return f.where.unique_name()\n    else:\n        return \"\"\n\n\ndef _get_groupby(f):\n    if isinstance(f, GroupByTransformFeature):\n        return f.groupby.unique_name()\n    else:\n        return \"\"\n"
  },
  {
    "path": "featuretools/computational_backends/feature_set_calculator.py",
    "content": "from datetime import datetime\nfrom functools import partial\n\nimport numpy as np\nimport pandas as pd\nimport pandas.api.types as pdtypes\n\nfrom featuretools.entityset.relationship import RelationshipPath\nfrom featuretools.exceptions import UnknownFeature\nfrom featuretools.feature_base import (\n    AggregationFeature,\n    DirectFeature,\n    GroupByTransformFeature,\n    IdentityFeature,\n    TransformFeature,\n)\nfrom featuretools.utils import Trie\nfrom featuretools.utils.gen_utils import get_relationship_column_id\n\n\nclass FeatureSetCalculator(object):\n    \"\"\"\n    Calculates the values of a set of features for given instance ids.\n    \"\"\"\n\n    def __init__(\n        self,\n        entityset,\n        feature_set,\n        time_last=None,\n        training_window=None,\n        precalculated_features=None,\n    ):\n        \"\"\"\n        Args:\n            feature_set (FeatureSet): The features to calculate values for.\n\n            time_last (pd.Timestamp, optional): Last allowed time. Data from exactly this\n                time is not allowed.\n\n            training_window (Timedelta, optional): Window defining how much time before the cutoff time data\n                can be used when calculating features. 
If None, all data before cutoff time is used.\n\n            precalculated_features (Trie[RelationshipPath -> pd.DataFrame]):\n                Maps RelationshipPaths to dataframes of precalculated features.\n\n        \"\"\"\n        self.entityset = entityset\n        self.feature_set = feature_set\n        self.training_window = training_window\n\n        if time_last is None:\n            time_last = datetime.now()\n\n        self.time_last = time_last\n\n        if precalculated_features is None:\n            precalculated_features = Trie(path_constructor=RelationshipPath)\n\n        self.precalculated_features = precalculated_features\n\n        # total number of features (including dependencies) to be calculated\n        self.num_features = sum(\n            len(features1) + len(features2)\n            for _, (_, features1, features2) in self.feature_set.feature_trie\n        )\n\n    def run(self, instance_ids, progress_callback=None, include_cutoff_time=True):\n        \"\"\"\n        Calculate values of features for the given instances of the target\n        dataframe.\n\n        Summary of algorithm:\n        1. Construct a trie where the edges are relationships and each node\n            contains a set of features for a single dataframe. See\n            FeatureSet._build_feature_trie.\n        2. Initialize a trie for storing dataframes.\n        3. Traverse the trie using depth-first search. At each node calculate\n            the features and store the resulting dataframe in the dataframe\n            trie (so that its values can be used by features which depend on\n            these features). See _calculate_features_for_dataframe.\n        4. 
Get the dataframe at the root of the trie (for the target dataframe) and\n            return the columns corresponding to the requested features.\n\n        Args:\n            instance_ids (np.ndarray or pd.Categorical): Instance ids for which\n                to build features.\n\n            progress_callback (callable): function to be called with incremental progress updates\n\n            include_cutoff_time (bool): If True, data at cutoff time are included\n                in calculating features.\n\n        Returns:\n            pd.DataFrame : Pandas DataFrame of calculated feature values.\n                Indexed by instance_ids. Columns in same order as features\n                passed in.\n        \"\"\"\n        assert len(instance_ids) > 0, \"0 instance ids provided\"\n\n        if progress_callback is None:\n            # do nothing for the progress call back if not provided\n            def progress_callback(*args):\n                pass\n\n        feature_trie = self.feature_set.feature_trie\n\n        df_trie = Trie(path_constructor=RelationshipPath)\n        full_dataframe_trie = Trie(path_constructor=RelationshipPath)\n\n        target_dataframe = self.entityset[self.feature_set.target_df_name]\n\n        self._calculate_features_for_dataframe(\n            dataframe_name=self.feature_set.target_df_name,\n            feature_trie=feature_trie,\n            df_trie=df_trie,\n            full_dataframe_trie=full_dataframe_trie,\n            precalculated_trie=self.precalculated_features,\n            filter_column=target_dataframe.ww.index,\n            filter_values=instance_ids,\n            progress_callback=progress_callback,\n            include_cutoff_time=include_cutoff_time,\n        )\n\n        # The dataframe for the target dataframe should be stored at the root of\n        # df_trie.\n        df = df_trie.value\n\n        # Fill in empty rows with default values.\n        index_dtype = df.index.dtype.name\n        if df.empty:\n           
 return self.generate_default_df(instance_ids=instance_ids)\n\n        missing_ids = [\n            i for i in instance_ids if i not in df[target_dataframe.ww.index]\n        ]\n        if missing_ids:\n            default_df = self.generate_default_df(\n                instance_ids=missing_ids,\n                extra_columns=df.columns,\n            )\n\n            df = pd.concat([df, default_df], sort=True)\n\n        df.index.name = self.entityset[self.feature_set.target_df_name].ww.index\n\n        # Order by instance_ids\n        unique_instance_ids = pd.unique(instance_ids)\n        unique_instance_ids = unique_instance_ids.astype(instance_ids.dtype)\n        df = df.reindex(unique_instance_ids)\n\n        # Keep categorical index if original index was categorical\n        if index_dtype == \"category\":\n            df.index = df.index.astype(\"category\")\n\n        column_list = []\n\n        for feat in self.feature_set.target_features:\n            column_list.extend(feat.get_feature_names())\n\n        return df[column_list]\n\n    def _calculate_features_for_dataframe(\n        self,\n        dataframe_name,\n        feature_trie,\n        df_trie,\n        full_dataframe_trie,\n        precalculated_trie,\n        filter_column,\n        filter_values,\n        parent_data=None,\n        progress_callback=None,\n        include_cutoff_time=True,\n    ):\n        \"\"\"\n        Generate dataframes with features calculated for this node of the trie,\n        and all descendant nodes. The dataframes will be stored in df_trie.\n\n        Args:\n            dataframe_name (str): The name of the dataframe to calculate features for.\n\n            feature_trie (Trie): the trie with sets of features to calculate.\n                The root contains features for the given dataframe.\n\n            df_trie (Trie): a parallel trie for storing dataframes. 
The\n                dataframe with features calculated will be placed in the root.\n\n            full_dataframe_trie (Trie): a trie storing dataframes with all dataframe\n                rows, for features that use the full dataframe (uses_full_dataframe).\n\n            precalculated_trie (Trie): a parallel trie containing dataframes\n                with precalculated features. The dataframe specified by dataframe_name\n                will be at the root.\n\n            filter_column (str): The name of the column to filter this\n                dataframe by.\n\n            filter_values (pd.Series): The values to filter the filter_column\n                to.\n\n            parent_data (tuple[Relationship, list[str], pd.DataFrame]): Data\n                related to the parent of this trie. This will only be present if\n                the relationship points from this dataframe to the parent dataframe. A\n                3-tuple of (parent_relationship,\n                ancestor_relationship_columns, parent_df).\n                ancestor_relationship_columns is the names of columns which\n                link the parent dataframe to its ancestors.\n\n            include_cutoff_time (bool): If True, data at cutoff time are included\n                in calculating features.\n\n        \"\"\"\n        # Step 1: Get a dataframe for the given dataframe name, filtered by the given\n        # conditions.\n\n        (\n            need_full_dataframe,\n            full_dataframe_features,\n            not_full_dataframe_features,\n        ) = feature_trie.value\n\n        all_features = full_dataframe_features | not_full_dataframe_features\n        columns = self._necessary_columns(dataframe_name, all_features)\n\n        # If we need the full dataframe then don't filter by filter_values.\n        if need_full_dataframe:\n            query_column = None\n            query_values = None\n        else:\n            query_column = filter_column\n            query_values = filter_values\n\n    
    df = self.entityset.query_by_values(\n            dataframe_name=dataframe_name,\n            instance_vals=query_values,\n            column_name=query_column,\n            columns=columns,\n            time_last=self.time_last,\n            training_window=self.training_window,\n            include_cutoff_time=include_cutoff_time,\n        )\n\n        # call to update timer\n        progress_callback(0)\n\n        # Step 2: Add columns to the dataframe linking it to all ancestors.\n        new_ancestor_relationship_columns = []\n        if parent_data:\n            parent_relationship, ancestor_relationship_columns, parent_df = parent_data\n\n            if ancestor_relationship_columns:\n                (\n                    df,\n                    new_ancestor_relationship_columns,\n                ) = self._add_ancestor_relationship_columns(\n                    df,\n                    parent_df,\n                    ancestor_relationship_columns,\n                    parent_relationship,\n                )\n\n            # Add the column linking this dataframe to its parent, so that\n            # descendants get linked to the parent.\n            new_ancestor_relationship_columns.append(\n                parent_relationship._child_column_name,\n            )\n\n        # call to update timer\n        progress_callback(0)\n\n        # Step 3: Recurse on children.\n\n        # Pass filtered values, even if we are using a full df.\n        if need_full_dataframe:\n            filtered_df = df[df[filter_column].isin(filter_values)]\n        else:\n            filtered_df = df\n\n        for edge, sub_trie in feature_trie.children():\n            is_forward, relationship = edge\n            if is_forward:\n                sub_dataframe_name = relationship.parent_dataframe.ww.name\n                sub_filter_column = relationship._parent_column_name\n                sub_filter_values = filtered_df[relationship._child_column_name]\n                
parent_data = None\n            else:\n                sub_dataframe_name = relationship.child_dataframe.ww.name\n                sub_filter_column = relationship._child_column_name\n                sub_filter_values = filtered_df[relationship._parent_column_name]\n\n                parent_data = (relationship, new_ancestor_relationship_columns, df)\n\n            sub_df_trie = df_trie.get_node([edge])\n            sub_full_dataframe_trie = full_dataframe_trie.get_node([edge])\n            sub_precalc_trie = precalculated_trie.get_node([edge])\n            self._calculate_features_for_dataframe(\n                dataframe_name=sub_dataframe_name,\n                feature_trie=sub_trie,\n                df_trie=sub_df_trie,\n                full_dataframe_trie=sub_full_dataframe_trie,\n                precalculated_trie=sub_precalc_trie,\n                filter_column=sub_filter_column,\n                filter_values=sub_filter_values,\n                parent_data=parent_data,\n                progress_callback=progress_callback,\n                include_cutoff_time=include_cutoff_time,\n            )\n\n        # Step 4: Calculate the features for this dataframe.\n        #\n        # All dependencies of the features for this dataframe have been calculated\n        # by the above recursive calls, and their results stored in df_trie.\n\n        # Add any precalculated features.\n        precalculated_features_df = precalculated_trie.value\n        if precalculated_features_df is not None:\n            # Left outer merge to keep all rows of df.\n            df = df.merge(\n                precalculated_features_df,\n                how=\"left\",\n                left_index=True,\n                right_index=True,\n                suffixes=(\"\", \"_precalculated\"),\n            )\n\n        # call to update timer\n        progress_callback(0)\n\n        # First, calculate any features that require the full dataframe. 
These can\n        # be calculated first because all of their dependents are included in\n        # full_dataframe_features.\n        if need_full_dataframe:\n            df = self._calculate_features(\n                df,\n                full_dataframe_trie,\n                full_dataframe_features,\n                progress_callback,\n            )\n\n            # Store full dataframe\n            full_dataframe_trie.value = df\n\n            # Filter df so that features that don't require the full dataframe are\n            # only calculated on the necessary instances.\n            df = df[df[filter_column].isin(filter_values)]\n\n        # Calculate all features that don't require the full dataframe.\n        df = self._calculate_features(\n            df,\n            df_trie,\n            not_full_dataframe_features,\n            progress_callback,\n        )\n\n        # Step 5: Store the dataframe for this dataframe at the root of df_trie, so\n        # that it can be accessed by the caller.\n        df_trie.value = df\n\n    def _calculate_features(self, df, df_trie, features, progress_callback):\n        # Group the features so that each group can be calculated together.\n        # The groups must also be in topological order (if A is a transform of B\n        # then B must be in a group before A).\n        feature_groups = self.feature_set.group_features(features)\n\n        for group in feature_groups:\n            representative_feature = group[0]\n            handler = self._feature_type_handler(representative_feature)\n            df = handler(group, df, df_trie, progress_callback)\n\n        return df\n\n    def _add_ancestor_relationship_columns(\n        self,\n        child_df,\n        parent_df,\n        ancestor_relationship_columns,\n        relationship,\n    ):\n        \"\"\"\n        Merge ancestor_relationship_columns from parent_df into child_df, adding a prefix to\n        each column name specifying the relationship.\n\n        
Return the updated df and the new relationship column names.\n\n        Args:\n            child_df (pd.DataFrame): The dataframe to add relationship columns to.\n            parent_df (pd.DataFrame): The dataframe to copy relationship columns from.\n            ancestor_relationship_columns (list[str]): The names of\n                relationship columns in the parent_df to copy into child_df.\n            relationship (Relationship): the relationship through which the\n                child is connected to the parent.\n        \"\"\"\n        relationship_name = relationship.parent_name\n        new_relationship_columns = [\n            \"%s.%s\" % (relationship_name, col) for col in ancestor_relationship_columns\n        ]\n\n        # create an intermediate dataframe which shares a column\n        # with the child dataframe and has a column with the\n        # original parent's id.\n        col_map = {relationship._parent_column_name: relationship._child_column_name}\n        for child_column, parent_column in zip(\n            new_relationship_columns,\n            ancestor_relationship_columns,\n        ):\n            col_map[parent_column] = child_column\n\n        merge_df = parent_df[list(col_map.keys())].rename(columns=col_map)\n\n        merge_df.index.name = None  # change index name for merge\n\n        # Merge the dataframe, adding the relationship columns to the child.\n        # Left outer join so that all rows in child are kept (if it contains\n        # all rows of the dataframe then there may not be corresponding rows in the\n        # parent_df).\n        df = child_df.merge(\n            merge_df,\n            how=\"left\",\n            left_on=relationship._child_column_name,\n            right_on=relationship._child_column_name,\n        )\n\n        # ensure index is maintained\n        df.set_index(\n            relationship.child_dataframe.ww.index,\n            drop=False,\n            inplace=True,\n        )\n\n        return df, 
new_relationship_columns\n\n    def generate_default_df(self, instance_ids, extra_columns=None):\n        default_row = []\n        default_cols = []\n        for f in self.feature_set.target_features:\n            for name in f.get_feature_names():\n                default_cols.append(name)\n                default_row.append(f.default_value)\n\n        default_matrix = [default_row] * len(instance_ids)\n        default_df = pd.DataFrame(\n            default_matrix,\n            columns=default_cols,\n            index=instance_ids,\n            dtype=\"object\",\n        )\n        index_name = self.entityset[self.feature_set.target_df_name].ww.index\n        default_df.index.name = index_name\n        if extra_columns is not None:\n            for c in extra_columns:\n                if c not in default_df.columns:\n                    default_df[c] = [np.nan] * len(instance_ids)\n        return default_df\n\n    def _feature_type_handler(self, f):\n        if type(f) == TransformFeature:\n            return self._calculate_transform_features\n        elif type(f) == GroupByTransformFeature:\n            return self._calculate_groupby_features\n        elif type(f) == DirectFeature:\n            return self._calculate_direct_features\n        elif type(f) == AggregationFeature:\n            return self._calculate_agg_features\n        elif type(f) == IdentityFeature:\n            return self._calculate_identity_features\n        else:\n            raise UnknownFeature(\"{} feature unknown\".format(f.__class__))\n\n    def _calculate_identity_features(self, features, df, _df_trie, progress_callback):\n        for f in features:\n            assert f.get_name() in df.columns, (\n                'Column \"%s\" missing from dataframe' % f.get_name()\n            )\n\n        progress_callback(len(features) / float(self.num_features))\n\n        return df\n\n    def _calculate_transform_features(\n        self,\n        features,\n        frame,\n        
_df_trie,\n        progress_callback,\n    ):\n        frame_empty = frame.empty\n        feature_values = []\n        for f in features:\n            # handle when no data\n            if frame_empty:\n                # Even though we are adding the default values here, when these new\n                # features are added to the dataframe in update_feature_columns, they\n                # are added as empty columns since the dataframe itself is empty.\n                feature_values.append(\n                    (f, [f.default_value for _ in range(f.number_output_features)]),\n                )\n                progress_callback(1 / float(self.num_features))\n                continue\n\n            # collect only the columns we need for this transformation\n\n            column_data = [frame[bf.get_name()] for bf in f.base_features]\n\n            feature_func = f.get_function()\n            # apply the function to the relevant dataframe slice and add the\n            # feature row to the results dataframe.\n            if f.primitive.uses_calc_time:\n                values = feature_func(*column_data, time=self.time_last)\n            else:\n                values = feature_func(*column_data)\n\n            # if we don't get just the values, the assignment breaks when indexes don't match\n            if f.number_output_features > 1:\n                values = [strip_values_if_series(value) for value in values]\n            else:\n                values = [strip_values_if_series(values)]\n\n            feature_values.append((f, values))\n\n            progress_callback(1 / float(self.num_features))\n\n        frame = update_feature_columns(feature_values, frame)\n        return frame\n\n    def _calculate_groupby_features(self, features, frame, _df_trie, progress_callback):\n        # set default values to handle the null group\n        default_values = {}\n        for f in features:\n            for name in f.get_feature_names():\n                
default_values[name] = f.default_value\n\n        frame = pd.concat(\n            [frame, pd.DataFrame(default_values, index=frame.index)],\n            axis=1,\n        )\n\n        # handle when no data\n        if frame.shape[0] == 0:\n            progress_callback(len(features) / float(self.num_features))\n\n            return frame\n\n        groupby = features[0].groupby.get_name()\n        grouped = frame.groupby(groupby)\n        groups = frame[\n            groupby\n        ].unique()  # get all the unique group names to iterate over later\n\n        for f in features:\n            feature_vals = []\n            for _ in range(f.number_output_features):\n                feature_vals.append([])\n\n            for group in groups:\n                # skip null key if it exists\n                if pd.isnull(group):\n                    continue\n\n                column_names = [bf.get_name() for bf in f.base_features]\n                # exclude the groupby column from being passed to the function\n                column_data = [\n                    grouped[name].get_group(group) for name in column_names[:-1]\n                ]\n                feature_func = f.get_function()\n\n                # apply the function to the relevant dataframe slice and add the\n                # feature row to the results dataframe.\n                if f.primitive.uses_calc_time:\n                    values = feature_func(*column_data, time=self.time_last)\n                else:\n                    values = feature_func(*column_data)\n\n                if f.number_output_features == 1:\n                    values = [values]\n\n                # make sure index is aligned\n                for i, value in enumerate(values):\n                    if isinstance(value, pd.Series):\n                        value.index = column_data[0].index\n                    else:\n                        value = pd.Series(value, index=column_data[0].index)\n                    
feature_vals[i].append(value)\n\n            if any(feature_vals):\n                assert len(feature_vals) == len(f.get_feature_names())\n                for col_vals, name in zip(feature_vals, f.get_feature_names()):\n                    frame[name].update(pd.concat(col_vals))\n\n            progress_callback(1 / float(self.num_features))\n\n        return frame\n\n    def _calculate_direct_features(\n        self,\n        features,\n        child_df,\n        df_trie,\n        progress_callback,\n    ):\n        path = features[0].relationship_path\n        assert len(path) == 1, \"Error calculating DirectFeatures, len(path) != 1\"\n\n        parent_df = df_trie.get_node([path[0]]).value\n        _is_forward, relationship = path[0]\n        merge_col = relationship._child_column_name\n\n        # generate a mapping of old column names (in the parent dataframe) to\n        # new column names (in the child dataframe) for the merge\n        col_map = {relationship._parent_column_name: merge_col}\n        index_as_feature = None\n\n        fillna_dict = {}\n        for f in features:\n            feature_defaults = {\n                name: f.default_value\n                for name in f.get_feature_names()\n                if not pd.isna(f.default_value)\n            }\n            fillna_dict.update(feature_defaults)\n            if f.base_features[0].get_name() == relationship._parent_column_name:\n                index_as_feature = f\n            base_names = f.base_features[0].get_feature_names()\n            for name, base_name in zip(f.get_feature_names(), base_names):\n                if name in child_df.columns:\n                    continue\n                col_map[base_name] = name\n\n        # merge the identity feature from the parent dataframe into the child\n        merge_df = parent_df[list(col_map.keys())].rename(columns=col_map)\n\n        if index_as_feature is not None:\n            merge_df.set_index(\n                
index_as_feature.get_name(),\n                inplace=True,\n                drop=False,\n            )\n        else:\n            merge_df.set_index(merge_col, inplace=True)\n\n        new_df = child_df.merge(\n            merge_df,\n            left_on=merge_col,\n            right_index=True,\n            how=\"left\",\n        )\n\n        progress_callback(len(features) / float(self.num_features))\n\n        return new_df.fillna(fillna_dict)\n\n    def _calculate_agg_features(self, features, frame, df_trie, progress_callback):\n        test_feature = features[0]\n        child_dataframe = test_feature.base_features[0].dataframe\n        base_frame = df_trie.get_node(test_feature.relationship_path).value\n        # Sometimes approximate features get computed in a previous filter frame\n        # and put in the current one dynamically,\n        # so there may be existing features here\n        fl = []\n        for f in features:\n            for ind in f.get_feature_names():\n                if ind not in frame.columns:\n                    fl.append(f)\n                    break\n        features = fl\n        if not len(features):\n            progress_callback(len(features) / float(self.num_features))\n            return frame\n\n        # handle where\n        base_frame_empty = base_frame.empty\n        where = test_feature.where\n        if where is not None and not base_frame_empty:\n            base_frame = base_frame.loc[base_frame[where.get_name()]]\n\n        # when no child data, just add all the features to frame with nan\n        base_frame_empty = base_frame.empty\n        if base_frame_empty:\n            feature_values = []\n            for f in features:\n                feature_values.append((f, np.full(f.number_output_features, np.nan)))\n                progress_callback(1 / float(self.num_features))\n            frame = update_feature_columns(feature_values, frame)\n        else:\n            relationship_path = 
test_feature.relationship_path\n\n            groupby_col = get_relationship_column_id(relationship_path)\n\n            # if the use_previous property exists on this feature, include only the\n            # instances from the child dataframe included in that Timedelta\n            use_previous = test_feature.use_previous\n            if use_previous:\n                # Filter by use_previous values\n                time_last = self.time_last\n                if use_previous.has_no_observations():\n                    time_first = time_last - use_previous\n                    ti = child_dataframe.ww.time_index\n                    if ti is not None:\n                        base_frame = base_frame[base_frame[ti] >= time_first]\n                else:\n                    n = use_previous.get_value(\"o\")\n\n                    def last_n(df):\n                        return df.iloc[-n:]\n\n                    base_frame = base_frame.groupby(\n                        groupby_col,\n                        observed=True,\n                        sort=False,\n                        group_keys=False,\n                    ).apply(last_n)\n\n            to_agg = {}\n            agg_rename = {}\n            to_apply = set()\n            # apply multi-column and time-dependent features as we find them, and\n            # save aggregable features for later\n            for f in features:\n                if _can_agg(f):\n                    column_id = f.base_features[0].get_name()\n                    if column_id not in to_agg:\n                        to_agg[column_id] = []\n                    func = f.get_function()\n\n                    # for some reason, using the string count is significantly\n                    # faster than any method a primitive can return\n                    # https://stackoverflow.com/questions/55731149/use-a-function-instead-of-string-in-pandas-groupby-agg\n                    if func == pd.Series.count:\n                        func = 
\"count\"\n\n                    funcname = func\n                    if callable(func):\n                        # if the same function is being applied to the same\n                        # column twice, wrap it in a partial to avoid\n                        # duplicate functions\n                        funcname = str(id(func))\n                        if \"{}-{}\".format(column_id, funcname) in agg_rename:\n                            func = partial(func)\n                            funcname = str(id(func))\n\n                        func.__name__ = funcname\n\n                    to_agg[column_id].append(func)\n                    # this is used below to rename columns that pandas names for us\n                    agg_rename[\"{}-{}\".format(column_id, funcname)] = f.get_name()\n                    continue\n\n                to_apply.add(f)\n\n            # Apply the non-aggregable functions generate a new dataframe, and merge\n            # it with the existing one\n            if len(to_apply):\n                wrap = agg_wrapper(to_apply, self.time_last)\n                # groupby_col can be both the name of the index and a column,\n                # to silence pandas warning about ambiguity we explicitly pass\n                # the column (in actuality grouping by both index and group would\n                # work)\n                to_merge = base_frame.groupby(\n                    base_frame[groupby_col],\n                    observed=True,\n                    sort=False,\n                    group_keys=False,\n                ).apply(wrap)\n                frame = pd.merge(\n                    left=frame,\n                    right=to_merge,\n                    left_index=True,\n                    right_index=True,\n                    how=\"left\",\n                )\n\n                progress_callback(len(to_apply) / float(self.num_features))\n\n            # Apply the aggregate functions to generate a new dataframe, and merge\n            # 
it with the existing one\n            if len(to_agg):\n                # groupby_col can be both the name of the index and a column,\n                # to silence pandas warning about ambiguity we explicitly pass\n                # the column (in actuality grouping by both index and group would\n                # work)\n                to_merge = base_frame.groupby(\n                    base_frame[groupby_col],\n                    observed=True,\n                    sort=False,\n                ).agg(to_agg)\n                # rename columns to the correct feature names\n                to_merge.columns = [agg_rename[\"-\".join(x)] for x in to_merge.columns]\n                to_merge = to_merge[list(agg_rename.values())]\n\n                # Workaround for pandas bug where categories are in the wrong order\n                # see: https://github.com/pandas-dev/pandas/issues/22501\n                #\n                # Pandas claims that bug is fixed but it still shows up in some\n                # cases.  
More investigation needed.\n                if isinstance(frame.index, pd.CategoricalDtype):\n                    categories = pdtypes.CategoricalDtype(\n                        categories=frame.index.categories,\n                    )\n                    to_merge.index = to_merge.index.astype(object).astype(categories)\n\n                frame = pd.merge(\n                    left=frame,\n                    right=to_merge,\n                    left_index=True,\n                    right_index=True,\n                    how=\"left\",\n                )\n\n                # determine number of features that were just merged\n                progress_callback(len(to_merge.columns) / float(self.num_features))\n\n        # Handle default values\n        fillna_dict = {}\n        for f in features:\n            feature_defaults = {name: f.default_value for name in f.get_feature_names()}\n            fillna_dict.update(feature_defaults)\n\n        frame = frame.fillna(fillna_dict)\n\n        return frame\n\n    def _necessary_columns(self, dataframe_name, feature_names):\n        # We have to keep all index and foreign columns because we don't know what forward\n        # relationships will come from this node.\n        df = self.entityset[dataframe_name]\n        index_columns = {\n            col\n            for col in df.columns\n            if {\"index\", \"foreign_key\", \"time_index\"} & df.ww.semantic_tags[col]\n        }\n        features = (self.feature_set.features_by_name[name] for name in feature_names)\n\n        feature_columns = {\n            f.column_name for f in features if isinstance(f, IdentityFeature)\n        }\n        return list(index_columns | feature_columns)\n\n\ndef _can_agg(feature):\n    assert isinstance(feature, AggregationFeature)\n    base_features = feature.base_features\n    if feature.where is not None:\n        base_features = [\n            bf.get_name()\n            for bf in base_features\n            if bf.get_name() != 
feature.where.get_name()\n        ]\n\n    if feature.primitive.uses_calc_time:\n        return False\n    single_output = feature.primitive.number_output_features == 1\n    return len(base_features) == 1 and single_output\n\n\ndef agg_wrapper(feats, time_last):\n    def wrap(df):\n        d = {}\n        feature_values = []\n        for f in feats:\n            func = f.get_function()\n            column_ids = [bf.get_name() for bf in f.base_features]\n            args = [df[v] for v in column_ids]\n\n            if f.primitive.uses_calc_time:\n                values = func(*args, time=time_last)\n            else:\n                values = func(*args)\n\n            if f.number_output_features == 1:\n                values = [values]\n            feature_values.append((f, values))\n\n        d = update_feature_columns(feature_values, d)\n\n        return pd.Series(d)\n\n    return wrap\n\n\ndef update_feature_columns(feature_data, data):\n    new_cols = {}\n    for item in feature_data:\n        names = item[0].get_feature_names()\n        values = item[1]\n        assert len(names) == len(values)\n        for name, value in zip(names, values):\n            new_cols[name] = value\n\n    # Handle the case where a dict is being updated\n    if isinstance(data, dict):\n        data.update(new_cols)\n        return data\n\n    return pd.concat([data, pd.DataFrame(new_cols, index=data.index)], axis=1)\n\n\ndef strip_values_if_series(values):\n    if isinstance(values, pd.Series):\n        values = values.values\n    return values\n"
  },
  {
    "path": "featuretools/computational_backends/utils.py",
    "content": "import logging\nimport os\nimport typing\nimport warnings\nfrom datetime import datetime\nfrom functools import wraps\n\nimport numpy as np\nimport pandas as pd\nimport psutil\nfrom woodwork.logical_types import Datetime, Double\n\nfrom featuretools.entityset.relationship import RelationshipPath\nfrom featuretools.feature_base import AggregationFeature, DirectFeature\nfrom featuretools.utils import Trie\nfrom featuretools.utils.gen_utils import import_or_none\nfrom featuretools.utils.wrangle import _check_time_type, _check_timedelta\n\nlogger = logging.getLogger(\"featuretools.computational_backend\")\n\n\ndef bin_cutoff_times(cutoff_time, bin_size):\n    binned_cutoff_time = cutoff_time.ww.copy()\n    if isinstance(bin_size, int):\n        binned_cutoff_time[\"time\"] = binned_cutoff_time[\"time\"].apply(\n            lambda x: x / bin_size * bin_size,\n        )\n    else:\n        bin_size = _check_timedelta(bin_size)\n        binned_cutoff_time[\"time\"] = datetime_round(\n            binned_cutoff_time[\"time\"],\n            bin_size,\n        )\n    return binned_cutoff_time\n\n\ndef save_csv_decorator(save_progress=None):\n    def inner_decorator(method):\n        @wraps(method)\n        def wrapped(*args, **kwargs):\n            if save_progress is None:\n                r = method(*args, **kwargs)\n            else:\n                time = args[0].to_pydatetime()\n                file_name = \"ft_\" + time.strftime(\"%Y_%m_%d_%I-%M-%S-%f\") + \".csv\"\n                file_path = os.path.join(save_progress, file_name)\n                temp_dir = os.path.join(save_progress, \"temp\")\n                if not os.path.exists(temp_dir):\n                    os.makedirs(temp_dir)\n                temp_file_path = os.path.join(temp_dir, file_name)\n                r = method(*args, **kwargs)\n                r.to_csv(temp_file_path)\n                os.rename(temp_file_path, file_path)\n            return r\n\n        return wrapped\n\n    return 
inner_decorator\n\n\ndef datetime_round(dt, freq):\n    \"\"\"\n    Round down a Timestamp series to the specified freq\n    \"\"\"\n    if not freq.is_absolute():\n        raise ValueError(\"Unit is relative\")\n\n    # TODO: multitemporal units\n    all_units = list(freq.times.keys())\n    if len(all_units) == 1:\n        unit = all_units[0]\n        value = freq.times[unit]\n        if unit == \"m\":\n            unit = \"t\"\n        # No support for weeks in datetime.datetime\n        if unit == \"w\":\n            unit = \"d\"\n            value = value * 7\n        freq = str(value) + unit\n        return dt.dt.floor(freq)\n    else:\n        # A bare `assert \"...\"` always passes, so raise explicitly instead.\n        raise ValueError(\"Frequency cannot have multiple temporal parameters\")\n\n\ndef gather_approximate_features(feature_set):\n    \"\"\"\n    Find features which can be approximated. Returned as a trie where the values\n    are sets of feature names.\n\n    Args:\n        feature_set (FeatureSet): Features to search the dependencies of for\n            features to approximate.\n\n    Returns:\n        Trie[RelationshipPath, set[str]]\n    \"\"\"\n    approximate_feature_trie = Trie(default=set, path_constructor=RelationshipPath)\n\n    for feature in feature_set.target_features:\n        if feature_set.uses_full_dataframe(feature, check_dependents=True):\n            continue\n\n        if isinstance(feature, DirectFeature):\n            path = feature.relationship_path\n            base_feature = feature.base_features[0]\n\n            while isinstance(base_feature, DirectFeature):\n                path = path + base_feature.relationship_path\n                base_feature = base_feature.base_features[0]\n\n            if isinstance(base_feature, AggregationFeature):\n                node_feature_set = approximate_feature_trie.get_node(path).value\n                node_feature_set.add(base_feature.unique_name())\n\n    return approximate_feature_trie\n\n\ndef gen_empty_approx_features_df(approx_features):\n    df = 
pd.DataFrame(columns=[f.get_name() for f in approx_features])\n    df.index.name = approx_features[0].dataframe.ww.index\n    return df\n\n\ndef n_jobs_to_workers(n_jobs):\n    try:\n        cpus = len(psutil.Process().cpu_affinity())\n    except AttributeError:\n        cpus = psutil.cpu_count()\n\n    # Taken from sklearn parallel_backends code\n    # https://github.com/scikit-learn/scikit-learn/blob/27bbdb570bac062c71b3bb21b0876fd78adc9f7e/sklearn/externals/joblib/_parallel_backends.py#L120\n    if n_jobs < 0:\n        workers = max(cpus + 1 + n_jobs, 1)\n    else:\n        workers = min(n_jobs, cpus)\n\n    assert workers > 0, \"Need at least one worker\"\n    return workers\n\n\ndef create_client_and_cluster(n_jobs, dask_kwargs, entityset_size):\n    Client, LocalCluster = get_client_cluster()\n\n    cluster = None\n    if \"cluster\" in dask_kwargs:\n        cluster = dask_kwargs[\"cluster\"]\n    else:\n        # diagnostics_port sets the default port to launch bokeh web interface\n        # if it is set to None web interface will not be launched\n        diagnostics_port = None\n        if \"diagnostics_port\" in dask_kwargs:\n            diagnostics_port = dask_kwargs[\"diagnostics_port\"]\n            del dask_kwargs[\"diagnostics_port\"]\n\n        workers = n_jobs_to_workers(n_jobs)\n        if n_jobs != -1 and workers < n_jobs:\n            warning_string = \"{} workers requested, but only {} workers created.\"\n            warning_string = warning_string.format(n_jobs, workers)\n            warnings.warn(warning_string)\n\n        # Distributed default memory_limit for worker is 'auto'. It calculates worker\n        # memory limit as total virtual memory divided by the number\n        # of cores available to the workers (always 1 for featuretools setup).\n        # This means reducing the number of workers does not increase the memory\n        # limit for other workers.  
Featuretools default is to calculate memory limit\n        # as total virtual memory divided by number of workers. To use distributed\n        # default memory limit, set dask_kwargs['memory_limit']='auto'\n        if \"memory_limit\" in dask_kwargs:\n            memory_limit = dask_kwargs[\"memory_limit\"]\n            del dask_kwargs[\"memory_limit\"]\n        else:\n            total_memory = psutil.virtual_memory().total\n            memory_limit = int(total_memory / float(workers))\n\n        cluster = LocalCluster(\n            n_workers=workers,\n            threads_per_worker=1,\n            diagnostics_port=diagnostics_port,\n            memory_limit=memory_limit,\n            **dask_kwargs,\n        )\n\n        # if cluster has bokeh port, notify user if unexpected port number\n        if diagnostics_port is not None:\n            if hasattr(cluster, \"scheduler\") and cluster.scheduler:\n                info = cluster.scheduler.identity()\n                if \"bokeh\" in info[\"services\"]:\n                    msg = \"Dashboard started on port {}\"\n                    print(msg.format(info[\"services\"][\"bokeh\"]))\n\n    client = Client(cluster)\n\n    warned_of_memory = False\n    for worker in list(client.scheduler_info()[\"workers\"].values()):\n        worker_limit = worker[\"memory_limit\"]\n        if worker_limit < entityset_size:\n            raise ValueError(\"Insufficient memory to use this many workers\")\n        elif worker_limit < 2 * entityset_size and not warned_of_memory:\n            logger.warning(\n                \"Worker memory is between 1 to 2 times the memory\"\n                \" size of the EntitySet. If errors occur that do\"\n                \" not occur with n_jobs equals 1, this may be the \"\n                \"cause.  
See https://featuretools.alteryx.com/en/stable/guides/performance.html#parallel-feature-computation\"\n                \" for more information.\",\n            )\n            warned_of_memory = True\n\n    return client, cluster\n\n\ndef get_client_cluster():\n    \"\"\"\n    Separated out the imports to make it easier to mock during testing\n    \"\"\"\n    distributed = import_or_none(\"distributed\")\n    Client = distributed.Client\n    LocalCluster = distributed.LocalCluster\n\n    return Client, LocalCluster\n\n\nCutoffTimeType = typing.Union[pd.DataFrame, str, datetime]\n\n\ndef _validate_cutoff_time(\n    cutoff_time: CutoffTimeType,\n    target_dataframe,\n):\n    \"\"\"\n    Verify that the cutoff time is a single value or a pandas dataframe with the proper columns\n    containing no duplicate rows\n    \"\"\"\n    if isinstance(cutoff_time, pd.DataFrame):\n        cutoff_time = cutoff_time.reset_index(drop=True)\n\n        if \"instance_id\" not in cutoff_time.columns:\n            if target_dataframe.ww.index not in cutoff_time.columns:\n                raise AttributeError(\n                    \"Cutoff time DataFrame must contain a column with either the same name\"\n                    ' as the target dataframe index or a column named \"instance_id\"',\n                )\n            # rename to instance_id\n            cutoff_time.rename(\n                columns={target_dataframe.ww.index: \"instance_id\"},\n                inplace=True,\n            )\n\n        if \"time\" not in cutoff_time.columns:\n            if (\n                target_dataframe.ww.time_index\n                and target_dataframe.ww.time_index not in cutoff_time.columns\n            ):\n                raise AttributeError(\n                    \"Cutoff time DataFrame must contain a column with either the same name\"\n                    ' as the target dataframe time_index or a column named \"time\"',\n                )\n            # rename to time\n            
cutoff_time.rename(\n                columns={target_dataframe.ww.time_index: \"time\"},\n                inplace=True,\n            )\n\n        # Make sure user supplies only one valid name for instance id and time columns\n        if (\n            \"instance_id\" in cutoff_time.columns\n            and target_dataframe.ww.index in cutoff_time.columns\n            and \"instance_id\" != target_dataframe.ww.index\n        ):\n            raise AttributeError(\n                'Cutoff time DataFrame cannot contain both a column named \"instance_id\" and a column'\n                \" with the same name as the target dataframe index\",\n            )\n        if (\n            \"time\" in cutoff_time.columns\n            and target_dataframe.ww.time_index in cutoff_time.columns\n            and \"time\" != target_dataframe.ww.time_index\n        ):\n            raise AttributeError(\n                'Cutoff time DataFrame cannot contain both a column named \"time\" and a column'\n                \" with the same name as the target dataframe time index\",\n            )\n\n        assert (\n            cutoff_time[[\"instance_id\", \"time\"]].duplicated().sum() == 0\n        ), \"Duplicated rows in cutoff time dataframe.\"\n    if isinstance(cutoff_time, str):\n        try:\n            cutoff_time = pd.to_datetime(cutoff_time)\n        except ValueError as e:\n            raise ValueError(f\"While parsing cutoff_time: {str(e)}\")\n        except OverflowError as e:\n            raise OverflowError(f\"While parsing cutoff_time: {str(e)}\")\n    else:\n        if isinstance(cutoff_time, list):\n            raise TypeError(\"cutoff_time must be a single value or DataFrame\")\n\n    return cutoff_time\n\n\ndef _check_cutoff_time_type(cutoff_time, es_time_type):\n    \"\"\"\n    Check that the cutoff time values are of the proper type given the entityset time type\n    \"\"\"\n    # Check that cutoff_time time type matches entityset time type\n    if 
isinstance(cutoff_time, tuple):\n        cutoff_time_value = cutoff_time[0]\n        time_type = _check_time_type(cutoff_time_value)\n        is_numeric = time_type == \"numeric\"\n        is_datetime = time_type == Datetime\n    else:\n        cutoff_time_col = cutoff_time.ww[\"time\"]\n        is_numeric = cutoff_time_col.ww.schema.is_numeric\n        is_datetime = cutoff_time_col.ww.schema.is_datetime\n\n    if es_time_type == \"numeric\" and not is_numeric:\n        raise TypeError(\n            \"cutoff_time times must be numeric: try casting \" \"via pd.to_numeric()\",\n        )\n    if es_time_type == Datetime and not is_datetime:\n        raise TypeError(\n            \"cutoff_time times must be datetime type: try casting \"\n            \"via pd.to_datetime()\",\n        )\n\n\ndef replace_inf_values(feature_matrix, replacement_value=np.nan, columns=None):\n    \"\"\"Replace all ``np.inf`` values in a feature matrix with the specified replacement value.\n\n    Args:\n        feature_matrix (DataFrame): DataFrame whose columns are feature names and rows are instances\n        replacement_value (int, float, str, optional): Value with which ``np.inf`` values will be replaced\n        columns (list[str], optional): A list specifying which columns should have values replaced. 
If None,\n            values will be replaced for all columns.\n\n    Returns:\n        feature_matrix\n\n    \"\"\"\n    if columns is None:\n        feature_matrix = feature_matrix.replace([np.inf, -np.inf], replacement_value)\n    else:\n        feature_matrix[columns] = feature_matrix[columns].replace(\n            [np.inf, -np.inf],\n            replacement_value,\n        )\n    return feature_matrix\n\n\ndef get_ww_types_from_features(\n    features,\n    entityset,\n    pass_columns=None,\n    cutoff_time=None,\n):\n    \"\"\"Given a list of features and entityset (and optionally a list of pass\n    through columns and the cutoff time dataframe), returns the logical types,\n    semantic tags, and origin of each column in the feature matrix. Both\n    pass_columns and cutoff_time will need to be supplied in order to get the\n    type information for the pass through columns.\n    \"\"\"\n    if pass_columns is None:\n        pass_columns = []\n    logical_types = {}\n    semantic_tags = {}\n    origins = {}\n\n    for feature in features:\n        names = feature.get_feature_names()\n        for name in names:\n            logical_types[name] = feature.column_schema.logical_type\n            semantic_tags[name] = feature.column_schema.semantic_tags.copy()\n            semantic_tags[name] -= {\"index\", \"time_index\"}\n\n            if logical_types[name] is None and \"numeric\" in semantic_tags[name]:\n                logical_types[name] = Double\n            if all([f.primitive is None for f in feature.get_dependencies(deep=True)]):\n                origins[name] = \"base\"\n            else:\n                origins[name] = \"engineered\"\n\n    if pass_columns:\n        cutoff_schema = cutoff_time.ww.schema\n        for column in pass_columns:\n            logical_types[column] = cutoff_schema.logical_types[column]\n            semantic_tags[column] = cutoff_schema.semantic_tags[column]\n            origins[column] = \"base\"\n\n    ww_init = {\n        
\"logical_types\": logical_types,\n        \"semantic_tags\": semantic_tags,\n        \"column_origins\": origins,\n    }\n    return ww_init\n"
  },
  {
    "path": "featuretools/config_init.py",
    "content": "import copy\nimport logging\nimport os\nimport sys\n\n\ndef initialize_logging():\n    loggers = {}\n\n    # Check for environmental variables\n    logger_env_vars = {\n        \"FEATURETOOLS_LOG_LEVEL\": \"featuretools\",\n        \"FEATURETOOLS_ES_LOG_LEVEL\": \"featuretools.entityset\",\n        \"FEATURETOOLS_BACKEND_LOG_LEVEL\": \"featuretools.computation_backend\",\n    }\n    for logger_env, logger in logger_env_vars.items():\n        log_level = os.environ.get(logger_env, None)\n        if log_level is not None:\n            loggers[logger] = log_level\n\n    # Set log level to info if not otherwise specified.\n    loggers.setdefault(\"featuretools\", \"info\")\n    loggers.setdefault(\"featuretools.computation_backend\", \"info\")\n    loggers.setdefault(\"featuretools.entityset\", \"info\")\n\n    fmt = \"%(asctime)-15s %(name)s - %(levelname)s    %(message)s\"\n    out_handler = logging.StreamHandler(sys.stdout)\n    err_handler = logging.StreamHandler(sys.stdout)\n    out_handler.setFormatter(logging.Formatter(fmt))\n    err_handler.setFormatter(logging.Formatter(fmt))\n    err_levels = [\"WARNING\", \"ERROR\", \"CRITICAL\"]\n\n    for name, level in list(loggers.items()):\n        LEVEL = getattr(logging, level.upper())\n        logger = logging.getLogger(name)\n        logger.setLevel(LEVEL)\n        for _handler in logger.handlers:\n            logger.removeHandler(_handler)\n\n        if level in err_levels:\n            logger.addHandler(err_handler)\n        else:\n            logger.addHandler(out_handler)\n        logger.propagate = False\n\n\ninitialize_logging()\n\n\nclass Config:\n    def __init__(self):\n        self._data = {}\n        self.set_to_default()\n\n    def set_to_default(self):\n        PWD = os.path.dirname(__file__)\n        primitive_data_folder = os.path.join(PWD, \"primitives/data\")\n        self._data = {\n            \"primitive_data_folder\": primitive_data_folder,\n        }\n\n    def get(self, key):\n 
       return copy.deepcopy(self._data[key])\n\n    def get_all(self):\n        return copy.deepcopy(self._data)\n\n    def set(self, values):\n        self._data.update(values)\n\n\nconfig = Config()\n"
  },
  {
    "path": "featuretools/demo/__init__.py",
    "content": "# flake8: noqa\nfrom featuretools.demo.api import *\n"
  },
  {
    "path": "featuretools/demo/api.py",
    "content": "# flake8: noqa\nfrom featuretools.demo.flight import load_flight\nfrom featuretools.demo.mock_customer import load_mock_customer\nfrom featuretools.demo.retail import load_retail\nfrom featuretools.demo.weather import load_weather\n"
  },
  {
    "path": "featuretools/demo/flight.py",
    "content": "import math\nimport re\n\nimport pandas as pd\nfrom tqdm import tqdm\nfrom woodwork.logical_types import Boolean, Categorical, Ordinal\n\nimport featuretools as ft\n\n\ndef load_flight(\n    month_filter=None,\n    categorical_filter=None,\n    nrows=None,\n    demo=True,\n    return_single_table=False,\n    verbose=False,\n):\n    \"\"\"\n    Download, clean, and filter flight data from 2017.\n    The original dataset can be found `here <https://www.transtats.bts.gov/ot_delay/ot_delaycause1.asp>`_.\n\n    Args:\n\n        month_filter (list[int]): Only use data from these months (example is ``[1, 2]``).\n            To skip, set to None.\n        categorical_filter (dict[str->str]): Use only specified categorical values.\n            Example is ``{'dest_city': ['Boston, MA'], 'origin_city': ['Boston, MA']}``\n            which returns all flights in OR out of Boston. To skip, set to None.\n        nrows (int): Passed to nrows in ``pd.read_csv``. Used before filtering.\n        demo (bool): Use only two months of data. If False, use the whole year.\n        return_single_table (bool): Exit the function early and return a dataframe.\n        verbose (bool): Show a progress bar while loading the data.\n\n    Examples:\n\n        .. 
ipython::\n            :verbatim:\n\n            In [1]: import featuretools as ft\n\n            In [2]: es = ft.demo.load_flight(verbose=True,\n               ...:                          month_filter=[1],\n               ...:                          categorical_filter={'origin_city':['Boston, MA']})\n            100%|xxxxxxxxxxxxxxxxxxxxxxxxx| 100/100 [01:16<00:00,  1.31it/s]\n\n            In [3]: es\n            Out[3]:\n            Entityset: Flight Data\n              DataFrames:\n                airports [Rows: 55, Columns: 3]\n                flights [Rows: 613, Columns: 9]\n                trip_logs [Rows: 9456, Columns: 22]\n                airlines [Rows: 10, Columns: 1]\n              Relationships:\n                trip_logs.flight_id -> flights.flight_id\n                flights.carrier -> airlines.carrier\n                flights.dest -> airports.dest\n    \"\"\"\n\n    filename, csv_length = get_flight_filename(demo=demo)\n\n    print(\"Downloading data ...\")\n    url = \"https://oss.alteryx.com/datasets/{}?library=featuretools&version={}\".format(\n        filename,\n        ft.__version__,\n    )\n\n    chunksize = math.ceil(csv_length / 99)\n    pd.options.display.max_columns = 200\n    iter_csv = pd.read_csv(\n        url,\n        compression=\"zip\",\n        iterator=True,\n        nrows=nrows,\n        chunksize=chunksize,\n    )\n    if verbose:\n        iter_csv = tqdm(iter_csv, total=100)\n\n    partial_df_list = []\n    for chunk in iter_csv:\n        df = filter_data(\n            _clean_data(chunk),\n            month_filter=month_filter,\n            categorical_filter=categorical_filter,\n        )\n        partial_df_list.append(df)\n    data = pd.concat(partial_df_list)\n\n    if return_single_table:\n        return data\n\n    es = make_es(data)\n\n    return es\n\n\ndef make_es(data):\n    es = ft.EntitySet(\"Flight Data\")\n    arr_time_columns = [\n        \"arr_delay\",\n        \"dep_delay\",\n        \"carrier_delay\",\n 
       \"weather_delay\",\n        \"national_airspace_delay\",\n        \"security_delay\",\n        \"late_aircraft_delay\",\n        \"canceled\",\n        \"diverted\",\n        \"taxi_in\",\n        \"taxi_out\",\n        \"air_time\",\n        \"dep_time\",\n    ]\n\n    logical_types = {\n        \"flight_num\": Categorical,\n        \"distance_group\": Ordinal(order=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]),\n        \"canceled\": Boolean,\n        \"diverted\": Boolean,\n    }\n\n    es.add_dataframe(\n        data,\n        dataframe_name=\"trip_logs\",\n        index=\"trip_log_id\",\n        make_index=True,\n        time_index=\"date_scheduled\",\n        secondary_time_index={\"arr_time\": arr_time_columns},\n        logical_types=logical_types,\n    )\n\n    es.normalize_dataframe(\n        \"trip_logs\",\n        \"flights\",\n        \"flight_id\",\n        additional_columns=[\n            \"origin\",\n            \"origin_city\",\n            \"origin_state\",\n            \"dest\",\n            \"dest_city\",\n            \"dest_state\",\n            \"distance_group\",\n            \"carrier\",\n            \"flight_num\",\n        ],\n    )\n\n    es.normalize_dataframe(\"flights\", \"airlines\", \"carrier\", make_time_index=False)\n\n    es.normalize_dataframe(\n        \"flights\",\n        \"airports\",\n        \"dest\",\n        additional_columns=[\"dest_city\", \"dest_state\"],\n        make_time_index=False,\n    )\n    return es\n\n\ndef _clean_data(data):\n    # Make column names snake case\n    clean_data = data.rename(columns={col: convert(col) for col in data})\n\n    # Chance crs -> \"scheduled\" and other minor clarifications\n    clean_data = clean_data.rename(\n        columns={\n            \"crs_arr_time\": \"scheduled_arr_time\",\n            \"crs_dep_time\": \"scheduled_dep_time\",\n            \"crs_elapsed_time\": \"scheduled_elapsed_time\",\n            \"nas_delay\": \"national_airspace_delay\",\n            
\"origin_city_name\": \"origin_city\",\n            \"dest_city_name\": \"dest_city\",\n            \"cancelled\": \"canceled\",\n        },\n    )\n\n    # Combine strings like 0130 (1:30 AM) with dates (2017-01-01)\n    clean_data[\"scheduled_dep_time\"] = clean_data[\"scheduled_dep_time\"].apply(\n        lambda x: str(x),\n    ) + clean_data[\"flight_date\"].astype(\"str\")\n\n    # Parse combined string as a date\n    clean_data.loc[:, \"scheduled_dep_time\"] = pd.to_datetime(\n        clean_data[\"scheduled_dep_time\"],\n        format=\"%H%M%Y-%m-%d\",\n        errors=\"coerce\",\n    )\n\n    clean_data[\"scheduled_elapsed_time\"] = pd.to_timedelta(\n        clean_data[\"scheduled_elapsed_time\"],\n        unit=\"m\",\n    )\n\n    clean_data = _reconstruct_times(clean_data)\n\n    # Create a time index 6 months before scheduled_dep\n    clean_data.loc[:, \"date_scheduled\"] = pd.to_datetime(\n        clean_data[\"scheduled_dep_time\"],\n    ).dt.date - pd.Timedelta(\"120d\")\n\n    # A null entry for a delay means no delay\n    clean_data = _fill_labels(clean_data)\n\n    # Nulls for scheduled values are too problematic. Remove them.\n    clean_data = clean_data.dropna(\n        axis=\"rows\",\n        subset=[\"scheduled_dep_time\", \"scheduled_arr_time\"],\n    )\n\n    # Make a flight id. Define a flight as a combination of:\n    # 1. carrier 2. flight number 3. origin airport 4. 
dest airport\n    clean_data.loc[:, \"flight_id\"] = (\n        clean_data[\"carrier\"]\n        + \"-\"\n        + clean_data[\"flight_num\"].apply(lambda x: str(x))\n        + \":\"\n        + clean_data[\"origin\"]\n        + \"->\"\n        + clean_data[\"dest\"]\n    )\n\n    column_order = [\n        \"flight_id\",\n        \"flight_num\",\n        \"date_scheduled\",\n        \"scheduled_dep_time\",\n        \"scheduled_arr_time\",\n        \"carrier\",\n        \"origin\",\n        \"origin_city\",\n        \"origin_state\",\n        \"dest\",\n        \"dest_city\",\n        \"dest_state\",\n        \"distance_group\",\n        \"dep_time\",\n        \"arr_time\",\n        \"dep_delay\",\n        \"taxi_out\",\n        \"taxi_in\",\n        \"arr_delay\",\n        \"diverted\",\n        \"scheduled_elapsed_time\",\n        \"air_time\",\n        \"distance\",\n        \"carrier_delay\",\n        \"weather_delay\",\n        \"national_airspace_delay\",\n        \"security_delay\",\n        \"late_aircraft_delay\",\n        \"canceled\",\n    ]\n\n    clean_data = clean_data[column_order]\n\n    return clean_data\n\n\ndef _fill_labels(clean_data):\n    labely_columns = [\n        \"arr_delay\",\n        \"dep_delay\",\n        \"carrier_delay\",\n        \"weather_delay\",\n        \"national_airspace_delay\",\n        \"security_delay\",\n        \"late_aircraft_delay\",\n        \"canceled\",\n        \"diverted\",\n        \"taxi_in\",\n        \"taxi_out\",\n        \"air_time\",\n    ]\n    for col in labely_columns:\n        clean_data.loc[:, col] = clean_data[col].fillna(0)\n\n    return clean_data\n\n\ndef _reconstruct_times(clean_data):\n    \"\"\"Reconstruct departure_time, scheduled_dep_time,\n    arrival_time and scheduled_arr_time by adding known delays\n    to known times. 
We do:\n        - dep_time is scheduled_dep + dep_delay\n        - arr_time is dep_time + taxiing and air_time\n        - scheduled arrival is scheduled_dep + scheduled_elapsed\n    \"\"\"\n    clean_data.loc[:, \"dep_time\"] = clean_data[\"scheduled_dep_time\"] + pd.to_timedelta(\n        clean_data[\"dep_delay\"],\n        unit=\"m\",\n    )\n\n    clean_data.loc[:, \"arr_time\"] = clean_data[\"dep_time\"] + pd.to_timedelta(\n        clean_data[\"taxi_out\"] + clean_data[\"air_time\"] + clean_data[\"taxi_in\"],\n        unit=\"m\",\n    )\n\n    clean_data.loc[:, \"scheduled_arr_time\"] = (\n        clean_data[\"scheduled_dep_time\"] + clean_data[\"scheduled_elapsed_time\"]\n    )\n    return clean_data\n\n\ndef filter_data(clean_data, month_filter=None, categorical_filter=None):\n    if month_filter is not None:\n        tmp = pd.to_datetime(clean_data[\"scheduled_dep_time\"]).dt.month.isin(\n            month_filter,\n        )\n        clean_data = clean_data[tmp]\n\n    if categorical_filter is not None:\n        tmp = False\n        for key, values in categorical_filter.items():\n            tmp = tmp | clean_data[key].isin(values)\n        clean_data = clean_data[tmp]\n\n    return clean_data\n\n\ndef convert(name):\n    # Rename columns to underscore\n    # Code via SO https://stackoverflow.com/questions/1175208/elegant-python-function-to-convert-camelcase-to-snake-case\n    s1 = re.sub(\"(.)([A-Z][a-z]+)\", r\"\\1_\\2\", name)\n    return re.sub(\"([a-z0-9])([A-Z])\", r\"\\1_\\2\", s1).lower()\n\n\ndef get_flight_filename(demo=True):\n    if demo:\n        filename = SMALL_FLIGHT_CSV\n        rows = 860457\n    else:\n        filename = BIG_FLIGHT_CSV\n        rows = 5162742\n\n    return filename, rows\n\n\nSMALL_FLIGHT_CSV = \"data_2017_jan_feb.csv.zip\"\nBIG_FLIGHT_CSV = \"data_all_2017.csv.zip\"\n"
  },
  {
    "path": "featuretools/demo/mock_customer.py",
    "content": "import pandas as pd\nfrom numpy import random\nfrom numpy.random import choice\nfrom woodwork.logical_types import Categorical, PostalCode\n\nimport featuretools as ft\n\n\ndef load_mock_customer(\n    n_customers=5,\n    n_products=5,\n    n_sessions=35,\n    n_transactions=500,\n    random_seed=0,\n    return_single_table=False,\n    return_entityset=False,\n):\n    \"\"\"Return dataframes of mock customer data\"\"\"\n\n    random.seed(random_seed)\n    last_date = pd.to_datetime(\"12/31/2013\")\n    first_date = pd.to_datetime(\"1/1/2008\")\n    first_bday = pd.to_datetime(\"1/1/1970\")\n\n    join_dates = [\n        random.uniform(0, 1) * (last_date - first_date) + first_date\n        for _ in range(n_customers)\n    ]\n    birth_dates = [\n        random.uniform(0, 1) * (first_date - first_bday) + first_bday\n        for _ in range(n_customers)\n    ]\n\n    customers_df = pd.DataFrame({\"customer_id\": range(1, n_customers + 1)})\n    customers_df[\"zip_code\"] = choice(\n        [\"60091\", \"13244\"],\n        n_customers,\n    )\n    customers_df[\"join_date\"] = pd.Series(join_dates).dt.round(\"1s\")\n    customers_df[\"birthday\"] = pd.Series(birth_dates).dt.round(\"1d\")\n\n    products_df = pd.DataFrame({\"product_id\": pd.Categorical(range(1, n_products + 1))})\n    products_df[\"brand\"] = choice([\"A\", \"B\", \"C\"], n_products)\n\n    sessions_df = pd.DataFrame({\"session_id\": range(1, n_sessions + 1)})\n    sessions_df[\"customer_id\"] = choice(customers_df[\"customer_id\"], n_sessions)\n    sessions_df[\"device\"] = choice([\"desktop\", \"mobile\", \"tablet\"], n_sessions)\n\n    transactions_df = pd.DataFrame({\"transaction_id\": range(1, n_transactions + 1)})\n    transactions_df[\"session_id\"] = choice(sessions_df[\"session_id\"], n_transactions)\n    transactions_df = transactions_df.sort_values(\"session_id\").reset_index(drop=True)\n    transactions_df[\"transaction_time\"] = pd.date_range(\n        \"1/1/2014\",\n        
periods=n_transactions,\n        freq=\"65s\",\n    )  # todo make these less regular\n    transactions_df[\"product_id\"] = pd.Categorical(\n        choice(products_df[\"product_id\"], n_transactions),\n    )\n    transactions_df[\"amount\"] = random.randint(500, 15000, n_transactions) / 100\n\n    # calculate and merge in session start\n    # based on the times we came up with for transactions\n    session_starts = transactions_df.drop_duplicates(\"session_id\")[\n        [\"session_id\", \"transaction_time\"]\n    ].rename(columns={\"transaction_time\": \"session_start\"})\n    sessions_df = sessions_df.merge(session_starts)\n\n    if return_single_table:\n        return (\n            transactions_df.merge(sessions_df)\n            .merge(customers_df)\n            .merge(products_df)\n            .reset_index(drop=True)\n        )\n    elif return_entityset:\n        es = ft.EntitySet(id=\"transactions\")\n        es = es.add_dataframe(\n            dataframe_name=\"transactions\",\n            dataframe=transactions_df,\n            index=\"transaction_id\",\n            time_index=\"transaction_time\",\n            logical_types={\"product_id\": Categorical},\n        )\n\n        es = es.add_dataframe(\n            dataframe_name=\"products\",\n            dataframe=products_df,\n            index=\"product_id\",\n        )\n\n        es = es.add_dataframe(\n            dataframe_name=\"sessions\",\n            dataframe=sessions_df,\n            index=\"session_id\",\n            time_index=\"session_start\",\n        )\n\n        es = es.add_dataframe(\n            dataframe_name=\"customers\",\n            dataframe=customers_df,\n            index=\"customer_id\",\n            time_index=\"join_date\",\n            logical_types={\"zip_code\": PostalCode},\n        )\n\n        rels = [\n            (\"products\", \"product_id\", \"transactions\", \"product_id\"),\n            (\"sessions\", \"session_id\", \"transactions\", \"session_id\"),\n           
 (\"customers\", \"customer_id\", \"sessions\", \"customer_id\"),\n        ]\n        es = es.add_relationships(rels)\n        es.add_last_time_indexes()\n        return es\n\n    return {\n        \"customers\": customers_df,\n        \"sessions\": sessions_df,\n        \"transactions\": transactions_df,\n        \"products\": products_df,\n    }\n"
  },
  {
    "path": "featuretools/demo/retail.py",
    "content": "import pandas as pd\nfrom woodwork.logical_types import NaturalLanguage\n\nimport featuretools as ft\n\n\ndef load_retail(id=\"demo_retail_data\", nrows=None, return_single_table=False):\n    \"\"\"Returns the retail entityset example.\n    The original dataset can be found `here <https://archive.ics.uci.edu/ml/datasets/online+retail>`_.\n\n    We have also made some modifications to the data. We\n    changed the column names, converted the ``customer_id``\n    to a unique fake ``customer_name``, dropped duplicates,\n    added columns for ``total`` and ``cancelled`` and\n    converted amounts from GBP to USD. You can download the modified CSV in gz `compressed (7 MB)\n    <https://oss.alteryx.com/datasets/online-retail-logs-2018-08-28.csv.gz>`_\n    or `uncompressed (43 MB)\n    <https://oss.alteryx.com/datasets/online-retail-logs-2018-08-28.csv>`_ formats.\n\n    Args:\n        id (str):  Id to assign to EntitySet.\n        nrows (int):  Number of rows to load of the underlying CSV.\n            If None, load all.\n        return_single_table (bool): If True, return a CSV rather than an EntitySet. Default is False.\n\n    Examples:\n\n        .. ipython::\n            :verbatim:\n\n            In [1]: import featuretools as ft\n\n            In [2]: es = ft.demo.load_retail()\n\n            In [3]: es\n            Out[3]:\n            Entityset: demo_retail_data\n              DataFrames:\n                orders (shape = [22190, 3])\n                products (shape = [3684, 3])\n                customers (shape = [4372, 2])\n                order_products (shape = [401704, 7])\n\n        Load in subset of data\n\n        .. 
ipython::\n            :verbatim:\n\n            In [4]: es = ft.demo.load_retail(nrows=1000)\n\n            In [5]: es\n            Out[5]:\n            Entityset: demo_retail_data\n              DataFrames:\n                orders (shape = [67, 5])\n                products (shape = [606, 3])\n                customers (shape = [50, 2])\n                order_products (shape = [1000, 7])\n    \"\"\"\n    es = ft.EntitySet(id)\n    csv_s3_gz = (\n        \"https://oss.alteryx.com/datasets/online-retail-logs-2018-08-28.csv.gz?library=featuretools&version=\"\n        + ft.__version__\n    )\n    csv_s3 = (\n        \"https://oss.alteryx.com/datasets/online-retail-logs-2018-08-28.csv?library=featuretools&version=\"\n        + ft.__version__\n    )\n    # Try to read in gz compressed file\n    try:\n        df = pd.read_csv(csv_s3_gz, nrows=nrows, parse_dates=[\"order_date\"])\n    # Fall back to uncompressed\n    except Exception:\n        df = pd.read_csv(csv_s3, nrows=nrows, parse_dates=[\"order_date\"])\n    if return_single_table:\n        return df\n\n    es.add_dataframe(\n        dataframe_name=\"order_products\",\n        dataframe=df,\n        index=\"order_product_id\",\n        make_index=True,\n        time_index=\"order_date\",\n        logical_types={\"description\": NaturalLanguage},\n    )\n\n    es.normalize_dataframe(\n        new_dataframe_name=\"products\",\n        base_dataframe_name=\"order_products\",\n        index=\"product_id\",\n        additional_columns=[\"description\"],\n    )\n\n    es.normalize_dataframe(\n        new_dataframe_name=\"orders\",\n        base_dataframe_name=\"order_products\",\n        index=\"order_id\",\n        additional_columns=[\"customer_name\", \"country\", \"cancelled\"],\n    )\n\n    es.normalize_dataframe(\n        new_dataframe_name=\"customers\",\n        base_dataframe_name=\"orders\",\n        index=\"customer_name\",\n    )\n    es.add_last_time_indexes()\n\n    return es\n"
  },
  {
    "path": "featuretools/demo/weather.py",
    "content": "import pandas as pd\n\nimport featuretools as ft\n\n\ndef load_weather(nrows=None, return_single_table=False):\n    \"\"\"\n    Load the Australian daily-min-temperatures weather dataset.\n\n    Args:\n\n        nrows (int): Passed to nrows in ``pd.read_csv``.\n        return_single_table (bool): Exit the function early and return a dataframe.\n\n    \"\"\"\n    filename = \"daily-min-temperatures.csv\"\n    print(\"Downloading data ...\")\n    url = \"https://oss.alteryx.com/datasets/{}?library=featuretools&version={}\".format(\n        filename,\n        ft.__version__,\n    )\n    data = pd.read_csv(url, index_col=None, nrows=nrows)\n    if return_single_table:\n        return data\n    es = make_es(data)\n    return es\n\n\ndef make_es(data):\n    es = ft.EntitySet(\"Weather Data\")\n\n    es.add_dataframe(\n        data,\n        dataframe_name=\"temperatures\",\n        index=\"id\",\n        make_index=True,\n        time_index=\"Date\",\n    )\n    return es\n"
  },
  {
    "path": "featuretools/entityset/__init__.py",
    "content": "# flake8: noqa\nfrom featuretools.entityset.api import *\n"
  },
  {
    "path": "featuretools/entityset/api.py",
    "content": "# flake8: noqa\nfrom featuretools.entityset.deserialize import read_entityset\nfrom featuretools.entityset.entityset import EntitySet\nfrom featuretools.entityset.relationship import Relationship\nfrom featuretools.entityset.timedelta import Timedelta\n"
  },
  {
    "path": "featuretools/entityset/deserialize.py",
    "content": "import json\nimport os\nimport tarfile\nimport tempfile\nfrom inspect import getfullargspec\n\nimport pandas as pd\nimport woodwork.type_sys.type_system as ww_type_system\nfrom woodwork.deserialize import read_woodwork_table\n\nfrom featuretools.entityset.relationship import Relationship\nfrom featuretools.utils.s3_utils import get_transport_params, use_smartopen_es\nfrom featuretools.utils.schema_utils import check_schema_version\nfrom featuretools.utils.wrangle import _is_local_tar, _is_s3, _is_url\n\n\ndef description_to_entityset(description, **kwargs):\n    \"\"\"Deserialize entityset from data description.\n\n    Args:\n        description (dict) : Description of an :class:`.EntitySet`. Likely generated using :meth:`.serialize.entityset_to_description`\n        kwargs (keywords): Additional keyword arguments to pass as keywords arguments to the underlying deserialization method.\n\n    Returns:\n        entityset (EntitySet) : Instance of :class:`.EntitySet`.\n    \"\"\"\n    check_schema_version(description, \"entityset\")\n\n    from featuretools.entityset import EntitySet\n\n    # If data description was not read from disk, path is None.\n    path = description.get(\"path\")\n    entityset = EntitySet(description[\"id\"])\n\n    for df in description[\"dataframes\"].values():\n        if path is not None:\n            data_path = os.path.join(path, \"data\", df[\"name\"])\n            format = description.get(\"format\")\n            if format is not None:\n                kwargs[\"format\"] = format\n                if format == \"parquet\" and df[\"loading_info\"][\"table_type\"] == \"pandas\":\n                    kwargs[\"filename\"] = df[\"name\"] + \".parquet\"\n            dataframe = read_woodwork_table(data_path, validate=False, **kwargs)\n        else:\n            dataframe = empty_dataframe(df)\n\n        entityset.add_dataframe(dataframe)\n\n    for relationship in description[\"relationships\"]:\n        rel = 
Relationship.from_dictionary(relationship, entityset)\n        entityset.add_relationship(relationship=rel)\n\n    return entityset\n\n\ndef empty_dataframe(description):\n    \"\"\"Deserialize empty dataframe from dataframe description.\n\n    Args:\n        description (dict) : Description of dataframe.\n\n    Returns:\n        df (DataFrame) : Empty dataframe with Woodwork initialized.\n    \"\"\"\n    # TODO: Can we update Woodwork to return an empty initialized dataframe from a description\n    # instead of using this function? Or otherwise eliminate? Issue #1476\n    logical_types = {}\n    semantic_tags = {}\n    column_descriptions = {}\n    column_metadata = {}\n    use_standard_tags = {}\n    category_dtypes = {}\n    columns = []\n    for col in description[\"column_typing_info\"]:\n        col_name = col[\"name\"]\n        columns.append(col_name)\n\n        ltype_metadata = col[\"logical_type\"]\n        ltype = ww_type_system.str_to_logical_type(\n            ltype_metadata[\"type\"],\n            params=ltype_metadata[\"parameters\"],\n        )\n\n        tags = col[\"semantic_tags\"]\n\n        if \"index\" in tags:\n            tags.remove(\"index\")\n        elif \"time_index\" in tags:\n            tags.remove(\"time_index\")\n\n        logical_types[col_name] = ltype\n        semantic_tags[col_name] = tags\n        column_descriptions[col_name] = col[\"description\"]\n        column_metadata[col_name] = col[\"metadata\"]\n        use_standard_tags[col_name] = col[\"use_standard_tags\"]\n\n        if col[\"physical_type\"][\"type\"] == \"category\":\n            # Make sure categories are recreated properly\n            cat_values = col[\"physical_type\"][\"cat_values\"]\n            cat_dtype = col[\"physical_type\"][\"cat_dtype\"]\n            cat_object = pd.CategoricalDtype(pd.Index(cat_values, dtype=cat_dtype))\n            category_dtypes[col_name] = cat_object\n\n    dataframe = pd.DataFrame(columns=columns).astype(category_dtypes)\n\n    
dataframe.ww.init(\n        name=description.get(\"name\"),\n        index=description.get(\"index\"),\n        time_index=description.get(\"time_index\"),\n        logical_types=logical_types,\n        semantic_tags=semantic_tags,\n        use_standard_tags=use_standard_tags,\n        table_metadata=description.get(\"table_metadata\"),\n        column_metadata=column_metadata,\n        column_descriptions=column_descriptions,\n        validate=False,\n    )\n\n    return dataframe\n\n\ndef read_data_description(path):\n    \"\"\"Read data description from disk, S3 path, or URL.\n\n    Args:\n        path (str): Location on disk, S3 path, or URL to read `data_description.json`.\n\n    Returns:\n        description (dict) : Description of :class:`.EntitySet`.\n    \"\"\"\n\n    path = os.path.abspath(path)\n    assert os.path.exists(path), '\"{}\" does not exist'.format(path)\n    filepath = os.path.join(path, \"data_description.json\")\n    with open(filepath, \"r\") as file:\n        description = json.load(file)\n    description[\"path\"] = path\n    return description\n\n\ndef read_entityset(path, profile_name=None, **kwargs):\n    \"\"\"Read entityset from disk, S3 path, or URL.\n\n    NOTE: Never attempt to read an archived EntitySet from an untrusted source.\n\n    Args:\n        path (str): Directory on disk, S3 path, or URL to read `data_description.json`.\n        profile_name (str, bool): The AWS profile specified to read from S3. 
Will default to None and search for AWS credentials.\n            Set to False to use an anonymous profile.\n        kwargs (keywords): Additional keyword arguments to pass as keyword arguments to the underlying deserialization method.\n    \"\"\"\n    if _is_url(path) or _is_s3(path) or _is_local_tar(str(path)):\n        with tempfile.TemporaryDirectory() as tmpdir:\n            local_path = path\n            transport_params = None\n\n            if _is_s3(path):\n                transport_params = get_transport_params(profile_name)\n\n            if _is_s3(path) or _is_url(path):\n                local_path = os.path.join(tmpdir, \"temporary_es\")\n                use_smartopen_es(local_path, path, transport_params)\n\n            with tarfile.open(str(local_path)) as tar:\n                if \"filter\" in getfullargspec(tar.extractall).kwonlyargs:\n                    tar.extractall(path=tmpdir, filter=\"data\")\n                else:\n                    raise RuntimeError(\n                        \"Please upgrade your Python version to the latest patch release to allow for safe extraction of the EntitySet archive.\",\n                    )\n\n            data_description = read_data_description(tmpdir)\n            return description_to_entityset(data_description, **kwargs)\n    else:\n        data_description = read_data_description(path)\n        return description_to_entityset(data_description, **kwargs)\n"
  },
  {
    "path": "featuretools/entityset/entityset.py",
    "content": "import copy\nimport logging\nimport warnings\nfrom collections import defaultdict\n\nimport numpy as np\nimport pandas as pd\nfrom woodwork import init_series\nfrom woodwork.logical_types import Datetime, LatLong\n\nfrom featuretools.entityset import deserialize, serialize\nfrom featuretools.entityset.relationship import Relationship, RelationshipPath\nfrom featuretools.feature_base.feature_base import _ES_REF\nfrom featuretools.utils.plot_utils import (\n    check_graphviz,\n    get_graphviz_format,\n    save_graph,\n)\nfrom featuretools.utils.wrangle import _check_timedelta\n\npd.options.mode.chained_assignment = None  # default='warn'\nlogger = logging.getLogger(\"featuretools.entityset\")\n\nLTI_COLUMN_NAME = \"_ft_last_time\"\nWW_SCHEMA_KEY = \"_ww__getstate__schemas\"\n\n\nclass EntitySet(object):\n    \"\"\"\n    Stores all actual data and typing information for an entityset\n\n    Attributes:\n        id\n        dataframe_dict\n        relationships\n        time_type\n\n    Properties:\n        metadata\n\n    \"\"\"\n\n    def __init__(self, id=None, dataframes=None, relationships=None):\n        \"\"\"Creates EntitySet\n\n        Args:\n            id (str) : Unique identifier to associate with this instance\n            dataframes (dict[str -> tuple(DataFrame, str, str, dict[str -> str/Woodwork.LogicalType], dict[str->str/set], boolean)]):\n                Dictionary of DataFrames. Entries take the format\n                {dataframe name -> (dataframe, index column, time_index, logical_types, semantic_tags, make_index)}.\n                Note that only the dataframe is required. If a Woodwork DataFrame is supplied, any other parameters\n                will be ignored.\n            relationships (list[(str, str, str, str)]): List of relationships\n                between dataframes. 
List items are a tuple with the format\n                (parent dataframe name, parent column, child dataframe name, child column).\n\n        Example:\n\n            .. code-block:: python\n\n                dataframes = {\n                    \"cards\" : (card_df, \"id\"),\n                    \"transactions\" : (transactions_df, \"id\", \"transaction_time\")\n                }\n\n                relationships = [(\"cards\", \"id\", \"transactions\", \"card_id\")]\n\n                ft.EntitySet(\"my-entity-set\", dataframes, relationships)\n        \"\"\"\n        self.id = id\n        self.dataframe_dict = {}\n        self.relationships = []\n        self.time_type = None\n\n        dataframes = dataframes or {}\n        relationships = relationships or []\n        for df_name in dataframes:\n            df = dataframes[df_name][0]\n            if df.ww.schema is not None and df.ww.name != df_name:\n                raise ValueError(\n                    f\"Naming conflict in dataframes dictionary: dictionary key '{df_name}' does not match dataframe name '{df.ww.name}'\",\n                )\n\n            index_column = None\n            time_index = None\n            make_index = False\n            semantic_tags = None\n            logical_types = None\n            if len(dataframes[df_name]) > 1:\n                index_column = dataframes[df_name][1]\n            if len(dataframes[df_name]) > 2:\n                time_index = dataframes[df_name][2]\n            if len(dataframes[df_name]) > 3:\n                logical_types = dataframes[df_name][3]\n            if len(dataframes[df_name]) > 4:\n                semantic_tags = dataframes[df_name][4]\n            if len(dataframes[df_name]) > 5:\n                make_index = dataframes[df_name][5]\n            self.add_dataframe(\n                dataframe_name=df_name,\n                dataframe=df,\n                index=index_column,\n                time_index=time_index,\n                
logical_types=logical_types,\n                semantic_tags=semantic_tags,\n                make_index=make_index,\n            )\n\n        for relationship in relationships:\n            parent_df, parent_column, child_df, child_column = relationship\n            self.add_relationship(parent_df, parent_column, child_df, child_column)\n\n        self.reset_data_description()\n        _ES_REF[self.id] = self\n\n    def __sizeof__(self):\n        return sum([df.__sizeof__() for df in self.dataframes])\n\n    def __dask_tokenize__(self):\n        return (EntitySet, serialize.entityset_to_description(self.metadata))\n\n    def __eq__(self, other, deep=False):\n        if self.id != other.id:\n            return False\n        if self.time_type != other.time_type:\n            return False\n        if len(self.dataframe_dict) != len(other.dataframe_dict):\n            return False\n        for df_name, df in self.dataframe_dict.items():\n            if df_name not in other.dataframe_dict:\n                return False\n            if not df.ww.__eq__(other[df_name].ww, deep=deep):\n                return False\n        if not len(self.relationships) == len(other.relationships):\n            return False\n        for r in self.relationships:\n            if r not in other.relationships:\n                return False\n        return True\n\n    def __ne__(self, other, deep=False):\n        return not self.__eq__(other, deep=deep)\n\n    def __getitem__(self, dataframe_name):\n        \"\"\"Get dataframe instance from entityset\n\n        Args:\n            dataframe_name (str): Name of dataframe.\n\n        Returns:\n            :class:`.DataFrame` : Instance of dataframe with Woodwork typing information. 
Raises KeyError if the dataframe doesn't\n                exist on the entityset.\n        \"\"\"\n        if dataframe_name in self.dataframe_dict:\n            return self.dataframe_dict[dataframe_name]\n        name = self.id or \"entity set\"\n        raise KeyError(\"DataFrame %s does not exist in %s\" % (dataframe_name, name))\n\n    def __deepcopy__(self, memo):\n        cls = self.__class__\n        result = cls.__new__(cls)\n        memo[id(self)] = result\n        for k, v in self.__dict__.items():\n            if k == \"dataframe_dict\":\n                # Copy the DataFrames, retaining Woodwork typing information\n                copied_attr = copy.copy(v)\n                for df_name, df in copied_attr.items():\n                    copied_attr[df_name] = df.ww.copy()\n            else:\n                copied_attr = copy.deepcopy(v, memo)\n\n            setattr(result, k, copied_attr)\n\n        for df in result.dataframe_dict.values():\n            result._add_references_to_metadata(df)\n        return result\n\n    @property\n    def dataframes(self):\n        return list(self.dataframe_dict.values())\n\n    @property\n    def metadata(self):\n        \"\"\"Returns the metadata for this EntitySet. 
The metadata will be recomputed if it does not exist.\"\"\"\n        if self._data_description is None:\n            description = serialize.entityset_to_description(self)\n            self._data_description = deserialize.description_to_entityset(description)\n\n        return self._data_description\n\n    def reset_data_description(self):\n        self._data_description = None\n\n    def to_pickle(self, path, compression=None, profile_name=None):\n        \"\"\"Write entityset in the pickle format, location specified by `path`.\n        Path could be a local path or a S3 path.\n        If writing to S3 a tar archive of files will be written.\n\n        Args:\n            path (str): location on disk to write to (will be created as a directory)\n            compression (str) : Name of the compression to use. Possible values are: {'gzip', 'bz2', 'zip', 'xz', None}.\n            profile_name (str) : Name of AWS profile to use, False to use an anonymous profile, or None.\n        \"\"\"\n        serialize.write_data_description(\n            self,\n            path,\n            format=\"pickle\",\n            compression=compression,\n            profile_name=profile_name,\n        )\n        return self\n\n    def to_parquet(self, path, engine=\"auto\", compression=None, profile_name=None):\n        \"\"\"Write entityset to disk in the parquet format, location specified by `path`.\n        Path could be a local path or a S3 path.\n        If writing to S3 a tar archive of files will be written.\n\n        Args:\n            path (str): location on disk to write to (will be created as a directory)\n            engine (str) : Name of the engine to use. Possible values are: {'auto', 'pyarrow'}.\n            compression (str) : Name of the compression to use. 
Possible values are: {'snappy', 'gzip', 'brotli', None}.\n            profile_name (str) : Name of AWS profile to use, False to use an anonymous profile, or None.\n        \"\"\"\n        serialize.write_data_description(\n            self,\n            path,\n            format=\"parquet\",\n            engine=engine,\n            compression=compression,\n            profile_name=profile_name,\n        )\n        return self\n\n    def to_csv(\n        self,\n        path,\n        sep=\",\",\n        encoding=\"utf-8\",\n        engine=\"python\",\n        compression=None,\n        profile_name=None,\n    ):\n        \"\"\"Write entityset to disk in the csv format, location specified by `path`.\n        Path could be a local path or a S3 path.\n        If writing to S3 a tar archive of files will be written.\n\n        Args:\n            path (str) : Location on disk to write to (will be created as a directory)\n            sep (str) : String of length 1. Field delimiter for the output file.\n            encoding (str) : A string representing the encoding to use in the output file, defaults to 'utf-8'.\n            engine (str) : Name of the engine to use. Possible values are: {'c', 'python'}.\n            compression (str) : Name of the compression to use. 
Possible values are: {'gzip', 'bz2', 'zip', 'xz', None}.\n            profile_name (str) : Name of AWS profile to use, False to use an anonymous profile, or None.\n        \"\"\"\n        serialize.write_data_description(\n            self,\n            path,\n            format=\"csv\",\n            index=False,\n            sep=sep,\n            encoding=encoding,\n            engine=engine,\n            compression=compression,\n            profile_name=profile_name,\n        )\n        return self\n\n    def to_dictionary(self):\n        return serialize.entityset_to_description(self)\n\n    ###########################################################################\n    #   Public getter/setter methods  #########################################\n    ###########################################################################\n\n    def __repr__(self):\n        repr_out = \"Entityset: {}\\n\".format(self.id)\n        repr_out += \"  DataFrames:\"\n        for df in self.dataframes:\n            if df.shape:\n                repr_out += \"\\n    {} [Rows: {}, Columns: {}]\".format(\n                    df.ww.name,\n                    df.shape[0],\n                    df.shape[1],\n                )\n            else:\n                repr_out += \"\\n    {} [Rows: None, Columns: None]\".format(df.ww.name)\n        repr_out += \"\\n  Relationships:\"\n\n        if len(self.relationships) == 0:\n            repr_out += \"\\n    No relationships\"\n\n        for r in self.relationships:\n            repr_out += \"\\n    %s.%s -> %s.%s\" % (\n                r._child_dataframe_name,\n                r._child_column_name,\n                r._parent_dataframe_name,\n                r._parent_column_name,\n            )\n\n        return repr_out\n\n    def add_relationships(self, relationships):\n        \"\"\"Add multiple new relationships to a entityset\n\n        Args:\n            relationships (list[tuple(str, str, str, str)] or list[Relationship]) : List of\n    
            new relationships to add. Relationships are specified either as a :class:`.Relationship`\n                object or a four element tuple identifying the parent and child columns:\n                (parent_dataframe_name, parent_column_name, child_dataframe_name, child_column_name)\n        \"\"\"\n        for rel in relationships:\n            if isinstance(rel, Relationship):\n                self.add_relationship(relationship=rel)\n            else:\n                self.add_relationship(*rel)\n        return self\n\n    def add_relationship(\n        self,\n        parent_dataframe_name=None,\n        parent_column_name=None,\n        child_dataframe_name=None,\n        child_column_name=None,\n        relationship=None,\n    ):\n        \"\"\"Add a new relationship between dataframes in the entityset. Relationships can be specified\n        by passing dataframe and columns names or by passing a :class:`.Relationship` object.\n\n        Args:\n            parent_dataframe_name (str): Name of the parent dataframe in the EntitySet. Must be specified\n                if relationship is not.\n            parent_column_name (str): Name of the parent column. Must be specified if relationship is not.\n            child_dataframe_name (str): Name of the child dataframe in the EntitySet. Must be specified\n                if relationship is not.\n            child_column_name (str): Name of the child column. Must be specified if relationship is not.\n            relationship (Relationship): Instance of new relationship to be added. 
Must be specified\n                if dataframe and column names are not supplied.\n        \"\"\"\n        if relationship and (\n            parent_dataframe_name\n            or parent_column_name\n            or child_dataframe_name\n            or child_column_name\n        ):\n            raise ValueError(\n                \"Cannot specify dataframe and column name values and also supply a Relationship\",\n            )\n\n        if not relationship:\n            relationship = Relationship(\n                self,\n                parent_dataframe_name,\n                parent_column_name,\n                child_dataframe_name,\n                child_column_name,\n            )\n        if relationship in self.relationships:\n            warnings.warn(\"Not adding duplicate relationship: \" + str(relationship))\n            return self\n\n        # _operations?\n\n        # this is a new pair of dataframes\n        child_df = relationship.child_dataframe\n        child_column = relationship._child_column_name\n        if child_df.ww.index == child_column:\n            msg = \"Unable to add relationship because child column '{}' in '{}' is also its index\"\n            raise ValueError(msg.format(child_column, child_df.ww.name))\n        parent_df = relationship.parent_dataframe\n        parent_column = relationship._parent_column_name\n\n        if parent_df.ww.index != parent_column:\n            parent_df.ww.set_index(parent_column)\n\n        # Empty dataframes (as a result of accessing metadata)\n        # default to object dtypes for categorical columns, but\n        # indexes/foreign keys default to ints. 
In this case, we convert\n        # the empty column's type to int\n        if (\n            child_df.empty\n            and child_df[child_column].dtype == object\n            and parent_df.ww.columns[parent_column].is_numeric\n        ):\n            child_df.ww[child_column] = pd.Series(name=child_column, dtype=np.int64)\n\n        parent_ltype = parent_df.ww.logical_types[parent_column]\n        child_ltype = child_df.ww.logical_types[child_column]\n        if parent_ltype != child_ltype:\n            difference_msg = \"\"\n            if str(parent_ltype) == str(child_ltype):\n                difference_msg = \"There is a conflict between the parameters. \"\n\n            warnings.warn(\n                f\"Logical type {child_ltype} for child column {child_column} does not match \"\n                f\"parent column {parent_column} logical type {parent_ltype}. {difference_msg}\"\n                \"Changing child logical type to match parent.\",\n            )\n            child_df.ww.set_types(logical_types={child_column: parent_ltype})\n\n        if \"foreign_key\" not in child_df.ww.semantic_tags[child_column]:\n            child_df.ww.add_semantic_tags({child_column: \"foreign_key\"})\n\n        self.relationships.append(relationship)\n        self.reset_data_description()\n        return self\n\n    def set_secondary_time_index(self, dataframe_name, secondary_time_index):\n        \"\"\"\n        Set the secondary time index for a dataframe in the EntitySet using its dataframe name.\n\n        Args:\n            dataframe_name (str) : name of the dataframe for which to set the secondary time index.\n            secondary_time_index (dict[str-> list[str]]): Name of column containing time data to\n                be used as a secondary time index mapped to a list of the columns in the dataframe\n                associated with that secondary time index.\n        \"\"\"\n        dataframe = self[dataframe_name]\n        
self._set_secondary_time_index(dataframe, secondary_time_index)\n\n    def _set_secondary_time_index(self, dataframe, secondary_time_index):\n        \"\"\"Sets the secondary time index for a Woodwork dataframe passed in\"\"\"\n        assert (\n            dataframe.ww.schema is not None\n        ), \"Cannot set secondary time index if Woodwork is not initialized\"\n        self._check_secondary_time_index(dataframe, secondary_time_index)\n        if secondary_time_index is not None:\n            dataframe.ww.metadata[\"secondary_time_index\"] = secondary_time_index\n\n    ###########################################################################\n    #   Relationship access/helper methods  ###################################\n    ###########################################################################\n\n    def find_forward_paths(self, start_dataframe_name, goal_dataframe_name):\n        \"\"\"\n        Generator which yields all forward paths between a start and goal\n        dataframe. Does not include paths which contain cycles.\n\n        Args:\n            start_dataframe_name (str) : Name of dataframe to start the search from.\n            goal_dataframe_name  (str) : Name of dataframe to find forward path to.\n\n        See Also:\n            :func:`EntitySet.find_backward_paths`\n        \"\"\"\n        for sub_dataframe_name, path in self._forward_dataframe_paths(\n            start_dataframe_name,\n        ):\n            if sub_dataframe_name == goal_dataframe_name:\n                yield path\n\n    def find_backward_paths(self, start_dataframe_name, goal_dataframe_name):\n        \"\"\"\n        Generator which yields all backward paths between a start and goal\n        dataframe. 
Does not include paths which contain cycles.\n\n        Args:\n            start_dataframe_name (str) : Name of dataframe to start the search from.\n            goal_dataframe_name  (str) : Name of dataframe to find backward path to.\n\n        See Also:\n            :func:`EntitySet.find_forward_paths`\n        \"\"\"\n        for path in self.find_forward_paths(goal_dataframe_name, start_dataframe_name):\n            # Reverse path\n            yield path[::-1]\n\n    def _forward_dataframe_paths(self, start_dataframe_name, seen_dataframes=None):\n        \"\"\"\n        Generator which yields the names of all dataframes connected through forward\n        relationships, and the path taken to each. A dataframe will be yielded\n        multiple times if there are multiple paths to it.\n\n        Implemented using depth-first search.\n        \"\"\"\n        if seen_dataframes is None:\n            seen_dataframes = set()\n\n        if start_dataframe_name in seen_dataframes:\n            return\n\n        seen_dataframes.add(start_dataframe_name)\n\n        yield start_dataframe_name, []\n\n        for relationship in self.get_forward_relationships(start_dataframe_name):\n            next_dataframe = relationship._parent_dataframe_name\n            # Copy seen dataframes for each next node to allow multiple paths (but\n            # not cycles).\n            descendants = self._forward_dataframe_paths(\n                next_dataframe,\n                seen_dataframes.copy(),\n            )\n            for sub_dataframe_name, sub_path in descendants:\n                yield sub_dataframe_name, [relationship] + sub_path\n\n    def get_forward_dataframes(self, dataframe_name, deep=False):\n        \"\"\"\n        Get dataframes that are in a forward relationship with dataframe\n\n        Args:\n            dataframe_name (str): Name of dataframe to search from.\n            deep (bool): if True, recursively find forward dataframes.\n\n        Yields a tuple of 
(descendent_name, path from dataframe_name to descendant).\n        \"\"\"\n        for relationship in self.get_forward_relationships(dataframe_name):\n            parent_dataframe_name = relationship._parent_dataframe_name\n            direct_path = RelationshipPath([(True, relationship)])\n            yield parent_dataframe_name, direct_path\n\n            if deep:\n                sub_dataframes = self.get_forward_dataframes(\n                    parent_dataframe_name,\n                    deep=True,\n                )\n                for sub_dataframe_name, path in sub_dataframes:\n                    yield sub_dataframe_name, direct_path + path\n\n    def get_backward_dataframes(self, dataframe_name, deep=False):\n        \"\"\"\n        Get dataframes that are in a backward relationship with dataframe\n\n        Args:\n            dataframe_name (str): Name of dataframe to search from.\n            deep (bool): if True, recursively find backward dataframes.\n\n        Yields a tuple of (descendent_name, path from dataframe_name to descendant).\n        \"\"\"\n        for relationship in self.get_backward_relationships(dataframe_name):\n            child_dataframe_name = relationship._child_dataframe_name\n            direct_path = RelationshipPath([(False, relationship)])\n            yield child_dataframe_name, direct_path\n\n            if deep:\n                sub_dataframes = self.get_backward_dataframes(\n                    child_dataframe_name,\n                    deep=True,\n                )\n                for sub_dataframe_name, path in sub_dataframes:\n                    yield sub_dataframe_name, direct_path + path\n\n    def get_forward_relationships(self, dataframe_name):\n        \"\"\"Get relationships where dataframe \"dataframe_name\" is the child\n\n        Args:\n            dataframe_name (str): Name of dataframe to get relationships for.\n\n        Returns:\n            list[:class:`.Relationship`]: List of forward 
relationships.\n        \"\"\"\n        return [\n            r for r in self.relationships if r._child_dataframe_name == dataframe_name\n        ]\n\n    def get_backward_relationships(self, dataframe_name):\n        \"\"\"\n        Get relationships where dataframe \"dataframe_name\" is the parent.\n\n        Args:\n            dataframe_name (str): Name of dataframe to get relationships for.\n\n        Returns:\n            list[:class:`.Relationship`]: List of backward relationships.\n        \"\"\"\n        return [\n            r for r in self.relationships if r._parent_dataframe_name == dataframe_name\n        ]\n\n    def has_unique_forward_path(self, start_dataframe_name, end_dataframe_name):\n        \"\"\"\n        Is the forward path from start to end unique?\n\n        Raises StopIteration if there is no such path.\n        \"\"\"\n        paths = self.find_forward_paths(start_dataframe_name, end_dataframe_name)\n\n        next(paths)\n        second_path = next(paths, None)\n\n        return not second_path\n\n    ###########################################################################\n    #  DataFrame creation methods  ##############################################\n    ###########################################################################\n\n    def add_dataframe(\n        self,\n        dataframe,\n        dataframe_name=None,\n        index=None,\n        logical_types=None,\n        semantic_tags=None,\n        make_index=False,\n        time_index=None,\n        secondary_time_index=None,\n        already_sorted=False,\n    ):\n        \"\"\"\n        Add a DataFrame to the EntitySet with Woodwork typing information.\n\n        Args:\n            dataframe (pandas.DataFrame) : Dataframe containing the data.\n\n            dataframe_name (str, optional) : Unique name to associate with this dataframe. 
Must be\n                provided if Woodwork is not initialized on the input DataFrame.\n\n            index (str, optional): Name of the column used to index the dataframe.\n                Must be unique. If None, take the first column.\n\n            logical_types (dict[str -> Woodwork.LogicalTypes/str, optional]):\n                Keys are column names and values are logical types. Will be inferred if not specified.\n\n            semantic_tags (dict[str -> str/set], optional):\n                Keys are column names and values are semantic tags.\n\n            make_index (bool, optional) : If True, assume index does not\n                exist as a column in dataframe, and create a new column of that name\n                using integers. Otherwise, assume index exists.\n\n            time_index (str, optional): Name of the column containing\n                time data. Type must be numeric or datetime in nature.\n\n            secondary_time_index (dict[str -> list[str]]): Name of column containing time data to\n                be used as a secondary time index mapped to a list of the columns in the dataframe\n                associated with that secondary time index.\n\n            already_sorted (bool, optional) : If True, assumes that input dataframe\n                is already sorted by time. Defaults to False.\n\n        Notes:\n\n            Will infer logical types from the data.\n\n        Example:\n            .. 
ipython:: python\n\n                import featuretools as ft\n                import pandas as pd\n                transactions_df = pd.DataFrame({\"id\": [1, 2, 3, 4, 5, 6],\n                                                \"session_id\": [1, 2, 1, 3, 4, 5],\n                                                \"amount\": [100.40, 20.63, 33.32, 13.12, 67.22, 1.00],\n                                                \"transaction_time\": pd.date_range(start=\"10:00\", periods=6, freq=\"10s\"),\n                                                \"fraud\": [True, False, True, False, True, True]})\n                es = ft.EntitySet(\"example\")\n                es.add_dataframe(dataframe_name=\"transactions\",\n                                 index=\"id\",\n                                 time_index=\"transaction_time\",\n                                 dataframe=transactions_df)\n\n                es[\"transactions\"]\n\n        \"\"\"\n        logical_types = logical_types or {}\n        semantic_tags = semantic_tags or {}\n\n        if len(self.dataframes) > 0:\n            if not isinstance(dataframe, type(self.dataframes[0])):\n                raise ValueError(\n                    \"All dataframes must be of the same type. 
\"\n                    \"Cannot add dataframe of type {} to an entityset with existing dataframes \"\n                    \"of type {}\".format(type(dataframe), type(self.dataframes[0])),\n                )\n\n        # Only allow string column names\n        non_string_names = [\n            name for name in dataframe.columns if not isinstance(name, str)\n        ]\n        if non_string_names:\n            raise ValueError(\n                \"All column names must be strings (Columns {} \"\n                \"are not strings)\".format(non_string_names),\n            )\n\n        if dataframe.ww.schema is None:\n            if dataframe_name is None:\n                raise ValueError(\n                    \"Cannot add dataframe to EntitySet without a name. \"\n                    \"Please provide a value for the dataframe_name parameter.\",\n                )\n\n            index_was_created, index, dataframe = _get_or_create_index(\n                index,\n                make_index,\n                dataframe,\n            )\n\n            dataframe.ww.init(\n                name=dataframe_name,\n                index=index,\n                time_index=time_index,\n                logical_types=logical_types,\n                semantic_tags=semantic_tags,\n                already_sorted=already_sorted,\n            )\n            if index_was_created:\n                dataframe.ww.metadata[\"created_index\"] = index\n\n        else:\n            if dataframe.ww.name is None:\n                raise ValueError(\n                    \"Cannot add a Woodwork DataFrame to EntitySet without a name\",\n                )\n            if dataframe.ww.index is None:\n                raise ValueError(\n                    \"Cannot add Woodwork DataFrame to EntitySet without index\",\n                )\n\n            extra_params = []\n            if index is not None:\n                extra_params.append(\"index\")\n            if time_index is not None:\n                
extra_params.append(\"time_index\")\n            if logical_types:\n                extra_params.append(\"logical_types\")\n            if make_index:\n                extra_params.append(\"make_index\")\n            if semantic_tags:\n                extra_params.append(\"semantic_tags\")\n            if already_sorted:\n                extra_params.append(\"already_sorted\")\n            if dataframe_name is not None and dataframe_name != dataframe.ww.name:\n                extra_params.append(\"dataframe_name\")\n            if extra_params:\n                warnings.warn(\n                    \"A Woodwork-initialized DataFrame was provided, so the following parameters were ignored: \"\n                    + \", \".join(extra_params),\n                )\n\n        if dataframe.ww.time_index is not None:\n            self._check_uniform_time_index(dataframe)\n            self._check_secondary_time_index(dataframe)\n\n        if secondary_time_index:\n            self._set_secondary_time_index(\n                dataframe,\n                secondary_time_index=secondary_time_index,\n            )\n\n        dataframe = self._normalize_values(dataframe)\n\n        self.dataframe_dict[dataframe.ww.name] = dataframe\n        self.reset_data_description()\n        self._add_references_to_metadata(dataframe)\n\n        return self\n\n    def __setitem__(self, key, value):\n        self.add_dataframe(dataframe=value, dataframe_name=key)\n\n    def normalize_dataframe(\n        self,\n        base_dataframe_name,\n        new_dataframe_name,\n        index,\n        additional_columns=None,\n        copy_columns=None,\n        make_time_index=None,\n        make_secondary_time_index=None,\n        new_dataframe_time_index=None,\n        new_dataframe_secondary_time_index=None,\n    ):\n        \"\"\"Create a new dataframe and relationship from unique values of an existing column.\n\n        Args:\n            base_dataframe_name (str) : Dataframe name from which to 
split.\n\n            new_dataframe_name (str): Name of the new dataframe.\n\n            index (str): Column in the base dataframe\n                that will become index of new dataframe. Relationship\n                will be created across this column.\n\n            additional_columns (list[str]):\n                List of column names to remove from\n                base_dataframe and move to new dataframe.\n\n            copy_columns (list[str]): List of\n                column names to copy from the base dataframe\n                into the new dataframe.\n\n            make_time_index (bool or str, optional): Create time index for new dataframe based\n                on time index in base_dataframe, optionally specifying which column in base_dataframe\n                to use for time_index. If specified as True without a specific column name,\n                uses the primary time index. Defaults to True if base dataframe has a time index.\n\n            make_secondary_time_index (dict[str -> list[str]], optional): Create a secondary time index\n                from key. Values of dictionary are the columns to associate with a secondary time index.\n                Only one secondary time index is allowed. 
If None, only associate the time index.\n\n            new_dataframe_time_index (str, optional): Rename new dataframe time index.\n\n            new_dataframe_secondary_time_index (str, optional): Rename new dataframe secondary time index.\n\n        \"\"\"\n        base_dataframe = self.dataframe_dict[base_dataframe_name]\n        additional_columns = additional_columns or []\n        copy_columns = copy_columns or []\n\n        for list_name, col_list in {\n            \"copy_columns\": copy_columns,\n            \"additional_columns\": additional_columns,\n        }.items():\n            if not isinstance(col_list, list):\n                raise TypeError(\n                    \"'{}' must be a list, but received type {}\".format(\n                        list_name,\n                        type(col_list),\n                    ),\n                )\n            if len(col_list) != len(set(col_list)):\n                raise ValueError(\n                    f\"'{list_name}' contains duplicate columns. All columns must be unique.\",\n                )\n            for col_name in col_list:\n                if col_name == index:\n                    raise ValueError(\n                        \"Not adding {} as both index and column in {}\".format(\n                            col_name,\n                            list_name,\n                        ),\n                    )\n\n        for col in additional_columns:\n            if col == base_dataframe.ww.time_index:\n                raise ValueError(\n                    \"Not moving {} as it is the base time index column. 
Perhaps, move the column to the copy_columns.\".format(\n                        col,\n                    ),\n                )\n\n        if isinstance(make_time_index, str):\n            if make_time_index not in base_dataframe.columns:\n                raise ValueError(\n                    \"'make_time_index' must be a column in the base dataframe\",\n                )\n            elif make_time_index not in additional_columns + copy_columns:\n                raise ValueError(\n                    \"'make_time_index' must be specified in 'additional_columns' or 'copy_columns'\",\n                )\n        if index == base_dataframe.ww.index:\n            raise ValueError(\n                \"'index' must be different from the index column of the base dataframe\",\n            )\n\n        transfer_types = {}\n        # Types will be a tuple of (logical_type, semantic_tags, column_metadata, column_description)\n        transfer_types[index] = (\n            base_dataframe.ww.logical_types[index],\n            base_dataframe.ww.semantic_tags[index],\n            base_dataframe.ww.columns[index].metadata,\n            base_dataframe.ww.columns[index].description,\n        )\n        for col_name in additional_columns + copy_columns:\n            # Remove any existing time index tags\n            transfer_types[col_name] = (\n                base_dataframe.ww.logical_types[col_name],\n                (base_dataframe.ww.semantic_tags[col_name] - {\"time_index\"}),\n                base_dataframe.ww.columns[col_name].metadata,\n                base_dataframe.ww.columns[col_name].description,\n            )\n\n        # create and add new dataframe\n        new_dataframe = self[base_dataframe_name].copy()\n\n        if make_time_index is None and base_dataframe.ww.time_index is not None:\n            make_time_index = True\n\n        if isinstance(make_time_index, str):\n            # Set the new time index to make_time_index.\n            base_time_index = 
make_time_index\n            new_dataframe_time_index = make_time_index\n            already_sorted = new_dataframe_time_index == base_dataframe.ww.time_index\n        elif make_time_index:\n            # Create a new time index based on the base dataframe time index.\n            base_time_index = base_dataframe.ww.time_index\n            if new_dataframe_time_index is None:\n                new_dataframe_time_index = \"first_%s_time\" % (base_dataframe.ww.name)\n\n            already_sorted = True\n\n            assert (\n                base_dataframe.ww.time_index is not None\n            ), \"Base dataframe doesn't have time_index defined\"\n\n            if base_time_index not in [col for col in copy_columns]:\n                copy_columns.append(base_time_index)\n\n                time_index_types = (\n                    base_dataframe.ww.logical_types[base_dataframe.ww.time_index],\n                    base_dataframe.ww.semantic_tags[base_dataframe.ww.time_index],\n                    base_dataframe.ww.columns[base_dataframe.ww.time_index].metadata,\n                    base_dataframe.ww.columns[base_dataframe.ww.time_index].description,\n                )\n            else:\n                # If base_time_index is in copy_columns then we've already added the transfer types\n                # but since we're changing the name, we have to remove it\n                time_index_types = transfer_types[base_dataframe.ww.time_index]\n                del transfer_types[base_dataframe.ww.time_index]\n\n            transfer_types[new_dataframe_time_index] = time_index_types\n\n        else:\n            new_dataframe_time_index = None\n            already_sorted = False\n\n        if new_dataframe_time_index is not None and new_dataframe_time_index == index:\n            raise ValueError(\n                \"time_index and index cannot be the same value, %s\"\n                % (new_dataframe_time_index),\n            )\n\n        selected_columns = (\n            
[index]\n            + [col for col in additional_columns]\n            + [col for col in copy_columns]\n        )\n\n        new_dataframe = new_dataframe.dropna(subset=[index])\n        new_dataframe2 = new_dataframe.drop_duplicates(index, keep=\"first\")[\n            selected_columns\n        ]\n\n        if make_time_index:\n            new_dataframe2 = new_dataframe2.rename(\n                columns={base_time_index: new_dataframe_time_index},\n            )\n        if make_secondary_time_index:\n            assert (\n                len(make_secondary_time_index) == 1\n            ), \"Can only provide 1 secondary time index\"\n            secondary_time_index = list(make_secondary_time_index.keys())[0]\n\n            secondary_columns = [index, secondary_time_index] + list(\n                make_secondary_time_index.values(),\n            )[0]\n            secondary_df = new_dataframe.drop_duplicates(index, keep=\"last\")[\n                secondary_columns\n            ]\n            if new_dataframe_secondary_time_index:\n                secondary_df = secondary_df.rename(\n                    columns={secondary_time_index: new_dataframe_secondary_time_index},\n                )\n                secondary_time_index = new_dataframe_secondary_time_index\n            else:\n                new_dataframe_secondary_time_index = secondary_time_index\n            secondary_df = secondary_df.set_index(index)\n            new_dataframe = new_dataframe2.join(secondary_df, on=index)\n        else:\n            new_dataframe = new_dataframe2\n\n        base_dataframe_index = index\n\n        if make_secondary_time_index:\n            old_ti_name = list(make_secondary_time_index.keys())[0]\n            ti_cols = list(make_secondary_time_index.values())[0]\n            ti_cols = [c if c != old_ti_name else secondary_time_index for c in ti_cols]\n            make_secondary_time_index = {secondary_time_index: ti_cols}\n\n        # will initialize Woodwork on this 
DataFrame\n        logical_types = {}\n        semantic_tags = {}\n        column_metadata = {}\n        column_descriptions = {}\n        for col_name, (ltype, tags, metadata, description) in transfer_types.items():\n            logical_types[col_name] = ltype\n            semantic_tags[col_name] = tags - {\"time_index\"}\n            column_metadata[col_name] = copy.deepcopy(metadata)\n            column_descriptions[col_name] = description\n\n        new_dataframe.ww.init(\n            name=new_dataframe_name,\n            index=index,\n            already_sorted=already_sorted,\n            time_index=new_dataframe_time_index,\n            logical_types=logical_types,\n            semantic_tags=semantic_tags,\n            column_metadata=column_metadata,\n            column_descriptions=column_descriptions,\n        )\n\n        self.add_dataframe(\n            new_dataframe,\n            secondary_time_index=make_secondary_time_index,\n        )\n\n        self.dataframe_dict[base_dataframe_name] = self.dataframe_dict[\n            base_dataframe_name\n        ].ww.drop(additional_columns)\n\n        self.dataframe_dict[base_dataframe_name].ww.add_semantic_tags(\n            {base_dataframe_index: \"foreign_key\"},\n        )\n\n        self.add_relationship(\n            new_dataframe_name,\n            index,\n            base_dataframe_name,\n            base_dataframe_index,\n        )\n        self.reset_data_description()\n        return self\n\n    # ###########################################################################\n    # #  Data wrangling methods  ###############################################\n    # ###########################################################################\n\n    def concat(self, other, inplace=False):\n        \"\"\"Combine entityset with another to create a new entityset with the\n        combined data of both entitysets.\n        \"\"\"\n        if not self.__eq__(other):\n            raise ValueError(\n                
\"Entitysets must have the same dataframes, relationships\"\n                \", and column names\",\n            )\n\n        if inplace:\n            combined_es = self\n        else:\n            combined_es = copy.deepcopy(self)\n\n        has_last_time_index = []\n        for df in self.dataframes:\n            self_df = df\n            other_df = other[df.ww.name]\n            combined_df = pd.concat([self_df, other_df])\n            # If both DataFrames have made indexes, there will likely\n            # be overlap in the index column, so we use the other values\n            if self_df.ww.metadata.get(\"created_index\") or other_df.ww.metadata.get(\n                \"created_index\",\n            ):\n                columns = [\n                    col\n                    for col in combined_df.columns\n                    if col != df.ww.index or col != df.ww.time_index\n                ]\n            else:\n                columns = [df.ww.index]\n            combined_df.drop_duplicates(columns, inplace=True)\n\n            self_lti_col = df.ww.metadata.get(\"last_time_index\")\n            other_lti_col = other[df.ww.name].ww.metadata.get(\"last_time_index\")\n            if self_lti_col is not None or other_lti_col is not None:\n                has_last_time_index.append(df.ww.name)\n\n            combined_es.replace_dataframe(\n                dataframe_name=df.ww.name,\n                df=combined_df,\n                recalculate_last_time_indexes=False,\n                already_sorted=False,\n            )\n\n        if has_last_time_index:\n            combined_es.add_last_time_indexes(updated_dataframes=has_last_time_index)\n\n        combined_es.reset_data_description()\n\n        return combined_es\n\n    ###########################################################################\n    #  Indexing methods  ###############################################\n    ###########################################################################\n    def 
add_last_time_indexes(self, updated_dataframes=None):\n        \"\"\"\n        Calculates the last time index values for each dataframe (the last time\n        an instance or children of that instance were observed).  Used when\n        calculating features using training windows. Adds the last time index as\n        a series named _ft_last_time on the dataframe.\n\n        Args:\n            updated_dataframes (list[str]): List of dataframe names to update last_time_index for\n                (will update all parents of those dataframes as well)\n        \"\"\"\n        # Generate graph of dataframes to find leaf dataframes\n        children = defaultdict(list)  # parent --> child mapping\n        child_cols = defaultdict(dict)\n        for r in self.relationships:\n            children[r._parent_dataframe_name].append(r.child_dataframe)\n            child_cols[r._parent_dataframe_name][r._child_dataframe_name] = (\n                r.child_column\n            )\n\n        updated_dataframes = updated_dataframes or []\n        if updated_dataframes:\n            # find parents of updated_dataframes\n            parent_queue = updated_dataframes[:]\n            parents = set()\n            while len(parent_queue):\n                df_name = parent_queue.pop(0)\n                if df_name in parents:\n                    continue\n                parents.add(df_name)\n\n                for parent_name, _ in self.get_forward_dataframes(df_name):\n                    parent_queue.append(parent_name)\n\n            queue = [self[p] for p in parents]\n            to_explore = parents\n        else:\n            to_explore = set(self.dataframe_dict.keys())\n            queue = self.dataframes[:]\n\n        explored = set()\n        # Store the last time indexes for the entire entityset in a dictionary to update\n        es_lti_dict = {}\n        for df in self.dataframes:\n            lti_col = df.ww.metadata.get(\"last_time_index\")\n            if lti_col is not None:\n 
               lti_col = df[lti_col]\n            es_lti_dict[df.ww.name] = lti_col\n\n        for df in queue:\n            es_lti_dict[df.ww.name] = None\n\n        # We will explore children of dataframes on the queue,\n        # which may not be in the to_explore set. Therefore,\n        # we check whether all elements of to_explore are in\n        # explored, rather than just comparing length\n        while not to_explore.issubset(explored):\n            dataframe = queue.pop(0)\n\n            if es_lti_dict[dataframe.ww.name] is None:\n                if dataframe.ww.time_index is not None:\n                    lti = dataframe[dataframe.ww.time_index].copy()\n                else:\n                    lti = dataframe.ww[dataframe.ww.index].copy()\n                    # Cannot have a category dtype with nans when calculating last time index\n                    lti = lti.astype(\"object\")\n                    lti[:] = None\n\n                es_lti_dict[dataframe.ww.name] = lti\n\n            if dataframe.ww.name in children:\n                child_dataframes = children[dataframe.ww.name]\n\n                # if all children not explored, skip for now\n                if not set([df.ww.name for df in child_dataframes]).issubset(explored):\n                    # Now there is a possibility that a child dataframe\n                    # was not explicitly provided in updated_dataframes,\n                    # and never made it onto the queue. 
If updated_dataframes\n                    # is None then we just load all dataframes onto the queue\n                    # so we didn't need this logic\n                    for df in child_dataframes:\n                        if df.ww.name not in explored and df.ww.name not in [\n                            q.ww.name for q in queue\n                        ]:\n                            # must also reset last time index here\n                            es_lti_dict[df.ww.name] = None\n                            queue.append(df)\n                    queue.append(dataframe)\n                    continue\n\n                # updated last time from all children\n                for child_df in child_dataframes:\n                    if es_lti_dict[child_df.ww.name] is None:\n                        continue\n                    link_col = child_cols[dataframe.ww.name][child_df.ww.name].name\n\n                    lti_df = pd.DataFrame(\n                        {\n                            \"last_time\": es_lti_dict[child_df.ww.name],\n                            dataframe.ww.index: child_df[link_col],\n                        },\n                    )\n\n                    # sort by time and keep only the most recent\n                    lti_df.sort_values(\n                        [\"last_time\", dataframe.ww.index],\n                        kind=\"mergesort\",\n                        inplace=True,\n                    )\n\n                    lti_df.drop_duplicates(\n                        dataframe.ww.index,\n                        keep=\"last\",\n                        inplace=True,\n                    )\n\n                    lti_df.set_index(dataframe.ww.index, inplace=True)\n                    lti_df = lti_df.reindex(es_lti_dict[dataframe.ww.name].index)\n                    lti_df[\"last_time_old\"] = es_lti_dict[dataframe.ww.name]\n                    if lti_df.empty:\n                        # Pandas errors out if it tries to do fillna and then max 
on an empty dataframe\n                        lti_df = pd.Series([], dtype=\"object\")\n                    else:\n                        lti_df[\"last_time\"] = lti_df[\"last_time\"].astype(\n                            \"datetime64[ns]\",\n                        )\n                        lti_df[\"last_time_old\"] = lti_df[\"last_time_old\"].astype(\n                            \"datetime64[ns]\",\n                        )\n                        lti_df = lti_df.fillna(\n                            pd.to_datetime(\"1800-01-01 00:00\"),\n                        ).max(axis=1)\n                        lti_df = lti_df.replace(\n                            pd.to_datetime(\"1800-01-01 00:00\"),\n                            pd.NaT,\n                        )\n\n                    es_lti_dict[dataframe.ww.name] = lti_df\n                    es_lti_dict[dataframe.ww.name].name = \"last_time\"\n\n            explored.add(dataframe.ww.name)\n\n        # Store the last time index on the DataFrames\n        dfs_to_update = {}\n        for df in self.dataframes:\n            lti = es_lti_dict[df.ww.name]\n            if lti is not None:\n                if self.time_type == \"numeric\":\n                    if lti.dtype == \"datetime64[ns]\":\n                        # Woodwork cannot convert from datetime to numeric\n                        lti = lti.apply(lambda x: x.value)\n                    lti = init_series(lti, logical_type=\"Double\")\n                else:\n                    lti = init_series(lti, logical_type=\"Datetime\")\n\n                lti.name = LTI_COLUMN_NAME\n\n                if LTI_COLUMN_NAME in df.columns:\n                    if \"last_time_index\" in df.ww.semantic_tags[LTI_COLUMN_NAME]:\n                        # Remove any previous last time index placed by featuretools\n                        df.ww.pop(LTI_COLUMN_NAME)\n                    else:\n                        raise ValueError(\n                            \"Cannot add a last 
time index on DataFrame with an existing \"\n                            f\"'{LTI_COLUMN_NAME}' column. Please rename '{LTI_COLUMN_NAME}'.\",\n                        )\n\n                # Add the new column to the DataFrame\n                df.ww[LTI_COLUMN_NAME] = lti\n                if \"last_time_index\" not in df.ww.semantic_tags[LTI_COLUMN_NAME]:\n                    df.ww.add_semantic_tags({LTI_COLUMN_NAME: \"last_time_index\"})\n                df.ww.metadata[\"last_time_index\"] = LTI_COLUMN_NAME\n\n        for df in dfs_to_update.values():\n            df.ww.add_semantic_tags({LTI_COLUMN_NAME: \"last_time_index\"})\n            df.ww.metadata[\"last_time_index\"] = LTI_COLUMN_NAME\n            self.dataframe_dict[df.ww.name] = df\n\n        self.reset_data_description()\n        for df in self.dataframes:\n            self._add_references_to_metadata(df)\n\n    # ###########################################################################\n    # #  Pickling ###############################################\n    # ###########################################################################\n    def __getstate__(self):\n        return {\n            **self.__dict__,\n            WW_SCHEMA_KEY: {\n                df_name: df.ww.schema for df_name, df in self.dataframe_dict.items()\n            },\n        }\n\n    def __setstate__(self, state):\n        ww_schemas = state.pop(WW_SCHEMA_KEY)\n        for df_name, df in state.get(\"dataframe_dict\", {}).items():\n            if ww_schemas[df_name] is not None:\n                df.ww.init(schema=ww_schemas[df_name], validate=False)\n\n        self.__dict__.update(state)\n\n    # ###########################################################################\n    # #  Other ###############################################\n    # ###########################################################################\n    def add_interesting_values(\n        self,\n        max_values=5,\n        verbose=False,\n        
dataframe_name=None,\n        values=None,\n    ):\n        \"\"\"Find or set interesting values for categorical columns, to be used to generate \"where\" clauses\n\n        Args:\n            max_values (int) : Maximum number of values per column to add.\n            verbose (bool) : If True, print summary of interesting values found.\n            dataframe_name (str) : The dataframe in the EntitySet for which to add interesting values.\n                If not specified interesting values will be added for all dataframes.\n            values (dict): A dictionary mapping column names to the interesting values to set\n                for the column. If specified, a corresponding dataframe_name must also be provided.\n                If not specified, interesting values will be set for all eligible columns. If values\n                are specified, max_values and verbose parameters will be ignored.\n\n        Returns:\n            None\n\n        \"\"\"\n        if dataframe_name is None and values is not None:\n            raise ValueError(\"dataframe_name must be specified if values are provided\")\n\n        if dataframe_name is not None and values is not None:\n            for column, vals in values.items():\n                self[dataframe_name].ww.columns[column].metadata[\n                    \"interesting_values\"\n                ] = vals\n            return\n\n        if dataframe_name:\n            dataframes = [self[dataframe_name]]\n        else:\n            dataframes = self.dataframes\n\n        def add_value(df, col, val, verbose):\n            if verbose:\n                msg = \"Column {}: Marking {} as an interesting value\"\n                logger.info(msg.format(col, val))\n            interesting_vals = df.ww.columns[col].metadata.get(\"interesting_values\", [])\n            interesting_vals.append(val)\n            df.ww.columns[col].metadata[\"interesting_values\"] = interesting_vals\n\n        for df in dataframes:\n            value_counts = 
df.ww.value_counts(top_n=max(25, max_values), dropna=True)\n            total_count = len(df)\n\n            for col, counts in value_counts.items():\n                if {\"index\", \"foreign_key\"}.intersection(df.ww.semantic_tags[col]):\n                    continue\n\n                for i in range(min(max_values, len(counts))):\n                    # Categorical columns will include counts of 0 for all values\n                    # in categories. Stop when we encounter a 0 count.\n                    if counts[i][\"count\"] == 0:\n                        break\n                    if len(counts) < 25:\n                        value = counts[i][\"value\"]\n                        add_value(df, col, value, verbose)\n                    else:\n                        fraction = counts[i][\"count\"] / total_count\n                        if fraction > 0.05 and fraction < 0.95:\n                            value = counts[i][\"value\"]\n                            add_value(df, col, value, verbose)\n                        else:\n                            break\n\n        self.reset_data_description()\n\n    def plot(self, to_file=None):\n        \"\"\"\n        Create a UML diagram-ish graph of the EntitySet.\n\n        Args:\n            to_file (str, optional) : Path to where the plot should be saved.\n                If set to None (as by default), the plot will not be saved.\n\n        Returns:\n            graphviz.Digraph : Graph object that can directly be displayed in\n                Jupyter notebooks. 
Nodes of the graph correspond to the DataFrames\n                in the EntitySet, showing the typing information for each column.\n\n        Note:\n            The typing information displayed for each column is based off of the Woodwork\n            ColumnSchema for that column and is represented as ``LogicalType; semantic_tags``,\n            but the standard semantic tags have been removed for brevity.\n        \"\"\"\n        graphviz = check_graphviz()\n        format_ = get_graphviz_format(graphviz=graphviz, to_file=to_file)\n\n        # Initialize a new directed graph\n        graph = graphviz.Digraph(\n            self.id,\n            format=format_,\n            graph_attr={\"splines\": \"ortho\"},\n        )\n\n        # Draw dataframes\n        for df in self.dataframes:\n            column_typing_info = []\n            for col_name, col_schema in df.ww.columns.items():\n                col_string = col_name + \" : \" + str(col_schema.logical_type)\n\n                tags = col_schema.semantic_tags - col_schema.logical_type.standard_tags\n                if tags:\n                    col_string += \"; \"\n                    col_string += \", \".join(tags)\n                column_typing_info.append(col_string)\n\n            columns_string = \"\\l\".join(column_typing_info)  # noqa: W605\n            nrows = df.shape[0]\n            label = \"{%s (%d row%s)|%s\\l}\" % (  # noqa: W605\n                df.ww.name,\n                nrows,\n                \"s\" * (nrows > 1),\n                columns_string,\n            )\n            graph.node(df.ww.name, shape=\"record\", label=label)\n\n        # Draw relationships\n        for rel in self.relationships:\n            # Display the key only once if is the same for both related dataframes\n            if rel._parent_column_name == rel._child_column_name:\n                label = rel._parent_column_name\n            else:\n                label = \"%s -> %s\" % (rel._parent_column_name, 
rel._child_column_name)\n\n            graph.edge(\n                rel._child_dataframe_name,\n                rel._parent_dataframe_name,\n                xlabel=label,\n            )\n\n        if to_file:\n            save_graph(graph, to_file, format_)\n        return graph\n\n    def _handle_time(\n        self,\n        dataframe_name,\n        df,\n        time_last=None,\n        training_window=None,\n        include_cutoff_time=True,\n    ):\n        \"\"\"\n        Filter a dataframe for all instances before time_last.\n        If the dataframe does not have a time index, return the original\n        dataframe.\n        \"\"\"\n\n        schema = self[dataframe_name].ww.schema\n        if schema.time_index:\n            df_empty = df.empty\n            if time_last is not None and not df_empty:\n                if include_cutoff_time:\n                    df = df[df[schema.time_index] <= time_last]\n                else:\n                    df = df[df[schema.time_index] < time_last]\n                if training_window is not None:\n                    training_window = _check_timedelta(training_window)\n                    if include_cutoff_time:\n                        mask = df[schema.time_index] > time_last - training_window\n                    else:\n                        mask = df[schema.time_index] >= time_last - training_window\n                    lti_col = schema.metadata.get(\"last_time_index\")\n                    if lti_col is not None:\n                        if include_cutoff_time:\n                            lti_mask = df[lti_col] > time_last - training_window\n                        else:\n                            lti_mask = df[lti_col] >= time_last - training_window\n                        mask = mask | lti_mask\n                    else:\n                        warnings.warn(\n                            \"Using training_window but last_time_index is \"\n                            \"not set for dataframe %s\" % 
(dataframe_name),\n                        )\n\n                    df = df[mask]\n\n        secondary_time_indexes = schema.metadata.get(\"secondary_time_index\") or {}\n        for secondary_time_index, columns in secondary_time_indexes.items():\n            # should we use ignore time last here?\n            if time_last is not None and not df.empty:\n                mask = df[secondary_time_index] >= time_last\n                df.loc[mask, columns] = np.nan\n\n        return df\n\n    def query_by_values(\n        self,\n        dataframe_name,\n        instance_vals,\n        column_name=None,\n        columns=None,\n        time_last=None,\n        training_window=None,\n        include_cutoff_time=True,\n    ):\n        \"\"\"Query instances that have column with given value\n\n        Args:\n            dataframe_name (str): The id of the dataframe to query\n            instance_vals (pd.DataFrame, pd.Series, list[str] or str) :\n                Instance(s) to match.\n            column_name (str) : Column to query on. If None, query on index.\n            columns (list[str]) : Columns to return. Return all columns if None.\n            time_last (pd.Timestamp) : Query data up to and including this\n                time. Only applies if dataframe has a time index.\n            training_window (Timedelta, optional):\n                Window defining how much time before the cutoff time data\n                can be used when calculating features. 
If None, all data before cutoff time is used.\n            include_cutoff_time (bool):\n                If True, data at cutoff time are included in calculating features\n\n        Returns:\n            pd.DataFrame : instances that match constraints with ids in order of underlying dataframe\n        \"\"\"\n        dataframe = self[dataframe_name]\n        if not column_name:\n            column_name = dataframe.ww.index\n\n        instance_vals = _vals_to_series(instance_vals, column_name)\n\n        training_window = _check_timedelta(training_window)\n\n        if training_window is not None:\n            assert (\n                training_window.has_no_observations()\n            ), \"Training window cannot be in observations\"\n\n        if instance_vals is None:\n            df = dataframe.copy()\n\n        elif isinstance(instance_vals, pd.Series) and instance_vals.empty:\n            df = dataframe.head(0)\n\n        else:\n            df = dataframe[dataframe[column_name].isin(instance_vals)]\n            df = df.set_index(dataframe.ww.index, drop=False)\n\n            # ensure filtered df has same categories as original\n            # workaround for issue below\n            # github.com/pandas-dev/pandas/issues/22501#issuecomment-415982538\n            #\n            # Pandas claims that bug is fixed but it still shows up in some\n            # cases.  
More investigation needed.\n            if dataframe.ww.columns[column_name].is_categorical:\n                categories = pd.api.types.CategoricalDtype(\n                    categories=dataframe[column_name].cat.categories,\n                )\n                df[column_name] = df[column_name].astype(categories)\n\n        df = self._handle_time(\n            dataframe_name=dataframe_name,\n            df=df,\n            time_last=time_last,\n            training_window=training_window,\n            include_cutoff_time=include_cutoff_time,\n        )\n\n        if columns is not None:\n            df = df[columns]\n\n        return df\n\n    def replace_dataframe(\n        self,\n        dataframe_name,\n        df,\n        already_sorted=False,\n        recalculate_last_time_indexes=True,\n    ):\n        \"\"\"Replace the internal dataframe of an EntitySet table, keeping Woodwork typing information the same.\n        Optionally makes sure that data is sorted, that reference indexes to other dataframes are consistent,\n        and that last_time_indexes are updated to reflect the new data. 
If an index was created for the original\n        dataframe and is not present on the new dataframe, an index column of the same name will be added to the\n        new dataframe.\n        \"\"\"\n        if not isinstance(df, type(self[dataframe_name])):\n            raise TypeError(\"Incorrect DataFrame type used\")\n\n        # If the original DataFrame has a last time index column and the new one doesnt\n        # remove the column and the reference to last time index from that dataframe\n        last_time_index_column = self[dataframe_name].ww.metadata.get(\"last_time_index\")\n        if (\n            last_time_index_column is not None\n            and last_time_index_column not in df.columns\n        ):\n            self[dataframe_name].ww.pop(last_time_index_column)\n            del self[dataframe_name].ww.metadata[\"last_time_index\"]\n\n        # If the original DataFrame had an index created via make_index,\n        # we may need to remake the index if it's not in the new DataFrame\n        created_index = self[dataframe_name].ww.metadata.get(\"created_index\")\n        if created_index is not None and created_index not in df.columns:\n            df = _create_index(df, created_index)\n\n        old_column_names = list(self[dataframe_name].columns)\n        if len(df.columns) != len(old_column_names):\n            raise ValueError(\n                \"New dataframe contains {} columns, expecting {}\".format(\n                    len(df.columns),\n                    len(old_column_names),\n                ),\n            )\n        for col_name in old_column_names:\n            if col_name not in df.columns:\n                raise ValueError(\n                    \"New dataframe is missing new {} column\".format(col_name),\n                )\n\n        if df.ww.schema is not None:\n            warnings.warn(\n                \"Woodwork typing information on new dataframe will be replaced \"\n                f\"with existing typing information from 
{dataframe_name}\",\n            )\n\n        df.ww.init(\n            schema=self[dataframe_name].ww._schema,\n            already_sorted=already_sorted,\n        )\n        # Make sure column ordering matches original ordering\n        df = df.ww[old_column_names]\n\n        df = self._normalize_values(df)\n\n        self.dataframe_dict[dataframe_name] = df\n\n        if self[dataframe_name].ww.time_index is not None:\n            self._check_uniform_time_index(self[dataframe_name])\n\n        df_metadata = self[dataframe_name].ww.metadata\n        self.set_secondary_time_index(\n            dataframe_name,\n            df_metadata.get(\"secondary_time_index\"),\n        )\n        if recalculate_last_time_indexes and last_time_index_column is not None:\n            self.add_last_time_indexes(updated_dataframes=[dataframe_name])\n        self.reset_data_description()\n        self._add_references_to_metadata(df)\n\n    def _check_time_indexes(self):\n        for dataframe in self.dataframe_dict.values():\n            self._check_uniform_time_index(dataframe)\n            self._check_secondary_time_index(dataframe)\n\n    def _check_secondary_time_index(self, dataframe, secondary_time_index=None):\n        secondary_time_index = secondary_time_index or dataframe.ww.metadata.get(\n            \"secondary_time_index\",\n            {},\n        )\n\n        if secondary_time_index and dataframe.ww.time_index is None:\n            raise ValueError(\n                \"Cannot set secondary time index on a DataFrame that has no primary time index.\",\n            )\n\n        for time_index, columns in secondary_time_index.items():\n            self._check_uniform_time_index(dataframe, column_name=time_index)\n            if time_index not in columns:\n                columns.append(time_index)\n\n    def _check_uniform_time_index(self, dataframe, column_name=None):\n        column_name = column_name or dataframe.ww.time_index\n        if column_name is None:\n          
  return\n\n        time_type = self._get_time_type(dataframe, column_name)\n        if self.time_type is None:\n            self.time_type = time_type\n        elif self.time_type != time_type:\n            info = \"%s time index is %s type which differs from other entityset time indexes\"\n            raise TypeError(info % (dataframe.ww.name, time_type))\n\n    def _get_time_type(self, dataframe, column_name=None):\n        column_name = column_name or dataframe.ww.time_index\n\n        column_schema = dataframe.ww.columns[column_name]\n\n        time_type = None\n        if column_schema.is_numeric:\n            time_type = \"numeric\"\n        elif column_schema.is_datetime:\n            time_type = Datetime\n\n        if time_type is None:\n            info = \"%s time index not recognized as numeric or datetime\"\n            raise TypeError(info % dataframe.ww.name)\n        return time_type\n\n    def _add_references_to_metadata(self, dataframe):\n        dataframe.ww.metadata.update(entityset_id=self.id)\n        for column in dataframe.columns:\n            metadata = dataframe.ww._schema.columns[column].metadata\n            metadata.update(dataframe_name=dataframe.ww.name)\n            metadata.update(entityset_id=self.id)\n        _ES_REF[self.id] = self\n\n    def _normalize_values(self, dataframe):\n        def replace(x):\n            if not isinstance(x, (list, tuple, np.ndarray)) and pd.isna(x):\n                return (np.nan, np.nan)\n            else:\n                return x\n\n        for column, logical_type in dataframe.ww.logical_types.items():\n            if isinstance(logical_type, LatLong):\n                dataframe[column] = dataframe[column].apply(replace)\n        return dataframe\n\n\ndef _vals_to_series(instance_vals, column_id):\n    \"\"\"\n    instance_vals may be a pd.Dataframe, a pd.Series, a list, a single\n    value, or None. 
This function always returns a Series or None.\n    \"\"\"\n    if instance_vals is None:\n        return None\n\n    # If this is a single value, make it a list\n    if not hasattr(instance_vals, \"__iter__\"):\n        instance_vals = [instance_vals]\n\n    # convert iterable to pd.Series\n    if isinstance(instance_vals, pd.DataFrame):\n        out_vals = instance_vals[column_id]\n    else:\n        out_vals = pd.Series(instance_vals)\n\n    # no duplicates or NaN values\n    out_vals = out_vals.drop_duplicates().dropna()\n\n    # want index to have no name for the merge in query_by_values\n    out_vals.index.name = None\n\n    return out_vals\n\n\ndef _get_or_create_index(index, make_index, df):\n    \"\"\"Handles index creation logic based on user input\"\"\"\n    index_was_created = False\n\n    if index is None:\n        # Case 1: user wanted to make index but did not specify column name\n        assert not make_index, \"Must specify an index name if make_index is True\"\n        # Case 2: make_index not specified but no index supplied, use first column\n        warnings.warn(\n            (\n                \"Using first column as index. \"\n                \"To change this, specify the index parameter\"\n            ),\n        )\n        index = df.columns[0]\n    elif make_index and index in df.columns:\n        # Case 3: user wanted to make index but column already exists\n        raise RuntimeError(\n            f\"Cannot make index: column with name {index} already present\",\n        )\n    elif index not in df.columns:\n        if not make_index:\n            # Case 4: user names index, it is not in df. does not specify\n            # make_index.  
Make new index column and warn\n            warnings.warn(\n                \"index {} not found in dataframe, creating new \"\n                \"integer column\".format(index),\n            )\n        # Case 5: make_index with no errors or warnings\n        # (Case 4 also uses this code path)\n        df = _create_index(df, index)\n        index_was_created = True\n    # Case 6: user specified index, which is already in df. No action needed.\n    return index_was_created, index, df\n\n\ndef _create_index(df, index):\n    \"\"\"Insert an integer column named ``index`` as the first column of ``df``.\n\n    The dataframe is modified in place and also returned.\n    \"\"\"\n    df.insert(0, index, range(len(df)))\n    return df\n"
  },
  {
    "path": "featuretools/entityset/relationship.py",
    "content": "class Relationship(object):\n    \"\"\"Class to represent a relationship between dataframes\n\n    See Also:\n        :class:`.EntitySet`\n    \"\"\"\n\n    def __init__(\n        self,\n        entityset,\n        parent_dataframe_name,\n        parent_column_name,\n        child_dataframe_name,\n        child_column_name,\n    ):\n        \"\"\"Create a relationship\n\n        Args:\n            entityset (:class:`.EntitySet`): EntitySet to which the relationship belongs\n            parent_dataframe_name (str): Name of the parent dataframe in the EntitySet\n            parent_column_name (str): Name of the parent column\n            child_dataframe_name (str): Name of the child dataframe in the EntitySet\n            child_column_name (str): Name of the child column\n        \"\"\"\n\n        self.entityset = entityset\n        self._parent_dataframe_name = parent_dataframe_name\n        self._child_dataframe_name = child_dataframe_name\n        self._parent_column_name = parent_column_name\n        self._child_column_name = child_column_name\n\n        if (\n            self.parent_dataframe.ww.index is not None\n            and self._parent_column_name != self.parent_dataframe.ww.index\n        ):\n            raise AttributeError(\n                f\"Parent column '{self._parent_column_name}' is not the index of \"\n                f\"dataframe {self._parent_dataframe_name}\",\n            )\n\n    @classmethod\n    def from_dictionary(cls, arguments, es):\n        parent_dataframe = arguments[\"parent_dataframe_name\"]\n        child_dataframe = arguments[\"child_dataframe_name\"]\n        parent_column = arguments[\"parent_column_name\"]\n        child_column = arguments[\"child_column_name\"]\n        return cls(es, parent_dataframe, parent_column, child_dataframe, child_column)\n\n    def __repr__(self):\n        ret = \"<Relationship: %s.%s -> %s.%s>\" % (\n            self._child_dataframe_name,\n            self._child_column_name,\n    
        self._parent_dataframe_name,\n            self._parent_column_name,\n        )\n\n        return ret\n\n    def __eq__(self, other):\n        if not isinstance(other, self.__class__):\n            return False\n\n        return (\n            self._parent_dataframe_name == other._parent_dataframe_name\n            and self._child_dataframe_name == other._child_dataframe_name\n            and self._parent_column_name == other._parent_column_name\n            and self._child_column_name == other._child_column_name\n        )\n\n    def __hash__(self):\n        return hash(\n            (\n                self._parent_dataframe_name,\n                self._child_dataframe_name,\n                self._parent_column_name,\n                self._child_column_name,\n            ),\n        )\n\n    @property\n    def parent_dataframe(self):\n        \"\"\"Parent dataframe object\"\"\"\n        return self.entityset[self._parent_dataframe_name]\n\n    @property\n    def child_dataframe(self):\n        \"\"\"Child dataframe object\"\"\"\n        return self.entityset[self._child_dataframe_name]\n\n    @property\n    def parent_column(self):\n        \"\"\"Column in parent dataframe\"\"\"\n        return self.parent_dataframe.ww[self._parent_column_name]\n\n    @property\n    def child_column(self):\n        \"\"\"Column in child dataframe\"\"\"\n        return self.child_dataframe.ww[self._child_column_name]\n\n    @property\n    def parent_name(self):\n        \"\"\"The name of the parent, relative to the child.\"\"\"\n        if self._is_unique():\n            return self._parent_dataframe_name\n        else:\n            return \"%s[%s]\" % (self._parent_dataframe_name, self._child_column_name)\n\n    @property\n    def child_name(self):\n        \"\"\"The name of the child, relative to the parent.\"\"\"\n        if self._is_unique():\n            return self._child_dataframe_name\n        else:\n            return \"%s[%s]\" % (self._child_dataframe_name, 
self._child_column_name)\n\n    def to_dictionary(self):\n        return {\n            \"parent_dataframe_name\": self._parent_dataframe_name,\n            \"child_dataframe_name\": self._child_dataframe_name,\n            \"parent_column_name\": self._parent_column_name,\n            \"child_column_name\": self._child_column_name,\n        }\n\n    def _is_unique(self):\n        \"\"\"Is there any other relationship with same parent and child dataframes?\"\"\"\n        es = self.entityset\n        relationships = es.get_forward_relationships(self._child_dataframe_name)\n        n = len(\n            [\n                r\n                for r in relationships\n                if r._parent_dataframe_name == self._parent_dataframe_name\n            ],\n        )\n\n        assert n > 0, \"This relationship is missing from the entityset\"\n\n        return n == 1\n\n\nclass RelationshipPath(object):\n    def __init__(self, relationships_with_direction):\n        self._relationships_with_direction = relationships_with_direction\n\n    @property\n    def name(self):\n        relationship_names = [\n            _direction_name(is_forward, r)\n            for is_forward, r in self._relationships_with_direction\n        ]\n\n        return \".\".join(relationship_names)\n\n    def dataframes(self):\n        if self:\n            # Yield first dataframe.\n            is_forward, relationship = self[0]\n            if is_forward:\n                yield relationship._child_dataframe_name\n            else:\n                yield relationship._parent_dataframe_name\n\n        # Yield the dataframe pointed to by each relationship.\n        for is_forward, relationship in self:\n            if is_forward:\n                yield relationship._parent_dataframe_name\n            else:\n                yield relationship._child_dataframe_name\n\n    def __add__(self, other):\n        return RelationshipPath(\n            self._relationships_with_direction + 
other._relationships_with_direction,\n        )\n\n    def __getitem__(self, index):\n        return self._relationships_with_direction[index]\n\n    def __iter__(self):\n        for is_forward, relationship in self._relationships_with_direction:\n            yield is_forward, relationship\n\n    def __len__(self):\n        return len(self._relationships_with_direction)\n\n    def __eq__(self, other):\n        return (\n            isinstance(other, RelationshipPath)\n            and self._relationships_with_direction\n            == other._relationships_with_direction\n        )\n\n    def __ne__(self, other):\n        return not self == other\n\n    def __repr__(self):\n        if self._relationships_with_direction:\n            path = \"%s.%s\" % (next(self.dataframes()), self.name)\n        else:\n            path = \"[]\"\n        return \"<RelationshipPath %s>\" % path\n\n\ndef _direction_name(is_forward, relationship):\n    if is_forward:\n        return relationship.parent_name\n    else:\n        return relationship.child_name\n"
  },
  {
    "path": "featuretools/entityset/serialize.py",
    "content": "import datetime\nimport json\nimport os\nimport tarfile\nimport tempfile\n\nfrom woodwork.serializers.serializer_base import typing_info_to_dict\n\nfrom featuretools.utils.s3_utils import get_transport_params, use_smartopen_es\nfrom featuretools.utils.wrangle import _is_s3, _is_url\nfrom featuretools.version import ENTITYSET_SCHEMA_VERSION\n\nFORMATS = [\"csv\", \"pickle\", \"parquet\"]\n\n\ndef entityset_to_description(entityset, format=None):\n    \"\"\"Serialize entityset to data description.\n\n    Args:\n        entityset (EntitySet) : Instance of :class:`.EntitySet`.\n\n    Returns:\n        description (dict) : Description of :class:`.EntitySet`.\n    \"\"\"\n\n    dataframes = {\n        dataframe.ww.name: typing_info_to_dict(dataframe)\n        for dataframe in entityset.dataframes\n    }\n    relationships = [\n        relationship.to_dictionary() for relationship in entityset.relationships\n    ]\n\n    data_description = {\n        \"schema_version\": ENTITYSET_SCHEMA_VERSION,\n        \"id\": entityset.id,\n        \"dataframes\": dataframes,\n        \"relationships\": relationships,\n        \"format\": format,\n    }\n    return data_description\n\n\ndef write_data_description(entityset, path, profile_name=None, **kwargs):\n    \"\"\"Serialize entityset to data description and write to disk or S3 path.\n\n    Args:\n        entityset (EntitySet) : Instance of :class:`.EntitySet`.\n        path (str) : Location on disk or S3 path to write `data_description.json` and dataframe data.\n        profile_name (str, bool): The AWS profile specified to write to S3. 
Will default to None and search for AWS credentials.\n            Set to False to use an anonymous profile.\n        kwargs (keywords) : Additional keyword arguments to pass to the underlying serialization method or to specify AWS profile.\n    \"\"\"\n    if _is_s3(path):\n        with tempfile.TemporaryDirectory() as tmpdir:\n            os.makedirs(os.path.join(tmpdir, \"data\"))\n            dump_data_description(entityset, tmpdir, **kwargs)\n            file_path = create_archive(tmpdir)\n\n            transport_params = get_transport_params(profile_name)\n            use_smartopen_es(\n                file_path,\n                path,\n                read=False,\n                transport_params=transport_params,\n            )\n    elif _is_url(path):\n        raise ValueError(\"Writing to URLs is not supported\")\n    else:\n        path = os.path.abspath(path)\n        os.makedirs(os.path.join(path, \"data\"), exist_ok=True)\n        dump_data_description(entityset, path, **kwargs)\n\n\ndef dump_data_description(entityset, path, **kwargs):\n    format = kwargs.get(\"format\")\n    description = entityset_to_description(entityset, format)\n    for df in entityset.dataframes:\n        data_path = os.path.join(path, \"data\", df.ww.name)\n        os.makedirs(os.path.join(data_path, \"data\"), exist_ok=True)\n        df.ww.to_disk(data_path, **kwargs)\n    # Use a separate name for the path so the file handle does not shadow it\n    description_path = os.path.join(path, \"data_description.json\")\n    with open(description_path, \"w\") as file:\n        json.dump(description, file)\n\n\ndef create_archive(tmpdir):\n    file_name = \"es-{date:%Y-%m-%d_%H%M%S}.tar\".format(date=datetime.datetime.now())\n    file_path = os.path.join(tmpdir, file_name)\n    # The context manager closes the archive even if adding a member fails\n    with tarfile.open(str(file_path), \"w\") as tar:\n        tar.add(str(tmpdir) + \"/data_description.json\", arcname=\"/data_description.json\")\n        tar.add(str(tmpdir) + \"/data\", arcname=\"/data\")\n    return file_path\n"
  },
  {
    "path": "featuretools/entityset/timedelta.py",
    "content": "import pandas as pd\nfrom dateutil.relativedelta import relativedelta\n\n\nclass Timedelta(object):\n    \"\"\"Represents differences in time.\n\n    Timedeltas can be defined in multiple units. Supported units:\n\n    - \"ms\" : milliseconds\n    - \"s\" : seconds\n    - \"h\" : hours\n    - \"m\" : minutes\n    - \"d\" : days\n    - \"w\" : weeks\n    - \"o\"/\"observations\" : number of individual events\n    - \"mo\" : months\n    - \"Y\" : years\n\n    Timedeltas can also be defined in terms of observations. In this case, the\n    Timedelta represents the period spanned by `value` observations.\n\n    For observation timedeltas:\n    >>> three_observations_log = Timedelta(3, \"observations\")\n    >>> three_observations_log.get_name()\n    '3 Observations'\n    \"\"\"\n\n    _Observations = \"o\"\n\n    # units for absolute times\n    _absolute_units = [\"ms\", \"s\", \"h\", \"m\", \"d\", \"w\"]\n    _relative_units = [\"mo\", \"Y\"]\n\n    _readable_units = {\n        \"ms\": \"Milliseconds\",\n        \"s\": \"Seconds\",\n        \"h\": \"Hours\",\n        \"m\": \"Minutes\",\n        \"d\": \"Days\",\n        \"o\": \"Observations\",\n        \"w\": \"Weeks\",\n        \"Y\": \"Years\",\n        \"mo\": \"Months\",\n    }\n\n    _readable_to_unit = {v.lower(): k for k, v in _readable_units.items()}\n\n    def __init__(self, value, unit=None, delta_obj=None):\n        \"\"\"\n        Args:\n            value (float, str, dict) : Value of timedelta, string providing\n                both unit and value, or a dictionary of units and times.\n            unit (str) : Unit of time delta.\n            delta_obj (pd.Timedelta or pd.DateOffset) : A time object used\n                internally to do time operations. 
If None is provided, one will\n                be created using the provided value and unit.\n        \"\"\"\n        self.check_value(value, unit)\n        self.times = self.fix_units()\n\n        if delta_obj is not None:\n            self.delta_obj = delta_obj\n        else:\n            self.delta_obj = self.get_unit_type()\n\n    @classmethod\n    def from_dictionary(cls, dictionary):\n        dict_units = dictionary[\"unit\"]\n        dict_values = dictionary[\"value\"]\n        if isinstance(dict_units, str) and isinstance(dict_values, (int, float)):\n            return cls({dict_units: dict_values})\n        else:\n            all_units = dict()\n            for i in range(len(dict_units)):\n                all_units[dict_units[i]] = dict_values[i]\n            return cls(all_units)\n\n    @classmethod\n    def make_singular(cls, s):\n        if len(s) > 1 and s.endswith(\"s\"):\n            return s[:-1]\n        return s\n\n    @classmethod\n    def _check_unit_plural(cls, s):\n        if len(s) > 2 and not s.endswith(\"s\"):\n            return (s + \"s\").lower()\n        elif len(s) > 1:\n            return s.lower()\n        return s\n\n    def get_value(self, unit=None):\n        if unit is not None:\n            return self.times[unit]\n        elif len(self.times.values()) == 1:\n            return list(self.times.values())[0]\n        else:\n            return self.times\n\n    def get_units(self):\n        return list(self.times.keys())\n\n    def get_unit_type(self):\n        all_units = self.get_units()\n        if self._Observations in all_units:\n            return None\n        elif self.is_absolute() and self.has_multiple_units() is False:\n            return pd.Timedelta(self.times[all_units[0]], all_units[0])\n        else:\n            readable_times = self.lower_readable_times()\n            return relativedelta(**readable_times)\n\n    def check_value(self, value, unit):\n        if isinstance(value, str):\n            from 
featuretools.utils.wrangle import _check_timedelta\n\n            td = _check_timedelta(value)\n            self.times = td.times\n        elif isinstance(value, dict):\n            self.times = value\n        else:\n            self.times = {unit: value}\n\n    def fix_units(self):\n        fixed_units = dict()\n        for unit, value in self.times.items():\n            unit = self._check_unit_plural(unit)\n            if unit in self._readable_to_unit:\n                unit = self._readable_to_unit[unit]\n            fixed_units[unit] = value\n        return fixed_units\n\n    def lower_readable_times(self):\n        readable_times = dict()\n        for unit, value in self.times.items():\n            readable_unit = self._readable_units[unit].lower()\n            readable_times[readable_unit] = value\n        return readable_times\n\n    def get_name(self):\n        all_units = self.get_units()\n        if self.has_multiple_units() is False:\n            return \"{} {}\".format(\n                self.times[all_units[0]],\n                self._readable_units[all_units[0]],\n            )\n        final_str = \"\"\n        for unit, value in self.times.items():\n            if value == 1:\n                unit = self.make_singular(unit)\n            final_str += \"{} {} \".format(value, self._readable_units[unit])\n        return final_str[:-1]\n\n    def get_arguments(self):\n        units = list()\n        values = list()\n        for unit, value in self.times.items():\n            units.append(unit)\n            values.append(value)\n        if len(units) == 1:\n            return {\"unit\": units[0], \"value\": values[0]}\n        else:\n            return {\"unit\": units, \"value\": values}\n\n    def is_absolute(self):\n        for unit in self.get_units():\n            if unit not in self._absolute_units:\n                return False\n        return True\n\n    def has_no_observations(self):\n        for unit in self.get_units():\n            if unit in 
self._Observations:\n                return False\n        return True\n\n    def has_multiple_units(self):\n        return len(self.get_units()) > 1\n\n    def __eq__(self, other):\n        if not isinstance(other, Timedelta):\n            return False\n\n        return self.times == other.times\n\n    def __neg__(self):\n        \"\"\"Negate the timedelta\"\"\"\n        new_times = dict()\n        for unit, value in self.times.items():\n            new_times[unit] = -value\n        if self.delta_obj is not None:\n            return Timedelta(new_times, delta_obj=-self.delta_obj)\n        else:\n            return Timedelta(new_times)\n\n    def __radd__(self, time):\n        \"\"\"Add the Timedelta to a timestamp value\"\"\"\n        if self._Observations not in self.get_units():\n            return time + self.delta_obj\n        else:\n            raise Exception(\"Invalid unit\")\n\n    def __rsub__(self, time):\n        \"\"\"Subtract the Timedelta from a timestamp value\"\"\"\n        if self._Observations not in self.get_units():\n            return time - self.delta_obj\n        else:\n            raise Exception(\"Invalid unit\")\n"
  },
  {
    "path": "featuretools/exceptions.py",
    "content": "class UnknownFeature(Exception):\n    def __init__(self, *args, **kwargs):\n        Exception.__init__(self, *args, **kwargs)\n\n\nclass UnusedPrimitiveWarning(UserWarning):\n    pass\n"
  },
  {
    "path": "featuretools/feature_base/__init__.py",
    "content": "# flake8: noqa\nfrom featuretools.feature_base.api import *\n"
  },
  {
    "path": "featuretools/feature_base/api.py",
    "content": "# flake8: noqa\nfrom featuretools.feature_base.feature_base import (\n    AggregationFeature,\n    DirectFeature,\n    Feature,\n    FeatureBase,\n    FeatureOutputSlice,\n    GroupByTransformFeature,\n    IdentityFeature,\n    TransformFeature,\n)\nfrom featuretools.feature_base.feature_descriptions import describe_feature\nfrom featuretools.feature_base.feature_visualizer import graph_feature\nfrom featuretools.feature_base.features_deserializer import load_features\nfrom featuretools.feature_base.features_serializer import save_features\n"
  },
  {
    "path": "featuretools/feature_base/cache.py",
    "content": "\"\"\"\ncache.py\n\nCustom caching class, currently used for FeatureBase\n\"\"\"\n\n# needed for defaultdict annotation if < python 3.9\nfrom __future__ import annotations\n\nfrom collections import defaultdict\nfrom dataclasses import dataclass, field\nfrom enum import Enum\nfrom typing import Any, List, Optional, Union\n\n\nclass CacheType(Enum):\n    \"\"\"Enumerates the supported cache types\"\"\"\n\n    DEPENDENCY = 1\n    DEPTH = 2\n\n\n@dataclass()\nclass FeatureCache:\n    \"\"\"Provides caching for the defined types\"\"\"\n\n    enabled: bool = False\n    cache: defaultdict[dict] = field(default_factory=lambda: defaultdict(dict))\n\n    def get(\n        self,\n        cache_type: CacheType,\n        hashkey: int,\n    ) -> Optional[Union[List[Any], Any]]:\n        \"\"\"Gets the cache entry, if enabled and defined\n\n        Args:\n            cache_type (CacheType): type of cache\n            hashkey (int): hash key\n\n        Returns:\n            Optional[Union[List[Any], Any]]: payload assigned to the hashkey\n        \"\"\"\n        if not self.enabled or cache_type not in self.cache:\n            return None\n        return self.cache[cache_type].get(hashkey, None)\n\n    def add(self, cache_type: CacheType, hashkey: int, payload: Any):\n        \"\"\"Adds an entry to the cache, if enabled\n\n        Args:\n            cache_type (CacheType): type of cache\n            hashkey (int): hash key\n            payload (Any): payload to assign\n        \"\"\"\n        if self.enabled:\n            self.cache[cache_type][hashkey] = payload\n\n    def clear_all(self):\n        \"\"\"Clears the cache collections\"\"\"\n        self.cache.clear()\n\n\nfeature_cache = FeatureCache()\n"
  },
  {
    "path": "featuretools/feature_base/feature_base.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Boolean, BooleanNullable\n\nfrom featuretools import primitives\nfrom featuretools.entityset.relationship import Relationship, RelationshipPath\nfrom featuretools.entityset.timedelta import Timedelta\nfrom featuretools.feature_base.utils import is_valid_input\nfrom featuretools.primitives.base import (\n    AggregationPrimitive,\n    PrimitiveBase,\n    TransformPrimitive,\n)\nfrom featuretools.utils.wrangle import _check_time_against_column, _check_timedelta\n\n_ES_REF = {}\n\n\nclass FeatureBase(object):\n    def __init__(\n        self,\n        dataframe,\n        base_features,\n        relationship_path,\n        primitive,\n        name=None,\n        names=None,\n    ):\n        \"\"\"Base class for all features\n\n        Args:\n            entityset (EntitySet): entityset this feature is being calculated for\n            dataframe (DataFrame): dataframe for calculating this feature\n            base_features (list[FeatureBase]): list of base features for primitive\n            relationship_path (RelationshipPath): path from this dataframe to the\n                dataframe of the base features.\n            primitive (:class:`.PrimitiveBase`): primitive to calculate. 
if not initialized when passed, gets initialized with no arguments\n        \"\"\"\n        assert all(\n            isinstance(f, FeatureBase) for f in base_features\n        ), \"All base features must be features\"\n\n        self.dataframe_name = dataframe.ww.name\n        self.entityset = _ES_REF[dataframe.ww.metadata[\"entityset_id\"]]\n\n        self.base_features = base_features\n\n        # initialize if not already initialized\n        if not isinstance(primitive, PrimitiveBase):\n            primitive = primitive()\n\n        self.primitive = primitive\n\n        self.relationship_path = relationship_path\n\n        self._name = name\n\n        self._names = names\n\n        assert self._check_input_types(), (\n            \"Provided inputs don't match input \" \"type requirements\"\n        )\n\n    def __getitem__(self, key):\n        assert (\n            self.number_output_features > 1\n        ), \"can only access slice of multi-output feature\"\n        assert (\n            self.number_output_features > key\n        ), \"index is higher than the number of outputs\"\n        return FeatureOutputSlice(self, key)\n\n    @classmethod\n    def from_dictionary(cls, arguments, entityset, dependencies, primitive):\n        raise NotImplementedError(\"Must define from_dictionary on FeatureBase subclass\")\n\n    def rename(self, name):\n        \"\"\"Rename Feature, returns copy. 
Will reset any custom feature column names\n        to their default value.\"\"\"\n        feature_copy = self.copy()\n        feature_copy._name = name\n        feature_copy._names = None\n        return feature_copy\n\n    def copy(self):\n        raise NotImplementedError(\"Must define copy on FeatureBase subclass\")\n\n    def get_name(self):\n        if not self._name:\n            self._name = self.generate_name()\n        return self._name\n\n    def get_feature_names(self):\n        if not self._names:\n            if self.number_output_features == 1:\n                self._names = [self.get_name()]\n            else:\n                self._names = self.generate_names()\n                if self.get_name() != self.generate_name():\n                    self._names = [\n                        self.get_name() + \"[{}]\".format(i)\n                        for i in range(len(self._names))\n                    ]\n        return self._names\n\n    def set_feature_names(self, names):\n        \"\"\"Set new values for the feature column names, overriding the default values.\n        Number of names provided must match the number of output columns defined for\n        the feature, and all provided names should be unique. Only works for features\n        that have more than one output column. Use ``Feature.rename`` to change the column\n        name for single output features.\n\n        Args:\n            names (list[str]): List of names to use for the output feature columns. 
Provided\n                names must be unique.\n        \"\"\"\n        if self.number_output_features == 1:\n            raise ValueError(\n                \"The set_feature_names can only be used on features that have more than one output column.\",\n            )\n\n        num_new_names = len(names)\n        if self.number_output_features != num_new_names:\n            raise ValueError(\n                \"Number of names provided must match the number of output features:\"\n                f\" {num_new_names} name(s) provided, {self.number_output_features} expected.\",\n            )\n\n        if len(set(names)) != num_new_names:\n            raise ValueError(\"Provided output feature names must be unique.\")\n\n        self._names = names\n\n    def get_function(self, **kwargs):\n        return self.primitive.get_function(**kwargs)\n\n    def get_dependencies(self, deep=False, ignored=None, copy=True):\n        \"\"\"Returns features that are used to calculate this feature\n\n        .. note::\n\n            If you only want the features that make up the input to the feature\n            function, use the base_features attribute instead.\n\n\n        \"\"\"\n        deps = []\n\n        for d in self.base_features[:]:\n            deps += [d]\n\n        if hasattr(self, \"where\") and self.where:\n            deps += [self.where]\n\n        if ignored is None:\n            ignored = set([])\n        deps = [d for d in deps if d.unique_name() not in ignored]\n\n        if deep:\n            for dep in deps[:]:  # copy so we don't modify list we iterate over\n                deep_deps = dep.get_dependencies(deep, ignored)\n                deps += deep_deps\n\n        return deps\n\n    def get_depth(self, stop_at=None):\n        \"\"\"Returns depth of feature\"\"\"\n        max_depth = 0\n        stop_at_set = set()\n        if stop_at is not None:\n            stop_at_set = set([i.unique_name() for i in stop_at])\n            if self.unique_name() in 
stop_at_set:\n                return 0\n        for dep in self.get_dependencies(deep=True, ignored=stop_at_set):\n            max_depth = max(dep.get_depth(stop_at=stop_at), max_depth)\n        return max_depth + 1\n\n    def _check_input_types(self):\n        if len(self.base_features) == 0:\n            return True\n\n        input_types = self.primitive.input_types\n        if input_types is not None:\n            if not isinstance(input_types[0], list):\n                input_types = [input_types]\n\n            for t in input_types:\n                zipped = list(zip(t, self.base_features))\n                if all([is_valid_input(f.column_schema, t) for t, f in zipped]):\n                    return True\n        else:\n            return True\n        return False\n\n    @property\n    def dataframe(self):\n        \"\"\"Dataframe this feature belongs to\"\"\"\n        return self.entityset[self.dataframe_name]\n\n    @property\n    def number_output_features(self):\n        return self.primitive.number_output_features\n\n    def __repr__(self):\n        return \"<Feature: %s>\" % (self.get_name())\n\n    def hash(self):\n        return hash(self.get_name() + self.dataframe_name)\n\n    def __hash__(self):\n        return self.hash()\n\n    @property\n    def column_schema(self):\n        feature = self\n        column_schema = self.primitive.return_type\n\n        while column_schema is None:\n            # get column_schema of first base feature\n            base_feature = feature.base_features[0]\n            column_schema = base_feature.column_schema\n\n            # only the original time index should exist\n            # so make this feature's return type just a Datetime\n            if \"time_index\" in column_schema.semantic_tags:\n                column_schema = ColumnSchema(\n                    logical_type=column_schema.logical_type,\n                    semantic_tags=column_schema.semantic_tags - {\"time_index\"},\n                )\n            
elif \"index\" in column_schema.semantic_tags:\n                column_schema = ColumnSchema(\n                    logical_type=column_schema.logical_type,\n                    semantic_tags=column_schema.semantic_tags - {\"index\"},\n                )\n                # Need to add back in the numeric standard tag so the schema can get recognized\n                # as a valid return type\n                if column_schema.is_numeric:\n                    column_schema.semantic_tags.add(\"numeric\")\n                if column_schema.is_categorical:\n                    column_schema.semantic_tags.add(\"category\")\n\n            # direct features should keep the foreign key tag, but all other features should get converted\n            if (\n                not isinstance(feature, DirectFeature)\n                and \"foreign_key\" in column_schema.semantic_tags\n            ):\n                column_schema = ColumnSchema(\n                    logical_type=column_schema.logical_type,\n                    semantic_tags=column_schema.semantic_tags - {\"foreign_key\"},\n                )\n\n            feature = base_feature\n\n        return column_schema\n\n    @property\n    def default_value(self):\n        return self.primitive.default_value\n\n    def get_arguments(self):\n        raise NotImplementedError(\"Must define get_arguments on FeatureBase subclass\")\n\n    def to_dictionary(self):\n        return {\n            \"type\": type(self).__name__,\n            \"dependencies\": [dep.unique_name() for dep in self.get_dependencies()],\n            \"arguments\": self.get_arguments(),\n        }\n\n    def _handle_binary_comparison(self, other, Primitive, PrimitiveScalar):\n        if isinstance(other, FeatureBase):\n            return Feature([self, other], primitive=Primitive)\n\n        return Feature([self], primitive=PrimitiveScalar(other))\n\n    def __eq__(self, other):\n        \"\"\"Compares to other by equality\"\"\"\n        return 
self._handle_binary_comparison(\n            other,\n            primitives.Equal,\n            primitives.EqualScalar,\n        )\n\n    def __ne__(self, other):\n        \"\"\"Compares to other by non-equality\"\"\"\n        return self._handle_binary_comparison(\n            other,\n            primitives.NotEqual,\n            primitives.NotEqualScalar,\n        )\n\n    def __gt__(self, other):\n        \"\"\"Compares if greater than other\"\"\"\n        return self._handle_binary_comparison(\n            other,\n            primitives.GreaterThan,\n            primitives.GreaterThanScalar,\n        )\n\n    def __ge__(self, other):\n        \"\"\"Compares if greater than or equal to other\"\"\"\n        return self._handle_binary_comparison(\n            other,\n            primitives.GreaterThanEqualTo,\n            primitives.GreaterThanEqualToScalar,\n        )\n\n    def __lt__(self, other):\n        \"\"\"Compares if less than other\"\"\"\n        return self._handle_binary_comparison(\n            other,\n            primitives.LessThan,\n            primitives.LessThanScalar,\n        )\n\n    def __le__(self, other):\n        \"\"\"Compares if less than or equal to other\"\"\"\n        return self._handle_binary_comparison(\n            other,\n            primitives.LessThanEqualTo,\n            primitives.LessThanEqualToScalar,\n        )\n\n    def __add__(self, other):\n        \"\"\"Add other\"\"\"\n        return self._handle_binary_comparison(\n            other,\n            primitives.AddNumeric,\n            primitives.AddNumericScalar,\n        )\n\n    def __radd__(self, other):\n        return self.__add__(other)\n\n    def __sub__(self, other):\n        \"\"\"Subtract other\"\"\"\n        return self._handle_binary_comparison(\n            other,\n            primitives.SubtractNumeric,\n            primitives.SubtractNumericScalar,\n        )\n\n    def __rsub__(self, other):\n        return Feature([self], 
primitive=primitives.ScalarSubtractNumericFeature(other))\n\n    def __div__(self, other):\n        \"\"\"Divide by other\"\"\"\n        return self._handle_binary_comparison(\n            other,\n            primitives.DivideNumeric,\n            primitives.DivideNumericScalar,\n        )\n\n    def __truediv__(self, other):\n        return self.__div__(other)\n\n    def __rtruediv__(self, other):\n        return self.__rdiv__(other)\n\n    def __rdiv__(self, other):\n        return Feature([self], primitive=primitives.DivideByFeature(other))\n\n    def __mul__(self, other):\n        \"\"\"Multiply by other\"\"\"\n        if isinstance(other, FeatureBase):\n            if all(\n                [\n                    isinstance(f.column_schema.logical_type, (Boolean, BooleanNullable))\n                    for f in (self, other)\n                ],\n            ):\n                return Feature([self, other], primitive=primitives.MultiplyBoolean)\n            if (\n                \"numeric\" in self.column_schema.semantic_tags\n                and isinstance(\n                    other.column_schema.logical_type,\n                    (Boolean, BooleanNullable),\n                )\n                or \"numeric\" in other.column_schema.semantic_tags\n                and isinstance(\n                    self.column_schema.logical_type,\n                    (Boolean, BooleanNullable),\n                )\n            ):\n                return Feature(\n                    [self, other],\n                    primitive=primitives.MultiplyNumericBoolean,\n                )\n        return self._handle_binary_comparison(\n            other,\n            primitives.MultiplyNumeric,\n            primitives.MultiplyNumericScalar,\n        )\n\n    def __rmul__(self, other):\n        return self.__mul__(other)\n\n    def __mod__(self, other):\n        \"\"\"Take modulus of other\"\"\"\n        return self._handle_binary_comparison(\n            other,\n            
primitives.ModuloNumeric,\n            primitives.ModuloNumericScalar,\n        )\n\n    def __rmod__(self, other):\n        return Feature([self], primitive=primitives.ModuloByFeature(other))\n\n    def __and__(self, other):\n        return self.AND(other)\n\n    def __rand__(self, other):\n        return Feature([other, self], primitive=primitives.And)\n\n    def __or__(self, other):\n        return self.OR(other)\n\n    def __ror__(self, other):\n        return Feature([other, self], primitive=primitives.Or)\n\n    def __not__(self):\n        # NOT() takes no argument besides self\n        return self.NOT()\n\n    def __abs__(self):\n        return Feature([self], primitive=primitives.Absolute)\n\n    def __neg__(self):\n        return Feature([self], primitive=primitives.Negate)\n\n    def AND(self, other_feature):\n        \"\"\"Logical AND with other_feature\"\"\"\n        return Feature([self, other_feature], primitive=primitives.And)\n\n    def OR(self, other_feature):\n        \"\"\"Logical OR with other_feature\"\"\"\n        return Feature([self, other_feature], primitive=primitives.Or)\n\n    def NOT(self):\n        \"\"\"Creates inverse of feature\"\"\"\n        return Feature([self], primitive=primitives.Not)\n\n    def isin(self, list_of_output):\n        return Feature(\n            [self],\n            primitive=primitives.IsIn(list_of_outputs=list_of_output),\n        )\n\n    def is_null(self):\n        \"\"\"Compares feature to null by equality\"\"\"\n        return Feature([self], primitive=primitives.IsNull)\n\n    def __invert__(self):\n        return self.NOT()\n\n    def unique_name(self):\n        return \"%s: %s\" % (self.dataframe_name, self.get_name())\n\n    def relationship_path_name(self):\n        return self.relationship_path.name\n\n\nclass IdentityFeature(FeatureBase):\n    \"\"\"Feature for dataframe that is equivalent to underlying column\"\"\"\n\n    def __init__(self, column, name=None):\n        self.column_name = column.ww.name\n        self.return_type = 
column.ww.schema\n\n        metadata = column.ww.schema._metadata\n        es = _ES_REF[metadata[\"entityset_id\"]]\n        super(IdentityFeature, self).__init__(\n            dataframe=es[metadata[\"dataframe_name\"]],\n            base_features=[],\n            relationship_path=RelationshipPath([]),\n            primitive=PrimitiveBase,\n            name=name,\n        )\n\n    @classmethod\n    def from_dictionary(cls, arguments, entityset, dependencies, primitive):\n        dataframe_name = arguments[\"dataframe_name\"]\n        column_name = arguments[\"column_name\"]\n        column = entityset[dataframe_name].ww[column_name]\n        return cls(column=column, name=arguments[\"name\"])\n\n    def copy(self):\n        \"\"\"Return copy of feature\"\"\"\n        return IdentityFeature(self.entityset[self.dataframe_name].ww[self.column_name])\n\n    def generate_name(self):\n        return self.column_name\n\n    def get_depth(self, stop_at=None):\n        return 0\n\n    def get_arguments(self):\n        return {\n            \"name\": self.get_name(),\n            \"column_name\": self.column_name,\n            \"dataframe_name\": self.dataframe_name,\n        }\n\n    @property\n    def column_schema(self):\n        return self.return_type\n\n\nclass DirectFeature(FeatureBase):\n    \"\"\"Feature for child dataframe that inherits\n    a feature value from a parent dataframe\"\"\"\n\n    input_types = [ColumnSchema()]\n    return_type = None\n\n    def __init__(\n        self,\n        base_feature,\n        child_dataframe_name,\n        relationship=None,\n        name=None,\n    ):\n        base_feature = _validate_base_features(base_feature)[0]\n        self.parent_dataframe_name = base_feature.dataframe_name\n        relationship = self._handle_relationship(\n            base_feature.entityset,\n            child_dataframe_name,\n            relationship,\n        )\n        child_dataframe = base_feature.entityset[child_dataframe_name]\n        
super(DirectFeature, self).__init__(\n            dataframe=child_dataframe,\n            base_features=[base_feature],\n            relationship_path=RelationshipPath([(True, relationship)]),\n            primitive=PrimitiveBase,\n            name=name,\n        )\n\n    def _handle_relationship(self, entityset, child_dataframe_name, relationship):\n        child_dataframe = entityset[child_dataframe_name]\n        if relationship:\n            relationship_child = relationship.child_dataframe\n            assert (\n                child_dataframe.ww.name == relationship_child.ww.name\n            ), \"child_dataframe must be the relationship child dataframe\"\n\n            assert (\n                self.parent_dataframe_name == relationship.parent_dataframe.ww.name\n            ), \"Base feature must be defined on the relationship parent dataframe\"\n        else:\n            child_relationships = entityset.get_forward_relationships(\n                child_dataframe.ww.name,\n            )\n            possible_relationships = (\n                r\n                for r in child_relationships\n                if r.parent_dataframe.ww.name == self.parent_dataframe_name\n            )\n            relationship = next(possible_relationships, None)\n\n            if not relationship:\n                raise RuntimeError(\n                    'No relationship from \"%s\" to \"%s\" found.'\n                    % (child_dataframe.ww.name, self.parent_dataframe_name),\n                )\n\n            # Check for another path.\n            elif next(possible_relationships, None):\n                message = (\n                    \"There are multiple relationships to the base dataframe. 
\"\n                    \"You must specify a relationship.\"\n                )\n                raise RuntimeError(message)\n\n        return relationship\n\n    @classmethod\n    def from_dictionary(cls, arguments, entityset, dependencies, primitive):\n        base_feature = dependencies[arguments[\"base_feature\"]]\n        relationship = Relationship.from_dictionary(\n            arguments[\"relationship\"],\n            entityset,\n        )\n        child_dataframe_name = relationship.child_dataframe.ww.name\n        return cls(\n            base_feature=base_feature,\n            child_dataframe_name=child_dataframe_name,\n            relationship=relationship,\n            name=arguments[\"name\"],\n        )\n\n    @property\n    def number_output_features(self):\n        return self.base_features[0].number_output_features\n\n    @property\n    def default_value(self):\n        return self.base_features[0].default_value\n\n    def copy(self):\n        \"\"\"Return copy of feature\"\"\"\n        _is_forward, relationship = self.relationship_path[0]\n        return DirectFeature(\n            self.base_features[0],\n            self.dataframe_name,\n            relationship=relationship,\n        )\n\n    @property\n    def column_schema(self):\n        return self.base_features[0].column_schema\n\n    def generate_name(self):\n        return self._name_from_base(self.base_features[0].get_name())\n\n    def generate_names(self):\n        return [\n            self._name_from_base(base_name)\n            for base_name in self.base_features[0].get_feature_names()\n        ]\n\n    def get_arguments(self):\n        _is_forward, relationship = self.relationship_path[0]\n        return {\n            \"name\": self.get_name(),\n            \"base_feature\": self.base_features[0].unique_name(),\n            \"relationship\": relationship.to_dictionary(),\n        }\n\n    def _name_from_base(self, base_name):\n        return \"%s.%s\" % 
(self.relationship_path_name(), base_name)\n\n\nclass AggregationFeature(FeatureBase):\n    # Feature to condition this feature by in\n    # computation (e.g. take the Count of products where the product_id is\n    # \"basketball\".)\n    where = None\n    #: (str or :class:`.Timedelta`): Use only some amount of previous data from\n    # each time point during calculation\n    use_previous = None\n\n    def __init__(\n        self,\n        base_features,\n        parent_dataframe_name,\n        primitive,\n        relationship_path=None,\n        use_previous=None,\n        where=None,\n        name=None,\n    ):\n        base_features = _validate_base_features(base_features)\n\n        for bf in base_features:\n            if bf.number_output_features > 1:\n                raise ValueError(\"Cannot stack on whole multi-output feature.\")\n\n        self.child_dataframe_name = base_features[0].dataframe_name\n        entityset = base_features[0].entityset\n        relationship_path, self._path_is_unique = self._handle_relationship_path(\n            entityset,\n            parent_dataframe_name,\n            relationship_path,\n        )\n\n        self.parent_dataframe_name = parent_dataframe_name\n\n        if where is not None:\n            self.where = _validate_base_features(where)[0]\n            msg = \"Where feature must be defined on child dataframe {}\".format(\n                self.child_dataframe_name,\n            )\n            assert self.where.dataframe_name == self.child_dataframe_name, msg\n\n        if use_previous:\n            assert entityset[self.child_dataframe_name].ww.time_index is not None, (\n                \"Applying function that requires time index to dataframe that \"\n                \"doesn't have one\"\n            )\n            self.use_previous = _check_timedelta(use_previous)\n            assert len(base_features) > 0\n            time_index = base_features[0].dataframe.ww.time_index\n            time_col = 
base_features[0].dataframe.ww[time_index]\n            assert time_index is not None, (\n                \"Use previous can only be defined \" \"on dataframes with a time index\"\n            )\n            assert _check_time_against_column(self.use_previous, time_col)\n\n        super(AggregationFeature, self).__init__(\n            dataframe=entityset[parent_dataframe_name],\n            base_features=base_features,\n            relationship_path=relationship_path,\n            primitive=primitive,\n            name=name,\n        )\n\n    def _handle_relationship_path(\n        self,\n        entityset,\n        parent_dataframe_name,\n        relationship_path,\n    ):\n        parent_dataframe = entityset[parent_dataframe_name]\n        child_dataframe = entityset[self.child_dataframe_name]\n\n        if relationship_path:\n            assert all(\n                not is_forward for is_forward, _r in relationship_path\n            ), \"All relationships in path must be backward\"\n\n            _is_forward, first_relationship = relationship_path[0]\n            first_parent = first_relationship.parent_dataframe\n            assert (\n                parent_dataframe.ww.name == first_parent.ww.name\n            ), \"parent_dataframe must match first relationship in path.\"\n\n            _is_forward, last_relationship = relationship_path[-1]\n            assert (\n                child_dataframe.ww.name == last_relationship.child_dataframe.ww.name\n            ), \"Base feature must be defined on the dataframe at the end of relationship_path\"\n\n            path_is_unique = entityset.has_unique_forward_path(\n                child_dataframe.ww.name,\n                parent_dataframe.ww.name,\n            )\n        else:\n            paths = entityset.find_backward_paths(\n                parent_dataframe.ww.name,\n                child_dataframe.ww.name,\n            )\n            first_path = next(paths, None)\n\n            if not first_path:\n             
   raise RuntimeError(\n                    'No backward path from \"%s\" to \"%s\" found.'\n                    % (parent_dataframe.ww.name, child_dataframe.ww.name),\n                )\n            # Check for another path.\n            elif next(paths, None):\n                message = (\n                    \"There are multiple possible paths to the base dataframe. \"\n                    \"You must specify a relationship path.\"\n                )\n                raise RuntimeError(message)\n\n            relationship_path = RelationshipPath([(False, r) for r in first_path])\n            path_is_unique = True\n\n        return relationship_path, path_is_unique\n\n    @classmethod\n    def from_dictionary(cls, arguments, entityset, dependencies, primitive):\n        base_features = [dependencies[name] for name in arguments[\"base_features\"]]\n        relationship_path = [\n            Relationship.from_dictionary(r, entityset)\n            for r in arguments[\"relationship_path\"]\n        ]\n        parent_dataframe_name = relationship_path[0].parent_dataframe.ww.name\n        relationship_path = RelationshipPath([(False, r) for r in relationship_path])\n\n        use_previous_data = arguments[\"use_previous\"]\n        use_previous = use_previous_data and Timedelta.from_dictionary(\n            use_previous_data,\n        )\n\n        where_name = arguments[\"where\"]\n        where = where_name and dependencies[where_name]\n\n        feat = cls(\n            base_features=base_features,\n            parent_dataframe_name=parent_dataframe_name,\n            primitive=primitive,\n            relationship_path=relationship_path,\n            use_previous=use_previous,\n            where=where,\n            name=arguments[\"name\"],\n        )\n        feat._names = arguments.get(\"feature_names\")\n        return feat\n\n    def copy(self):\n        return AggregationFeature(\n            self.base_features,\n            
parent_dataframe_name=self.parent_dataframe_name,\n            relationship_path=self.relationship_path,\n            primitive=self.primitive,\n            use_previous=self.use_previous,\n            where=self.where,\n        )\n\n    def _where_str(self):\n        if self.where is not None:\n            where_str = \" WHERE \" + self.where.get_name()\n        else:\n            where_str = \"\"\n        return where_str\n\n    def _use_prev_str(self):\n        if self.use_previous is not None and hasattr(self.use_previous, \"get_name\"):\n            use_prev_str = \", Last {}\".format(self.use_previous.get_name())\n        else:\n            use_prev_str = \"\"\n        return use_prev_str\n\n    def generate_name(self):\n        return self.primitive.generate_name(\n            base_feature_names=[bf.get_name() for bf in self.base_features],\n            relationship_path_name=self.relationship_path_name(),\n            parent_dataframe_name=self.parent_dataframe_name,\n            where_str=self._where_str(),\n            use_prev_str=self._use_prev_str(),\n        )\n\n    def generate_names(self):\n        return self.primitive.generate_names(\n            base_feature_names=[bf.get_name() for bf in self.base_features],\n            relationship_path_name=self.relationship_path_name(),\n            parent_dataframe_name=self.parent_dataframe_name,\n            where_str=self._where_str(),\n            use_prev_str=self._use_prev_str(),\n        )\n\n    def get_arguments(self):\n        arg_dict = {\n            \"name\": self.get_name(),\n            \"base_features\": [feat.unique_name() for feat in self.base_features],\n            \"relationship_path\": [r.to_dictionary() for _, r in self.relationship_path],\n            \"primitive\": self.primitive,\n            \"where\": self.where and self.where.unique_name(),\n            \"use_previous\": self.use_previous and self.use_previous.get_arguments(),\n        }\n        if self.number_output_features 
> 1:\n            arg_dict[\"feature_names\"] = self.get_feature_names()\n        return arg_dict\n\n    def relationship_path_name(self):\n        if self._path_is_unique:\n            return self.child_dataframe_name\n        else:\n            return self.relationship_path.name\n\n\nclass TransformFeature(FeatureBase):\n    def __init__(self, base_features, primitive, name=None):\n        base_features = _validate_base_features(base_features)\n\n        for bf in base_features:\n            if bf.number_output_features > 1:\n                raise ValueError(\"Cannot stack on whole multi-output feature.\")\n        dataframe = base_features[0].entityset[base_features[0].dataframe_name]\n        super(TransformFeature, self).__init__(\n            dataframe=dataframe,\n            base_features=base_features,\n            relationship_path=RelationshipPath([]),\n            primitive=primitive,\n            name=name,\n        )\n\n    @classmethod\n    def from_dictionary(cls, arguments, entityset, dependencies, primitive):\n        base_features = [dependencies[name] for name in arguments[\"base_features\"]]\n        feat = cls(\n            base_features=base_features,\n            primitive=primitive,\n            name=arguments[\"name\"],\n        )\n        feat._names = arguments.get(\"feature_names\")\n        return feat\n\n    def copy(self):\n        return TransformFeature(self.base_features, self.primitive)\n\n    def generate_name(self):\n        return self.primitive.generate_name(\n            base_feature_names=[bf.get_name() for bf in self.base_features],\n        )\n\n    def generate_names(self):\n        return self.primitive.generate_names(\n            base_feature_names=[bf.get_name() for bf in self.base_features],\n        )\n\n    def get_arguments(self):\n        arg_dict = {\n            \"name\": self.get_name(),\n            \"base_features\": [feat.unique_name() for feat in self.base_features],\n            \"primitive\": 
self.primitive,\n        }\n        if self.number_output_features > 1:\n            arg_dict[\"feature_names\"] = self.get_feature_names()\n        return arg_dict\n\n\nclass GroupByTransformFeature(TransformFeature):\n    def __init__(self, base_features, primitive, groupby, name=None):\n        if not isinstance(groupby, FeatureBase):\n            groupby = IdentityFeature(groupby)\n        assert (\n            len({\"category\", \"foreign_key\"} - groupby.column_schema.semantic_tags) < 2\n        )\n        self.groupby = groupby\n\n        base_features = _validate_base_features(base_features)\n        base_features.append(groupby)\n\n        super(GroupByTransformFeature, self).__init__(\n            base_features=base_features,\n            primitive=primitive,\n            name=name,\n        )\n\n    @classmethod\n    def from_dictionary(cls, arguments, entityset, dependencies, primitive):\n        base_features = [dependencies[name] for name in arguments[\"base_features\"]]\n        groupby = dependencies[arguments[\"groupby\"]]\n        feat = cls(\n            base_features=base_features,\n            primitive=primitive,\n            groupby=groupby,\n            name=arguments[\"name\"],\n        )\n        feat._names = arguments.get(\"feature_names\")\n        return feat\n\n    def copy(self):\n        # the groupby feature is appended to base_features in the __init__\n        # so here we separate them again\n        return GroupByTransformFeature(\n            self.base_features[:-1],\n            self.primitive,\n            self.groupby,\n        )\n\n    def generate_name(self):\n        # exclude the groupby feature from base_names since it has a special\n        # place in the feature name\n        base_names = [bf.get_name() for bf in self.base_features[:-1]]\n        _name = self.primitive.generate_name(base_names)\n        return \"{} by {}\".format(_name, self.groupby.get_name())\n\n    def generate_names(self):\n        base_names = 
[bf.get_name() for bf in self.base_features[:-1]]\n        _names = self.primitive.generate_names(base_names)\n        names = [name + \" by {}\".format(self.groupby.get_name()) for name in _names]\n        return names\n\n    def get_arguments(self):\n        # Do not include groupby in base_features.\n        feature_names = [\n            feat.unique_name()\n            for feat in self.base_features\n            if feat.unique_name() != self.groupby.unique_name()\n        ]\n        arg_dict = {\n            \"name\": self.get_name(),\n            \"base_features\": feature_names,\n            \"primitive\": self.primitive,\n            \"groupby\": self.groupby.unique_name(),\n        }\n        if self.number_output_features > 1:\n            arg_dict[\"feature_names\"] = self.get_feature_names()\n        return arg_dict\n\n\nclass Feature(object):\n    \"\"\"\n    Alias to create feature. Infers the feature type based on init parameters.\n    \"\"\"\n\n    def __new__(\n        self,\n        base,\n        dataframe_name=None,\n        groupby=None,\n        parent_dataframe_name=None,\n        primitive=None,\n        use_previous=None,\n        where=None,\n    ):\n        # either direct or identity\n        if primitive is None and dataframe_name is None:\n            return IdentityFeature(base)\n        elif primitive is None and dataframe_name is not None:\n            return DirectFeature(base, dataframe_name)\n        elif primitive is not None and parent_dataframe_name is not None:\n            assert isinstance(primitive, AggregationPrimitive) or issubclass(\n                primitive,\n                AggregationPrimitive,\n            )\n            return AggregationFeature(\n                base,\n                parent_dataframe_name=parent_dataframe_name,\n                use_previous=use_previous,\n                where=where,\n                primitive=primitive,\n            )\n        elif primitive is not None:\n            assert 
isinstance(primitive, TransformPrimitive) or issubclass(\n                primitive,\n                TransformPrimitive,\n            )\n            if groupby is not None:\n                return GroupByTransformFeature(\n                    base,\n                    primitive=primitive,\n                    groupby=groupby,\n                )\n            return TransformFeature(base, primitive=primitive)\n\n        raise Exception(\"Unrecognized feature initialization\")\n\n\nclass FeatureOutputSlice(FeatureBase):\n    \"\"\"\n    Class to access specific multi output feature column\n    \"\"\"\n\n    def __init__(self, base_feature, n, name=None):\n        base_features = [base_feature]\n        self.num_output_parent = base_feature.number_output_features\n\n        msg = \"cannot access slice from single output feature\"\n        assert self.num_output_parent > 1, msg\n        msg = \"cannot access column that is not between 0 and \" + str(\n            self.num_output_parent - 1,\n        )\n        assert n < self.num_output_parent, msg\n\n        self.n = n\n        self._name = name\n        self._names = [name] if name else None\n        self.base_features = base_features\n        self.base_feature = base_features[0]\n\n        self.dataframe_name = base_feature.dataframe_name\n        self.entityset = base_feature.entityset\n        self.primitive = base_feature.primitive\n\n        self.relationship_path = base_feature.relationship_path\n\n    def __getitem__(self, key):\n        raise ValueError(\"Cannot get item from slice of multi output feature\")\n\n    def generate_name(self):\n        return self.base_feature.get_feature_names()[self.n]\n\n    @property\n    def number_output_features(self):\n        return 1\n\n    def get_arguments(self):\n        return {\n            \"name\": self.get_name(),\n            \"base_feature\": self.base_feature.unique_name(),\n            \"n\": self.n,\n        }\n\n    @classmethod\n    def 
from_dictionary(cls, arguments, entityset, dependencies, primitive):\n        base_feature_name = arguments[\"base_feature\"]\n        base_feature = dependencies[base_feature_name]\n        n = arguments[\"n\"]\n        name = arguments[\"name\"]\n        return cls(base_feature=base_feature, n=n, name=name)\n\n    def copy(self):\n        return FeatureOutputSlice(self.base_feature, self.n)\n\n\ndef _validate_base_features(feature):\n    if \"Series\" == type(feature).__name__:\n        return [IdentityFeature(feature)]\n    elif hasattr(feature, \"__iter__\"):\n        features = [_validate_base_features(f)[0] for f in feature]\n        msg = \"all base features must share the same dataframe\"\n        assert len(set([bf.dataframe_name for bf in features])) == 1, msg\n        return features\n    elif isinstance(feature, FeatureBase):\n        return [feature]\n    else:\n        raise Exception(\"Not a feature\")\n"
  },
  {
    "path": "featuretools/feature_base/feature_descriptions.py",
    "content": "import json\n\nimport featuretools as ft\n\n\ndef describe_feature(\n    feature,\n    feature_descriptions=None,\n    primitive_templates=None,\n    metadata_file=None,\n):\n    \"\"\"Generates an English language description of a feature.\n\n    Args:\n        feature (FeatureBase) : Feature to describe\n        feature_descriptions (dict, optional) : dictionary mapping features or unique\n            feature names to custom descriptions\n        primitive_templates (dict, optional) : dictionary mapping primitives or\n            primitive names to description templates\n        metadata_file (str, optional) : path to json metadata file\n\n    Returns:\n        str : English description of the feature\n    \"\"\"\n    feature_descriptions = feature_descriptions or {}\n    primitive_templates = primitive_templates or {}\n\n    if metadata_file:\n        file_feature_descriptions, file_primitive_templates = parse_json_metadata(\n            metadata_file,\n        )\n        feature_descriptions = {**file_feature_descriptions, **feature_descriptions}\n        primitive_templates = {**file_primitive_templates, **primitive_templates}\n\n    description = generate_description(\n        feature,\n        feature_descriptions,\n        primitive_templates,\n    )\n    return description[:1].upper() + description[1:] + \".\"\n\n\ndef generate_description(feature, feature_descriptions, primitive_templates):\n    # Check if feature has custom description\n    if feature in feature_descriptions or feature.unique_name() in feature_descriptions:\n        description = feature_descriptions.get(feature) or feature_descriptions.get(\n            feature.unique_name(),\n        )\n        return description\n\n    # Check if identity feature:\n    if isinstance(feature, ft.IdentityFeature):\n        description = feature.column_schema.description\n        if description is None:\n            description = 'the \"{}\"'.format(feature.column_name)\n        return 
description\n\n    # Handle direct features\n    if isinstance(feature, ft.DirectFeature):\n        base_feature, direct_description = get_direct_description(feature)\n        direct_base = generate_description(\n            base_feature,\n            feature_descriptions,\n            primitive_templates,\n        )\n        return direct_base + direct_description\n\n    # Get input descriptions\n    input_descriptions = []\n    input_columns = feature.base_features\n    if isinstance(feature, ft.feature_base.FeatureOutputSlice):\n        input_columns = feature.base_feature.base_features\n\n    for input_col in input_columns:\n        col_description = generate_description(\n            input_col,\n            feature_descriptions,\n            primitive_templates,\n        )\n        input_descriptions.append(col_description)\n\n    # Remove groupby description from input columns\n    groupby_description = None\n    if isinstance(feature, ft.GroupByTransformFeature):\n        groupby_description = input_descriptions.pop()\n\n    # Generate primitive description\n    template_override = None\n    if (\n        feature.primitive in primitive_templates\n        or feature.primitive.name in primitive_templates\n    ):\n        template_override = primitive_templates.get(\n            feature.primitive,\n        ) or primitive_templates.get(feature.primitive.name)\n    slice_num = feature.n if hasattr(feature, \"n\") else None\n    primitive_description = feature.primitive.get_description(\n        input_descriptions,\n        slice_num=slice_num,\n        template_override=template_override,\n    )\n    if isinstance(feature, ft.feature_base.FeatureOutputSlice):\n        feature = feature.base_feature\n\n    # Generate groupby phrase if applicable\n    groupby = \"\"\n    if isinstance(feature, ft.AggregationFeature):\n        groupby_description = get_aggregation_groupby(feature, feature_descriptions)\n    if groupby_description is not None:\n        if 
groupby_description.startswith(\"the \"):\n            groupby_description = groupby_description[4:]\n        groupby = \"for each {}\".format(groupby_description)\n\n    # Generate aggregation dataframe phrase with use_previous\n    dataframe_description = \"\"\n    if isinstance(feature, ft.AggregationFeature):\n        if feature.use_previous:\n            dataframe_description = \"of the previous {} of \".format(\n                feature.use_previous.get_name().lower(),\n            )\n        else:\n            dataframe_description = \"of all instances of \"\n        dataframe_description += '\"{}\"'.format(\n            feature.relationship_path[-1][1].child_dataframe.ww.name,\n        )\n\n    # Generate where phrase\n    where = \"\"\n    if hasattr(feature, \"where\") and feature.where:\n        where_col = generate_description(\n            feature.where.base_features[0],\n            feature_descriptions,\n            primitive_templates,\n        )\n        where = \"where {} is {}\".format(where_col, feature.where.primitive.value)\n\n    # Join all parts of template\n    description_template = [\n        primitive_description,\n        dataframe_description,\n        where,\n        groupby,\n    ]\n    description = \" \".join([phrase for phrase in description_template if phrase != \"\"])\n\n    return description\n\n\ndef get_direct_description(feature):\n    direct_description = (\n        ' the instance of \"{}\" associated with this ' 'instance of \"{}\"'.format(\n            feature.relationship_path[-1][1].parent_dataframe.ww.name,\n            feature.dataframe_name,\n        )\n    )\n    base_features = feature.base_features\n    # shortens stacked direct features to make it easier to understand\n    while isinstance(base_features[0], ft.DirectFeature):\n        base_feat = base_features[0]\n        base_feat_description = ' the instance of \"{}\" associated ' \"with\".format(\n            
base_feat.relationship_path[-1][1].parent_dataframe.ww.name,\n        )\n        direct_description = base_feat_description + direct_description\n        base_features = base_feat.base_features\n    direct_description = \" for\" + direct_description\n\n    return base_features[0], direct_description\n\n\ndef get_aggregation_groupby(feature, feature_descriptions=None):\n    if feature_descriptions is None:\n        feature_descriptions = {}\n    groupby_name = feature.dataframe.ww.index\n    groupby = ft.IdentityFeature(\n        feature.entityset[feature.dataframe_name].ww[groupby_name],\n    )\n    if groupby in feature_descriptions or groupby.unique_name() in feature_descriptions:\n        return feature_descriptions.get(groupby) or feature_descriptions.get(\n            groupby.unique_name(),\n        )\n    else:\n        return '\"{}\" in \"{}\"'.format(groupby_name, feature.dataframe_name)\n\n\ndef parse_json_metadata(file):\n    with open(file) as f:\n        json_metadata = json.load(f)\n\n    return (\n        json_metadata.get(\"feature_descriptions\", {}),\n        json_metadata.get(\"primitive_templates\", {}),\n    )\n"
  },
  {
    "path": "featuretools/feature_base/feature_visualizer.py",
    "content": "import html\n\nfrom featuretools.feature_base.feature_base import (\n    AggregationFeature,\n    DirectFeature,\n    FeatureOutputSlice,\n    IdentityFeature,\n    TransformFeature,\n)\nfrom featuretools.feature_base.feature_descriptions import describe_feature\nfrom featuretools.utils.plot_utils import (\n    check_graphviz,\n    get_graphviz_format,\n    save_graph,\n)\n\nTARGET_COLOR = \"#D9EAD3\"\nTABLE_TEMPLATE = \"\"\"<\n<TABLE BORDER=\"0\" CELLBORDER=\"1\" CELLSPACING=\"0\" CELLPADDING=\"10\">\n    <TR>\n        <TD colspan=\"1\" bgcolor=\"#A9A9A9\"><B>{dataframe_name}</B></TD>\n    </TR>{table_cols}\n</TABLE>>\"\"\"\nCOL_TEMPLATE = \"\"\"<TR><TD ALIGN=\"LEFT\" port=\"{}\">{}</TD></TR>\"\"\"\nTARGET_TEMPLATE = \"\"\"\n    <TR>\n        <TD ALIGN=\"LEFT\" port=\"{}\" BGCOLOR=\"{target_color}\">{}</TD>\n    </TR>\"\"\".format(\n    \"{}\",\n    \"{}\",\n    target_color=TARGET_COLOR,\n)\n\n\ndef graph_feature(feature, to_file=None, description=False, **kwargs):\n    \"\"\"Generates a feature lineage graph for the given feature\n\n    Args:\n        feature (FeatureBase) : Feature to generate lineage graph for\n        to_file (str, optional) : Path to where the plot should be saved.\n            If set to None (as by default), the plot will not be saved.\n        description (bool or str, optional): The feature description to use as a caption\n            for the graph. If False, no description is added. Set to True\n            to use an auto-generated description. 
Defaults to False.\n        kwargs (keywords): Additional keyword arguments to pass as keyword arguments\n            to the ft.describe_feature function.\n\n    Returns:\n        graphviz.Digraph : Graph object that can directly be displayed in Jupyter notebooks.\n    \"\"\"\n    graphviz = check_graphviz()\n    format_ = get_graphviz_format(graphviz=graphviz, to_file=to_file)\n\n    # Initialize a new directed graph\n    graph = graphviz.Digraph(\n        feature.get_name(),\n        format=format_,\n        graph_attr={\"rankdir\": \"LR\"},\n    )\n\n    dataframes = {}\n    edges = ([], [])\n    primitives = []\n    groupbys = []\n\n    _, max_depth = get_feature_data(\n        feature,\n        dataframes,\n        groupbys,\n        edges,\n        primitives,\n        layer=0,\n    )\n    dataframes[feature.dataframe_name][\"targets\"].add(feature.get_name())\n\n    for df_name in dataframes:\n        dataframe_name = (\n            \"\\u2605 {} (target)\".format(df_name)\n            if df_name == feature.dataframe_name\n            else df_name\n        )\n        dataframe_table = get_dataframe_table(dataframe_name, dataframes[df_name])\n        graph.attr(\"node\", shape=\"plaintext\")\n        graph.node(df_name, dataframe_table)\n\n    graph.attr(\"node\", shape=\"diamond\")\n    num_primitives = len(primitives)\n    for prim_name, prim_label, layer, prim_type in primitives:\n        step_num = max_depth - layer\n        if num_primitives == 1:\n            type_str = (\n                '<FONT POINT-SIZE=\"12\"><B>{}</B><BR></BR></FONT>'.format(prim_type)\n                if prim_type\n                else \"\"\n            )\n            prim_label = \"<{}{}>\".format(type_str, prim_label)\n        else:\n            step = \"Step {}\".format(step_num)\n            type_str = \"   \" + prim_type if prim_type else \"\"\n            prim_label = (\n                '<<FONT POINT-SIZE=\"12\"><B>{}:</B>{}<BR></BR></FONT>{}>'.format(\n                    
step,\n                    type_str,\n                    prim_label,\n                )\n            )\n\n        # sink first layer transform primitive if multiple primitives\n        if step_num == 1 and prim_type == \"Transform\" and num_primitives > 1:\n            with graph.subgraph() as init_transform:\n                init_transform.attr(rank=\"min\")\n                init_transform.node(name=prim_name, label=prim_label)\n        else:\n            graph.node(name=prim_name, label=prim_label)\n\n    graph.attr(\"node\", shape=\"box\")\n    for groupby_name, groupby_label in groupbys:\n        graph.node(name=groupby_name, label=groupby_label)\n\n    graph.attr(\"edge\", style=\"solid\", dir=\"forward\")\n    for edge in edges[1]:\n        graph.edge(*edge)\n\n    graph.attr(\"edge\", style=\"dotted\", arrowhead=\"none\", dir=\"forward\")\n    for edge in edges[0]:\n        graph.edge(*edge)\n\n    if description is True:\n        graph.attr(label=describe_feature(feature, **kwargs))\n    elif description is not False:\n        graph.attr(label=description)\n\n    if to_file:\n        save_graph(graph, to_file, format_)\n\n    return graph\n\n\ndef get_feature_data(feat, dataframes, groupbys, edges, primitives, layer=0):\n    # 1) add feature to dataframes tables:\n    feat_name = feat.get_name()\n    if feat.dataframe_name not in dataframes:\n        add_dataframe(feat.dataframe, dataframes)\n    dataframe_dict = dataframes[feat.dataframe_name]\n\n    # if we've already explored this feat, continue\n    feat_node = \"{}:{}\".format(feat.dataframe_name, feat_name)\n    if feat_name in dataframe_dict[\"columns\"] or feat_name in dataframe_dict[\"feats\"]:\n        return feat_node, layer\n\n    if isinstance(feat, IdentityFeature):\n        dataframe_dict[\"columns\"].add(feat_name)\n    else:\n        dataframe_dict[\"feats\"].add(feat_name)\n    base_node = feat_node\n\n    # 2) if multi-output, convert feature to generic base\n    if isinstance(feat, 
FeatureOutputSlice):\n        feat = feat.base_feature\n        feat_name = feat.get_name()\n\n    # 3) add primitive node\n    if feat.primitive.name or isinstance(feat, DirectFeature):\n        prim_name = feat.primitive.name if feat.primitive.name else \"join\"\n        prim_type = \"\"\n        if isinstance(feat, AggregationFeature):\n            prim_type = \"Aggregation\"\n        elif isinstance(feat, TransformFeature):\n            prim_type = \"Transform\"\n        primitive_node = \"{}_{}_{}\".format(layer, feat_name, prim_name)\n        primitives.append((primitive_node, prim_name.upper(), layer, prim_type))\n\n        edges[1].append([primitive_node, base_node])\n        base_node = primitive_node\n\n    # 4) add groupby/join edges and nodes\n    dependencies = [(dep.hash(), dep) for dep in feat.get_dependencies()]\n    for is_forward, r in feat.relationship_path:\n        if is_forward:\n            if r.child_dataframe.ww.name not in dataframes:\n                add_dataframe(r.child_dataframe, dataframes)\n            dataframes[r.child_dataframe.ww.name][\"columns\"].add(r._child_column_name)\n            child_node = \"{}:{}\".format(r.child_dataframe.ww.name, r._child_column_name)\n            edges[0].append([base_node, child_node])\n        else:\n            if r.child_dataframe.ww.name not in dataframes:\n                add_dataframe(r.child_dataframe, dataframes)\n            dataframes[r.child_dataframe.ww.name][\"columns\"].add(r._child_column_name)\n            child_node = \"{}:{}\".format(r.child_dataframe.ww.name, r._child_column_name)\n            child_name = child_node.replace(\":\", \"--\")\n            groupby_node = \"{}_groupby_{}\".format(feat_name, child_name)\n            groupby_name = \"group by\\n{}\".format(r._child_column_name)\n            groupbys.append((groupby_node, groupby_name))\n            edges[0].append([child_node, groupby_node])\n            edges[1].append([groupby_node, base_node])\n            base_node 
= groupby_node\n\n    if hasattr(feat, \"groupby\"):\n        groupby = feat.groupby\n        _ = get_feature_data(\n            groupby,\n            dataframes,\n            groupbys,\n            edges,\n            primitives,\n            layer + 1,\n        )\n        dependencies.remove((groupby.hash(), groupby))\n\n        groupby_name = groupby.get_name()\n        if isinstance(groupby, IdentityFeature):\n            dataframes[groupby.dataframe_name][\"columns\"].add(groupby_name)\n        else:\n            dataframes[groupby.dataframe_name][\"feats\"].add(groupby_name)\n\n        child_node = \"{}:{}\".format(groupby.dataframe_name, groupby_name)\n        child_name = child_node.replace(\":\", \"--\")\n        groupby_node = \"{}_groupby_{}\".format(feat_name, child_name)\n        groupby_name = \"group by\\n{}\".format(groupby_name)\n        groupbys.append((groupby_node, groupby_name))\n        edges[0].append([child_node, groupby_node])\n        edges[1].append([groupby_node, base_node])\n        base_node = groupby_node\n\n    # 5) recurse over dependents\n    max_depth = layer\n    for _, f in dependencies:\n        dependent_node, depth = get_feature_data(\n            f,\n            dataframes,\n            groupbys,\n            edges,\n            primitives,\n            layer + 1,\n        )\n        edges[1].append([dependent_node, base_node])\n\n        max_depth = max(depth, max_depth)\n\n    return feat_node, max_depth\n\n\ndef add_dataframe(dataframe, dataframe_dict):\n    dataframe_dict[dataframe.ww.name] = {\n        \"index\": dataframe.ww.index,\n        \"targets\": set(),\n        \"columns\": set(),\n        \"feats\": set(),\n    }\n\n\ndef get_dataframe_table(dataframe_name, dataframe_dict):\n    \"\"\"\n    given a dict of columns and feats, construct the html table for it\n    \"\"\"\n    index = dataframe_dict[\"index\"]\n    targets = dataframe_dict[\"targets\"]\n    columns = 
dataframe_dict[\"columns\"].difference(targets)\n    feats = dataframe_dict[\"feats\"].difference(targets)\n\n    # If the index is used, make sure it's the first element in the table\n    clean_index = html.escape(index)\n    if index in columns:\n        rows = [COL_TEMPLATE.format(clean_index, clean_index + \" (index)\")]\n        columns.discard(index)\n    elif index in targets:\n        rows = [TARGET_TEMPLATE.format(clean_index, clean_index + \" (index)\")]\n        targets.discard(index)\n    else:\n        rows = []\n\n    for col in list(columns) + list(feats) + list(targets):\n        template = COL_TEMPLATE\n        if col in targets:\n            template = TARGET_TEMPLATE\n\n        col = html.escape(col)\n        rows.append(template.format(col, col))\n\n    table = TABLE_TEMPLATE.format(\n        dataframe_name=dataframe_name,\n        table_cols=\"\\n\".join(rows),\n    )\n    return table\n"
  },
  {
    "path": "featuretools/feature_base/features_deserializer.py",
    "content": "import json\n\nfrom featuretools.entityset.deserialize import (\n    description_to_entityset as deserialize_es,\n)\nfrom featuretools.feature_base.feature_base import (\n    AggregationFeature,\n    DirectFeature,\n    Feature,\n    FeatureBase,\n    FeatureOutputSlice,\n    GroupByTransformFeature,\n    IdentityFeature,\n    TransformFeature,\n)\nfrom featuretools.primitives.utils import PrimitivesDeserializer\nfrom featuretools.utils.s3_utils import get_transport_params, use_smartopen_features\nfrom featuretools.utils.schema_utils import check_schema_version\nfrom featuretools.utils.wrangle import _is_s3, _is_url\n\n\ndef load_features(features, profile_name=None):\n    \"\"\"Loads the features from a filepath, S3 path, URL, an open file, or a JSON formatted string.\n\n    Args:\n        features (str or :class:`.FileObject`): The file location of saved features.\n        This must either be the name of the file, a JSON formatted string, or a readable file handle.\n\n        profile_name (str, bool): The AWS profile specified to write to S3. Will default to None and search for AWS credentials.\n            Set to False to use an anonymous profile.\n\n    Returns:\n        features (list[:class:`.FeatureBase`]): Feature definitions list.\n\n    Note:\n        Features saved in one version of Featuretools or Python are not guaranteed to work in another.\n        After upgrading Featuretools or Python, features may need to be generated again.\n\n    Example:\n        .. ipython:: python\n            :suppress:\n\n            import featuretools as ft\n            import os\n\n        .. 
code-block:: python\n\n            # Option 1\n            filepath = os.path.join('/Home/features/', 'list.json')\n            features = ft.load_features(filepath)\n\n            # Option 2\n            filepath = os.path.join('/Home/features/', 'list.json')\n            with open(filepath, 'r') as f:\n                features = ft.load_features(f)\n\n            # Option 3\n            filepath = os.path.join('/Home/features/', 'list.json')\n            with open(filepath, 'r') as f:\n                feature_str = f.read()\n            features = ft.load_features(feature_str)\n\n\n    .. seealso::\n        :func:`.save_features`\n    \"\"\"\n    return FeaturesDeserializer.load(features, profile_name).to_list()\n\n\nclass FeaturesDeserializer(object):\n    FEATURE_CLASSES = {\n        \"AggregationFeature\": AggregationFeature,\n        \"DirectFeature\": DirectFeature,\n        \"Feature\": Feature,\n        \"FeatureBase\": FeatureBase,\n        \"GroupByTransformFeature\": GroupByTransformFeature,\n        \"IdentityFeature\": IdentityFeature,\n        \"TransformFeature\": TransformFeature,\n        \"FeatureOutputSlice\": FeatureOutputSlice,\n    }\n\n    def __init__(self, features_dict):\n        self.features_dict = features_dict\n        self._check_schema_version()\n        self.entityset = deserialize_es(features_dict[\"entityset\"])\n        self._deserialized_features = {}  # name -> feature\n        primitive_deserializer = PrimitivesDeserializer()\n        primitive_definitions = features_dict[\"primitive_definitions\"]\n        self._deserialized_primitives = {\n            k: primitive_deserializer.deserialize_primitive(v)\n            for k, v in primitive_definitions.items()\n        }\n\n    @classmethod\n    def load(cls, features, profile_name):\n        if isinstance(features, str):\n            try:\n                features_dict = json.loads(features)\n            except ValueError:\n                if _is_url(features) or 
_is_s3(features):\n                    transport_params = None\n                    if _is_s3(features):\n                        transport_params = get_transport_params(profile_name)\n                    features_dict = use_smartopen_features(\n                        features,\n                        transport_params=transport_params,\n                    )\n                else:\n                    with open(features, \"r\") as f:\n                        features_dict = json.load(f)\n            return cls(features_dict)\n        return cls(json.load(features))\n\n    def to_list(self):\n        feature_names = self.features_dict[\"feature_list\"]\n        return [self._deserialize_feature(name) for name in feature_names]\n\n    def _deserialize_feature(self, feature_name):\n        if feature_name in self._deserialized_features:\n            return self._deserialized_features[feature_name]\n\n        feature_dict = self.features_dict[\"feature_definitions\"][feature_name]\n        dependencies_list = feature_dict[\"dependencies\"]\n        primitive = None\n        primitive_id = feature_dict[\"arguments\"].get(\"primitive\")\n        if primitive_id is not None:\n            primitive = self._deserialized_primitives[primitive_id]\n\n        # Collect dependencies into a dictionary of name -> feature.\n        dependencies = {\n            dependency: self._deserialize_feature(dependency)\n            for dependency in dependencies_list\n        }\n\n        type = feature_dict[\"type\"]\n        cls = self.FEATURE_CLASSES.get(type)\n        if not cls:\n            raise RuntimeError('Unrecognized feature type \"%s\"' % type)\n\n        args = feature_dict[\"arguments\"]\n        feature = cls.from_dictionary(args, self.entityset, dependencies, primitive)\n\n        self._deserialized_features[feature_name] = feature\n        return feature\n\n    def _check_schema_version(self):\n        check_schema_version(self, \"features\")\n"
  },
  {
    "path": "featuretools/feature_base/features_serializer.py",
    "content": "import json\n\nfrom featuretools.primitives.utils import serialize_primitive\nfrom featuretools.utils.s3_utils import get_transport_params, use_smartopen_features\nfrom featuretools.utils.wrangle import _is_s3, _is_url\nfrom featuretools.version import FEATURES_SCHEMA_VERSION\nfrom featuretools.version import __version__ as ft_version\n\n\ndef save_features(features, location=None, profile_name=None):\n    \"\"\"Saves the features list as JSON to a specified filepath/S3 path, writes to an open file, or\n    returns the serialized features as a JSON string. If no file provided, returns a string.\n\n    Args:\n        features (list[:class:`.FeatureBase`]): List of Feature definitions.\n\n        location (str or :class:`.FileObject`, optional): The location of where to save\n            the features list which must include the name of the file,\n            or a writeable file handle to write to. If location is None, will return a JSON string\n            of the serialized features.\n            Default: None\n\n        profile_name (str, bool): The AWS profile specified to write to S3. Will default to None and search for AWS credentials.\n                                    Set to False to use an anonymous profile.\n\n    Note:\n        Features saved in one version of Featuretools are not guaranteed to work in another.\n        After upgrading Featuretools, features may need to be generated again.\n\n    Example:\n        .. ipython:: python\n            :suppress:\n\n            from featuretools.tests.testing_utils import (\n                make_ecommerce_entityset)\n            import featuretools as ft\n            es = make_ecommerce_entityset()\n            import os\n\n        .. 
code-block:: python\n\n            f1 = ft.Feature(es[\"log\"].ww[\"product_id\"])\n            f2 = ft.Feature(es[\"log\"].ww[\"purchased\"])\n            f3 = ft.Feature(es[\"log\"].ww[\"value\"])\n\n            features = [f1, f2, f3]\n\n            # Option 1\n            filepath = os.path.join('/Home/features/', 'list.json')\n            ft.save_features(features, filepath)\n\n            # Option 2\n            filepath = os.path.join('/Home/features/', 'list.json')\n            with open(filepath, 'w') as f:\n                ft.save_features(features, f)\n\n            # Option 3\n            features_string = ft.save_features(features)\n\n    .. seealso::\n        :func:`.load_features`\n    \"\"\"\n    return FeaturesSerializer(features).save(location, profile_name=profile_name)\n\n\nclass FeaturesSerializer(object):\n    def __init__(self, feature_list):\n        self.feature_list = feature_list\n        self._features_dict = None\n\n    def to_dict(self):\n        names_list = [feat.unique_name() for feat in self.feature_list]\n        es = self.feature_list[0].entityset\n\n        feature_defs, primitive_defs = self._feature_definitions()\n\n        return {\n            \"schema_version\": FEATURES_SCHEMA_VERSION,\n            \"ft_version\": ft_version,\n            \"entityset\": es.to_dictionary(),\n            \"feature_list\": names_list,\n            \"feature_definitions\": feature_defs,\n            \"primitive_definitions\": primitive_defs,\n        }\n\n    def save(self, location, profile_name):\n        features_dict = self.to_dict()\n        if location is None:\n            return json.dumps(features_dict)\n        if isinstance(location, str):\n            if _is_url(location):\n                raise ValueError(\"Writing to URLs is not supported\")\n            if _is_s3(location):\n                transport_params = get_transport_params(profile_name)\n                use_smartopen_features(\n                    location,\n                
    features_dict,\n                    transport_params,\n                    read=False,\n                )\n            else:\n                with open(location, \"w\") as f:\n                    json.dump(features_dict, f)\n        else:\n            json.dump(features_dict, location)\n\n    def _feature_definitions(self):\n        if not self._features_dict:\n            self._features_dict = {}\n            self._primitives_dict = {}\n\n            for feature in self.feature_list:\n                self._serialize_feature(feature)\n\n            primitive_number = 0\n            primitive_id_to_key = {}\n            for name, feature in self._features_dict.items():\n                primitive = feature[\"arguments\"].get(\"primitive\")\n                if primitive:\n                    primitive_id = id(primitive)\n                    if primitive_id not in primitive_id_to_key.keys():\n                        # Primitive we haven't seen before, add to dict and increment primitive_id counter\n                        # Always use string for keys because json conversion results in integer dict keys\n                        # being converted to strings, but integer dict values are not.\n                        primitives_dict_key = str(primitive_number)\n                        primitive_id_to_key[primitive_id] = primitives_dict_key\n                        self._primitives_dict[primitives_dict_key] = (\n                            serialize_primitive(primitive)\n                        )\n                        self._features_dict[name][\"arguments\"][\"primitive\"] = (\n                            primitives_dict_key\n                        )\n                        primitive_number += 1\n                    else:\n                        # Primitive we have seen already - use existing primitive_id key\n                        key = primitive_id_to_key[primitive_id]\n                        self._features_dict[name][\"arguments\"][\"primitive\"] = key\n\n   
     return self._features_dict, self._primitives_dict\n\n    def _serialize_feature(self, feature):\n        name = feature.unique_name()\n\n        if name not in self._features_dict:\n            self._features_dict[feature.unique_name()] = feature.to_dictionary()\n\n            for dependency in feature.get_dependencies(deep=True):\n                name = dependency.unique_name()\n                if name not in self._features_dict:\n                    self._features_dict[name] = dependency.to_dictionary()\n"
  },
  {
    "path": "featuretools/feature_base/utils.py",
    "content": "def is_valid_input(candidate, template):\n    \"\"\"Checks if a candidate schema should be considered a match for a template schema\"\"\"\n    if template.logical_type is not None and not isinstance(\n        candidate.logical_type,\n        type(template.logical_type),\n    ):\n        return False\n    if len(template.semantic_tags - candidate.semantic_tags):\n        return False\n    return True\n"
  },
  {
    "path": "featuretools/feature_discovery/FeatureCollection.py",
    "content": "from __future__ import annotations\n\nimport hashlib\nfrom itertools import combinations\nfrom typing import Any, Dict, List, Optional, Set, Type, Union, cast\n\nfrom woodwork.logical_types import LogicalType\n\nfrom featuretools.feature_discovery.LiteFeature import LiteFeature\nfrom featuretools.feature_discovery.type_defs import ANY\nfrom featuretools.feature_discovery.utils import hash_primitive, logical_types_map\nfrom featuretools.primitives.base.primitive_base import PrimitiveBase\nfrom featuretools.primitives.utils import (\n    PrimitivesDeserializer,\n)\n\n\nclass FeatureCollection:\n    def __init__(self, features: List[LiteFeature]):\n        self._all_features: List[LiteFeature] = features\n        self.indexed = False\n        self.sorted = False\n        self._hash_key: Optional[str] = None\n\n    def sort_features(self):\n        if not self.sorted:\n            self._all_features = sorted(self._all_features)\n            self.sorted = True\n\n    def __repr__(self):\n        return f\"<FeatureCollection ({self.hash_key[:5]}) n_features={len(self._all_features)} indexed={self.indexed}>\"\n\n    @property\n    def all_features(self):\n        return self._all_features.copy()\n\n    @property\n    def hash_key(self) -> str:\n        if self._hash_key is None:\n            if not self.sorted:\n                self.sort_features()\n            self._set_hash()\n        assert self._hash_key is not None\n        return self._hash_key\n\n    def _set_hash(self):\n        hash_msg = hashlib.sha256()\n\n        for feature in self._all_features:\n            hash_msg.update(feature.id.encode(\"utf-8\"))\n\n        self._hash_key = hash_msg.hexdigest()\n        return self\n\n    def __hash__(self):\n        return hash(self.hash_key)\n\n    def __eq__(self, other: FeatureCollection) -> bool:\n        return self.hash_key == other.hash_key\n\n    def reindex(self) -> FeatureCollection:\n        self.by_logical_type: Dict[\n            
Union[Type[LogicalType], None],\n            Set[LiteFeature],\n        ] = {}\n        self.by_tag: Dict[str, Set[LiteFeature]] = {}\n        self.by_origin_feature: Dict[LiteFeature, Set[LiteFeature]] = {}\n        self.by_depth: Dict[int, Set[LiteFeature]] = {}\n        self.by_name: Dict[str, LiteFeature] = {}\n        self.by_key: Dict[str, List[LiteFeature]] = {}\n\n        for feature in self._all_features:\n            for key in self.feature_to_keys(feature):\n                self.by_key.setdefault(key, []).append(feature)\n\n            logical_type = feature.logical_type\n            self.by_logical_type.setdefault(logical_type, set()).add(feature)\n\n            tags = feature.tags\n            for tag in tags:\n                self.by_tag.setdefault(tag, set()).add(feature)\n\n            origin_features = feature.get_origin_features()\n            for origin_feature in origin_features:\n                self.by_origin_feature.setdefault(origin_feature, set()).add(feature)\n\n            if feature.depth == 0:\n                self.by_origin_feature.setdefault(feature, set()).add(feature)\n\n            feature_name = feature.name\n            assert feature_name is not None\n            assert feature_name not in self.by_name\n\n            self.by_name[feature_name] = feature\n\n        self.indexed = True\n\n        return self\n\n    def get_by_logical_type(self, logical_type: Type[LogicalType]) -> Set[LiteFeature]:\n        return self.by_logical_type.get(logical_type, set())\n\n    def get_by_tag(self, tag: str) -> Set[LiteFeature]:\n        return self.by_tag.get(tag, set())\n\n    def get_by_origin_feature(self, origin_feature: LiteFeature) -> Set[LiteFeature]:\n        return self.by_origin_feature.get(origin_feature, set())\n\n    def get_by_origin_feature_name(self, name: str) -> Union[LiteFeature, None]:\n        feature = self.by_name.get(name)\n        return feature\n\n    def get_dependencies_by_origin_name(self, name) -> 
Set[LiteFeature]:\n        origin_feature = self.by_name.get(name)\n        if origin_feature:\n            return self.by_origin_feature[origin_feature]\n        return set()\n\n    def get_by_key(self, key: str) -> List[LiteFeature]:\n        return self.by_key.get(key, [])\n\n    def flatten_features(self) -> Dict[str, LiteFeature]:\n        all_features_dict: Dict[str, LiteFeature] = {}\n\n        def rfunc(feature_list: List[LiteFeature]):\n            for feature in feature_list:\n                all_features_dict.setdefault(feature.id, feature)\n                rfunc(feature.base_features)\n\n        rfunc(self._all_features)\n        return all_features_dict\n\n    def flatten_primitives(self) -> Dict[str, Dict[str, Any]]:\n        all_primitives_dict: Dict[str, Dict[str, Any]] = {}\n\n        def rfunc(feature_list: List[LiteFeature]):\n            for feature in feature_list:\n                if feature.primitive:\n                    key, prim_dict = hash_primitive(feature.primitive)\n                    all_primitives_dict.setdefault(key, prim_dict)\n                rfunc(feature.base_features)\n\n        rfunc(self._all_features)\n        return all_primitives_dict\n\n    def to_dict(self):\n        all_primitives_dict = self.flatten_primitives()\n        all_features_dict = self.flatten_features()\n\n        return {\n            \"primitives\": all_primitives_dict,\n            \"feature_ids\": [f.id for f in self._all_features],\n            \"all_features\": {k: f.to_dict() for k, f in all_features_dict.items()},\n        }\n\n    @staticmethod\n    def feature_to_keys(feature: LiteFeature) -> List[str]:\n        \"\"\"\n        Generate hashing keys from LiteFeature. 
For example:\n        - LiteFeature(\"f1\", Double, {\"numeric\"}) -> ['Double', 'numeric', 'Double,numeric', 'ANY']\n        - LiteFeature(\"f1\", Datetime, {\"time_index\"}) -> ['Datetime', 'time_index', 'Datetime,time_index', 'ANY']\n        - LiteFeature(\"f1\", Double, {\"index\", \"other\"}) -> ['Double', 'index', 'Double,index', 'other', 'Double,other', 'index,other', 'Double,index,other', 'ANY']\n\n        Args:\n            feature (LiteFeature):\n\n        Returns:\n            List[str]\n                List of hashing keys\n        \"\"\"\n        keys: List[str] = []\n        logical_type = feature.logical_type\n        logical_type_name = None\n        if logical_type is not None:\n            logical_type_name = logical_type.__name__\n            keys.append(logical_type_name)\n\n        all_tags = sorted(feature.tags)\n\n        tag_combinations = []\n\n        # generate combinations of all lengths from 1 to the number of tags\n        for i in range(1, len(all_tags) + 1):\n            # generate combinations of length i and append them to tag_combinations\n            for comb in combinations(all_tags, i):\n                tag_combinations.append(list(comb))\n\n        for tag_combination in tag_combinations:\n            tags_key = \",\".join(tag_combination)\n            keys.append(tags_key)\n            if logical_type_name:\n                keys.append(f\"{logical_type_name},{tags_key}\")\n\n        keys.append(ANY)\n        return keys\n\n    @staticmethod\n    def from_dict(input_dict):\n        primitive_deserializer = PrimitivesDeserializer()\n\n        primitives = {}\n        for prim_key, prim_dict in input_dict[\"primitives\"].items():\n            primitive = primitive_deserializer.deserialize_primitive(\n                prim_dict,\n            )\n            assert isinstance(primitive, PrimitiveBase)\n            primitives[prim_key] = primitive\n\n        hydrated_features: Dict[str, LiteFeature] = {}\n\n        feature_ids: List[str] = 
cast(List[str], input_dict[\"feature_ids\"])\n        all_features: Dict[str, Any] = cast(Dict[str, Any], input_dict[\"all_features\"])\n\n        def hydrate_feature(feature_id: str) -> LiteFeature:\n            if feature_id in hydrated_features:\n                return hydrated_features[feature_id]\n\n            feature_dict = all_features[feature_id]\n            base_features = [hydrate_feature(x) for x in feature_dict[\"base_features\"]]\n\n            logical_type = (\n                logical_types_map[feature_dict[\"logical_type\"]]\n                if feature_dict[\"logical_type\"]\n                else None\n            )\n\n            hydrated_feature = LiteFeature(\n                name=feature_dict[\"name\"],\n                logical_type=logical_type,\n                tags=set(feature_dict[\"tags\"]),\n                primitive=primitives[feature_dict[\"primitive\"]]\n                if feature_dict[\"primitive\"]\n                else None,\n                base_features=base_features,\n                df_id=feature_dict[\"df_id\"],\n                related_features=set(),\n                idx=feature_dict[\"idx\"],\n            )\n\n            assert hydrated_feature.id == feature_dict[\"id\"] == feature_id\n            hydrated_features[feature_id] = hydrated_feature\n\n            # need to link after features are stored on cache\n            related_features = [\n                hydrate_feature(x) for x in feature_dict[\"related_features\"]\n            ]\n            hydrated_feature.related_features = set(related_features)\n\n            return hydrated_feature\n\n        return FeatureCollection([hydrate_feature(x) for x in feature_ids])\n"
  },
  {
    "path": "featuretools/feature_discovery/LiteFeature.py",
    "content": "from __future__ import annotations\n\nimport hashlib\nfrom dataclasses import field\nfrom functools import total_ordering\nfrom typing import Any, Dict, List, Optional, Set, Type, Union\n\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import LogicalType\n\nfrom featuretools.feature_discovery.utils import (\n    get_primitive_return_type,\n    hash_primitive,\n)\nfrom featuretools.primitives.base.primitive_base import PrimitiveBase\n\n\n@total_ordering\nclass LiteFeature:\n    _name: Optional[str] = None\n    _alias: Optional[str] = None\n\n    _logical_type: Optional[Type[LogicalType]] = None\n    _tags: Set[str] = field(default_factory=set)\n    _primitive: Optional[PrimitiveBase] = None\n    _base_features: List[LiteFeature] = field(default_factory=list)\n    _df_id: Optional[str] = None\n\n    _id: str\n    _n_output_features: int = 1\n\n    _depth = 0\n    _related_features: Set[LiteFeature]\n    _idx: int = 0\n\n    def __init__(\n        self,\n        name: Optional[str] = None,\n        logical_type: Optional[Type[LogicalType]] = None,\n        tags: Optional[Set[str]] = None,\n        primitive: Optional[PrimitiveBase] = None,\n        base_features: Optional[List[LiteFeature]] = None,\n        df_id: Optional[str] = None,\n        related_features: Optional[Set[LiteFeature]] = None,\n        idx: Optional[int] = None,\n    ):\n        self._logical_type = logical_type\n        self._tags = tags if tags else set()\n        self._primitive = primitive\n        self._base_features = base_features if base_features else []\n        self._df_id = df_id\n        self._idx = idx if idx is not None else 0\n        self._related_features = related_features if related_features else set()\n\n        if self._primitive:\n            if not isinstance(self._primitive, PrimitiveBase):\n                raise ValueError(\"primitive input must be of type PrimitiveBase\")\n\n            if len(self.base_features) == 0:\n        
        raise ValueError(\"there must be base features if given a primitive\")\n\n            if self._primitive.commutative:\n                self._base_features = sorted(self._base_features)\n\n            self._n_output_features = self._primitive.number_output_features\n            self._depth = max([x.depth for x in self.base_features]) + 1\n\n            if name:\n                self._alias = name\n\n            self._name = self._primitive.generate_name(\n                [x.name for x in self.base_features],\n            )\n\n            return_column_schema = get_primitive_return_type(self._primitive)\n            self._logical_type = (\n                type(return_column_schema.logical_type)\n                if return_column_schema.logical_type\n                else None\n            )\n\n            self._tags = return_column_schema.semantic_tags\n\n        else:\n            if name is None:\n                raise TypeError(\"Name must be given if origin feature\")\n\n            if self._logical_type is None:\n                raise TypeError(\"Logical Type must be given if origin feature\")\n\n            self._name = name\n\n        if self._logical_type is not None and \"index\" not in self._tags:\n            self._tags = self._tags | self._logical_type.standard_tags\n\n        self._id = self._generate_hash()\n\n    @property\n    def name(self):\n        if self._alias:\n            return self._alias\n        elif self.is_multioutput():\n            return f\"{self._name}[{self.idx}]\"\n        return self._name\n\n    @name.setter\n    def name(self, _):\n        raise AttributeError(\"name is immutable\")\n\n    def set_alias(self, value: Union[str, None]):\n        self._alias = value\n\n    @property\n    def non_indexed_name(self):\n        if not self.is_multioutput():\n            raise ValueError(\"only used on multioutput features\")\n        return self._name\n\n    @property\n    def logical_type(self):\n        return 
self._logical_type\n\n    @logical_type.setter\n    def logical_type(self, _):\n        raise AttributeError(\"logical_type is immutable\")\n\n    @property\n    def tags(self):\n        return self._tags.copy()\n\n    @tags.setter\n    def tags(self, _):\n        raise AttributeError(\"tags is immutable\")\n\n    @property\n    def primitive(self):\n        return self._primitive\n\n    @primitive.setter\n    def primitive(self, _):\n        raise AttributeError(\"primitive is immutable\")\n\n    @property\n    def base_features(self):\n        return self._base_features\n\n    @base_features.setter\n    def base_features(self, _):\n        raise AttributeError(\"base_features are immutable\")\n\n    @property\n    def df_id(self):\n        return self._df_id\n\n    @df_id.setter\n    def df_id(self, _):\n        raise AttributeError(\"df_id is immutable\")\n\n    @property\n    def id(self):\n        return self._id\n\n    @id.setter\n    def id(self, _):\n        raise AttributeError(\"id is immutable\")\n\n    @property\n    def n_output_features(self):\n        return self._n_output_features\n\n    @n_output_features.setter\n    def n_output_features(self, _):\n        raise AttributeError(\"n_output_features is immutable\")\n\n    @property\n    def depth(self):\n        return self._depth\n\n    @depth.setter\n    def depth(self, _):\n        raise AttributeError(\"depth is immutable\")\n\n    @property\n    def related_features(self):\n        return self._related_features.copy()\n\n    @related_features.setter\n    def related_features(self, value: Set[LiteFeature]):\n        self._related_features = value\n\n    @property\n    def idx(self):\n        return self._idx\n\n    @idx.setter\n    def idx(self, _):\n        raise AttributeError(\"idx is immutable\")\n\n    @staticmethod\n    def hash(\n        name: Optional[str],\n        primitive: Optional[PrimitiveBase] = None,\n        base_features: List[LiteFeature] = [],\n        df_id: Optional[str] = 
None,\n        idx: int = 0,\n    ):\n        hash_msg = hashlib.sha256()\n\n        if primitive:\n            # TODO: hashing should be on primitive\n            hash_msg.update(hash_primitive(primitive)[0].encode(\"utf-8\"))\n            commutative = primitive.commutative\n            assert (\n                len(base_features) > 0\n            ), \"there must be base features if given a primitive\"\n            # sort a copy so the caller's list is not mutated\n            base_columns = sorted(base_features) if commutative else base_features\n\n            for c in base_columns:\n                hash_msg.update(c.id.encode(\"utf-8\"))\n\n        else:\n            assert name\n            hash_msg.update(name.encode(\"utf-8\"))\n            if df_id:\n                hash_msg.update(df_id.encode(\"utf-8\"))\n\n        hash_msg.update(str(idx).encode(\"utf-8\"))\n\n        return hash_msg.hexdigest()\n\n    def __eq__(self, other: LiteFeature):\n        return self._id == other._id\n\n    def __lt__(self, other: LiteFeature):\n        return self._id < other._id\n\n    def __ne__(self, other):\n        return self._id != other._id\n\n    def __hash__(self):\n        return hash(self._id)\n\n    def _generate_hash(self) -> str:\n        return self.hash(\n            name=self._name,\n            primitive=self._primitive,\n            base_features=self._base_features,\n            df_id=self._df_id,\n            idx=self._idx,\n        )\n\n    def get_primitive_name(self) -> Union[str, None]:\n        return self._primitive.name if self._primitive else None\n\n    def get_dependencies(self, deep=False) -> List[LiteFeature]:\n        flattened_dependencies = []\n        for f in self._base_features:\n            flattened_dependencies.append(f)\n\n            if deep:\n                dependencies = f.get_dependencies()\n                if isinstance(dependencies, list):\n                    flattened_dependencies.extend(dependencies)\n                else:\n                    
flattened_dependencies.append(dependencies)\n        return flattened_dependencies\n\n    def get_origin_features(self) -> List[LiteFeature]:\n        all_dependencies = self.get_dependencies(deep=True)\n        return [f for f in all_dependencies if f._depth == 0]\n\n    @property\n    def column_schema(self) -> ColumnSchema:\n        return ColumnSchema(logical_type=self.logical_type, semantic_tags=self.tags)\n\n    def dependent_primitives(self) -> Set[Type[PrimitiveBase]]:\n        dependent_features = self.get_dependencies(deep=True)\n        dependent_primitives = {\n            type(f._primitive) for f in dependent_features if f._primitive\n        }\n        if self._primitive:\n            dependent_primitives.add(type(self._primitive))\n        return dependent_primitives\n\n    def to_dict(self) -> Dict[str, Any]:\n        return {\n            \"name\": self.name,\n            \"logical_type\": self.logical_type.__name__ if self.logical_type else None,\n            \"tags\": list(self.tags),\n            \"primitive\": hash_primitive(self.primitive)[0] if self.primitive else None,\n            \"base_features\": [x.id for x in self.base_features],\n            \"df_id\": self.df_id,\n            \"id\": self.id,\n            \"related_features\": [x.id for x in self.related_features],\n            \"idx\": self.idx,\n        }\n\n    def is_multioutput(self) -> bool:\n        return len(self._related_features) > 0\n\n    def copy(self) -> LiteFeature:\n        copied_feature = LiteFeature(\n            name=self._name,\n            logical_type=self._logical_type,\n            tags=self._tags.copy(),\n            primitive=self._primitive,\n            base_features=[f.copy() for f in self._base_features],\n            df_id=self._df_id,\n            idx=self._idx,\n            related_features=self._related_features.copy(),\n        )\n\n        copied_feature.set_alias(self._alias)\n\n        return copied_feature\n\n    def __repr__(self) -> str:\n   
     name = f\"name='{self.name}'\"\n        logical_type = f\"logical_type={self.logical_type}\"\n        tags = f\"tags={self.tags}\"\n        primitive = f\"primitive={self.get_primitive_name()}\"\n        return f\"LiteFeature({name}, {logical_type}, {tags}, {primitive})\"\n"
  },
  {
    "path": "featuretools/feature_discovery/__init__.py",
    "content": ""
  },
  {
    "path": "featuretools/feature_discovery/convertors.py",
    "content": "from __future__ import annotations\n\nfrom typing import Dict, List\n\nimport pandas as pd\nfrom woodwork.logical_types import LogicalType\n\nfrom featuretools.feature_base.feature_base import (\n    FeatureBase,\n    IdentityFeature,\n    TransformFeature,\n)\nfrom featuretools.feature_discovery.LiteFeature import LiteFeature\nfrom featuretools.primitives import TransformPrimitive\nfrom featuretools.primitives.base.primitive_base import PrimitiveBase\n\nFeatureCache = Dict[str, FeatureBase]\n\n\ndef convert_featurebase_list_to_feature_list(\n    featurebase_list: List[FeatureBase],\n) -> List[LiteFeature]:\n    \"\"\"\n    Convert a List of FeatureBase objects to a list LiteFeature objects\n\n    Args:\n        featurebase_list (List[FeatureBase]):\n\n    Returns:\n       LiteFeatures (List[LiteFeature]) - converted LiteFeature objects\n    \"\"\"\n\n    def rfunc(fb: FeatureBase) -> List[LiteFeature]:\n        base_features = [\n            feature\n            for feature_list in [rfunc(x) for x in fb.base_features]\n            for feature in feature_list\n        ]\n        col_schema = fb.column_schema\n\n        logical_type = col_schema.logical_type\n        if logical_type is not None:\n            assert issubclass(type(logical_type), LogicalType)\n            logical_type = type(logical_type)\n\n        tags = col_schema.semantic_tags\n\n        if isinstance(fb, IdentityFeature):\n            primitive = None\n        else:\n            primitive = fb.primitive\n            assert isinstance(primitive, PrimitiveBase)\n\n        if fb.number_output_features > 1:\n            features: List[LiteFeature] = []\n\n            for idx, name in enumerate(fb.get_feature_names()):\n                f = LiteFeature(\n                    name=name,\n                    logical_type=logical_type,\n                    tags=tags,\n                    primitive=primitive,\n                    base_features=base_features,\n                    # TODO: use 
when working with multi-table\n                    df_id=None,\n                    idx=idx,\n                )\n                features.append(f)\n\n            for feature in features:\n                related_features = [f for f in features if f.id != feature.id]\n                feature.related_features = set(related_features)\n\n            return features\n\n        return [\n            LiteFeature(\n                name=fb.get_name(),\n                logical_type=logical_type,\n                tags=tags,\n                primitive=primitive,\n                base_features=base_features,\n                # TODO: use when working with multi-table\n                df_id=None,\n            ),\n        ]\n\n    return [\n        feature\n        for feature_list in [rfunc(fb) for fb in featurebase_list]\n        for feature in feature_list\n    ]\n\n\ndef _feature_to_transform_feature(\n    feature: LiteFeature,\n    base_features: List[FeatureBase],\n) -> FeatureBase:\n    \"\"\"\n    Transform LiteFeature into FeatureBase object. 
Handles multi-output\n    features correctly.\n\n    Args:\n        feature (LiteFeature)\n        base_features (List[FeatureBase])\n\n    Returns:\n       FeatureBase\n    \"\"\"\n    assert feature.primitive\n\n    assert isinstance(\n        feature.primitive,\n        TransformPrimitive,\n    ), \"Only Transform Primitives\"\n\n    fb = TransformFeature(base_features, feature.primitive)\n    if feature.is_multioutput():\n        sorted_features = sorted(\n            [f for f in feature.related_features] + [feature],\n            key=lambda x: x.idx,\n        )\n        names = [x.name for x in sorted_features]\n\n        fb = fb.rename(feature.non_indexed_name)\n        fb.set_feature_names(names)\n    else:\n        fb = fb.rename(feature.name)\n\n    return fb\n\n\ndef _convert_feature_to_featurebase(\n    feature: LiteFeature,\n    dataframe: pd.DataFrame,\n    cache: FeatureCache,\n) -> FeatureBase:\n    \"\"\"\n    Recursively transforms a LiteFeature object into a FeatureBase object\n\n    Args:\n        feature (LiteFeature)\n        dataframe (pd.DataFrame)\n        cache (FeatureCache): already converted features, keyed by feature id\n\n    Returns:\n       FeatureBase\n    \"\"\"\n\n    def get_base_features(\n        feature: LiteFeature,\n    ) -> List[FeatureBase]:\n        new_base_features: List[FeatureBase] = []\n        for bf in feature.base_features:\n            fb = rfunc(bf)\n            if bf.is_multioutput():\n                idx = bf.idx\n                # if it's multioutput, you can index on the FeatureBase\n                new_base_features.append(fb[idx])\n            else:\n                new_base_features.append(fb)\n\n        return new_base_features\n\n    def rfunc(feature: LiteFeature) -> FeatureBase:\n        # if feature has already been converted, return from cache\n        if feature.id in cache:\n            return cache[feature.id]\n\n        # if depth is 0, we are at an origin feature\n        if feature.depth == 0:\n     
       fb = IdentityFeature(dataframe.ww[feature.name])\n            cache[feature.id] = fb\n            return fb\n\n        base_features = get_base_features(feature)\n\n        fb = _feature_to_transform_feature(feature, base_features)\n        cache[feature.id] = fb\n        return fb\n\n    return rfunc(feature)\n\n\ndef convert_feature_list_to_featurebase_list(\n    feature_list: List[LiteFeature],\n    dataframe: pd.DataFrame,\n) -> List[FeatureBase]:\n    \"\"\"\n    Convert a list of LiteFeature objects into a list of FeatureBase objects\n\n    Args:\n        feature_list (List[LiteFeature])\n        dataframe (pd.DataFrame)\n\n    Returns:\n       List[FeatureBase]\n    \"\"\"\n    feature_cache: FeatureCache = {}\n\n    converted_features: List[FeatureBase] = []\n    for feature in feature_list:\n        if feature.is_multioutput():\n            related_feature_ids = [f.id for f in feature.related_features]\n            if any((x in feature_cache for x in related_feature_ids)):\n                # feature base already created for related ids\n                continue\n\n        fb = _convert_feature_to_featurebase(\n            feature=feature,\n            dataframe=dataframe,\n            cache=feature_cache,\n        )\n        converted_features.append(fb)\n\n    return converted_features\n"
  },
  {
    "path": "featuretools/feature_discovery/feature_discovery.py",
    "content": "import inspect\nfrom collections import defaultdict\nfrom itertools import combinations, permutations, product\nfrom typing import Iterable, List, Set, Tuple, Type, Union, cast\n\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import LogicalType\nfrom woodwork.table_schema import TableSchema\n\nfrom featuretools.feature_discovery.FeatureCollection import FeatureCollection\nfrom featuretools.feature_discovery.LiteFeature import LiteFeature\nfrom featuretools.feature_discovery.utils import column_schema_to_keys, flatten_list\nfrom featuretools.primitives.base.primitive_base import PrimitiveBase\n\n\ndef _index_column_set(column_set: List[ColumnSchema]) -> List[Tuple[str, int]]:\n    \"\"\"\n    Indexes input set to find types of columns and the quantity of each\n\n    Args:\n        column_set (List(ColumnSchema)):\n            List of Column types needed by associated primitive.\n\n    Returns:\n        List[Tuple[str, int]]\n            A list of key, count tuples\n\n    Examples:\n        .. 
code-block:: python\n\n            from featuretools.feature_discovery.feature_discovery import _index_column_set\n            from woodwork.column_schema import ColumnSchema\n\n            column_set = [ColumnSchema(semantic_tags={\"numeric\"}), ColumnSchema(semantic_tags={\"numeric\"})]\n            indexed_column_set = _index_column_set(column_set)\n            [(\"numeric\", 2)]\n    \"\"\"\n    out = defaultdict(int)\n    for column_schema in column_set:\n        key = column_schema_to_keys(column_schema)\n        out[key] += 1\n    return list(out.items())\n\n\ndef _get_features(\n    feature_collection: FeatureCollection,\n    column_keys: Tuple[Tuple[str, int]],\n    commutative: bool,\n) -> List[List[LiteFeature]]:\n    \"\"\"\n    Calculates all LiteFeature combinations using the given hashmap of existing features, and the input set of required columns.\n\n    Args:\n        feature_collection (FeatureCollection):\n            An indexed feature collection object for efficient querying of features\n        column_keys (Tuple[Tuple[str, int]]):\n            (key, count) pairs describing the column types needed by the associated primitive.\n        commutative (bool):\n            if True, use combinations of the matching features; otherwise use permutations.\n\n    Returns:\n        List[List[LiteFeature]]\n            A list of LiteFeature sets.\n\n    Examples:\n        .. 
code-block:: python\n\n            from featuretools.feature_discovery.feature_discovery import _get_features, _index_column_set\n            from woodwork.column_schema import ColumnSchema\n\n            # illustrative stand-in for an indexed FeatureCollection\n            feature_groups = {\n                \"ANY\": [\"f1\", \"f2\", \"f3\"],\n                \"Double\": [\"f1\", \"f2\", \"f3\"],\n                \"numeric\": [\"f1\", \"f2\", \"f3\"],\n                \"Double,numeric\": [\"f1\", \"f2\", \"f3\"],\n            }\n            column_set = [ColumnSchema(semantic_tags={\"numeric\"}), ColumnSchema(semantic_tags={\"numeric\"})]\n            column_keys = tuple(_index_column_set(column_set))\n            features = _get_features(feature_groups, column_keys, commutative=False)\n    \"\"\"\n\n    prod_iter = []\n    for key, count in column_keys:\n        relevant_features = list(feature_collection.get_by_key(key))\n\n        if commutative:\n            prod_iter.append(combinations(relevant_features, count))\n        else:\n            prod_iter.append(permutations(relevant_features, count))\n\n    feature_combinations = product(*prod_iter)\n\n    return [flatten_list(x) for x in feature_combinations]\n\n\ndef _primitive_to_columnsets(primitive: PrimitiveBase) -> List[List[ColumnSchema]]:\n    column_sets = primitive.input_types\n    assert column_sets is not None\n    if not isinstance(column_sets[0], list):\n        column_sets = [primitive.input_types]\n\n    column_sets = cast(List[List[ColumnSchema]], column_sets)\n\n    # Some primitives are commutative, yet have explicit versions of commutative pairs (e.g. 
MultiplyNumericBoolean),\n    # which would create duplicate feature sets, so deduplicate the column sets here.\n    if primitive.commutative:\n        existing = set()\n        uniq_column_sets = []\n        for column_set in column_sets:\n            key = \"_\".join(sorted([repr(x) for x in column_set]))\n            if key not in existing:\n                uniq_column_sets.append(column_set)\n                existing.add(key)\n\n        column_sets = uniq_column_sets\n\n    return column_sets\n\n\ndef _get_matching_features(\n    feature_collection: FeatureCollection,\n    primitive: PrimitiveBase,\n) -> List[List[LiteFeature]]:\n    \"\"\"\n    For a given primitive, find all feature sets that can be used to create a new feature.\n\n    Args:\n        feature_collection (FeatureCollection):\n            An indexed feature collection object for efficient querying of features\n        primitive (PrimitiveBase)\n\n    Returns:\n        List[List[LiteFeature]]\n            List of feature sets\n\n    Examples:\n        .. 
code-block:: python\n\n            from featuretools.feature_discovery.feature_discovery import _get_matching_features\n            from woodwork.column_schema import ColumnSchema\n\n            feature_groups = {\n                \"ANY\": [\"f1\", \"f2\", \"f3\"],\n                \"Double\": [\"f1\", \"f2\", \"f3\"],\n                \"numeric\": [\"f1\", \"f2\", \"f3\"],\n                \"Double,numeric\": [\"f1\", \"f2\", \"f3\"],\n            }\n\n            feature_sets = _get_matching_features(feature_groups, AddNumeric)\n\n            [\n                [\"f1\", \"f2\"],\n                [\"f1\", \"f3\"],\n                [\"f2\", \"f3\"]\n            ]\n    \"\"\"\n    column_sets = _primitive_to_columnsets(primitive=primitive)\n\n    column_keys_set = [_index_column_set(c) for c in column_sets]\n\n    commutative = primitive.commutative\n\n    feature_sets = []\n    for column_keys in column_keys_set:\n        assert column_keys is not None\n        feature_sets_ = _get_features(\n            feature_collection=feature_collection,\n            column_keys=tuple(column_keys),\n            commutative=commutative,\n        )\n\n        feature_sets.extend(feature_sets_)\n\n    return feature_sets\n\n\ndef _features_from_primitive(\n    primitive: PrimitiveBase,\n    feature_collection: FeatureCollection,\n) -> List[LiteFeature]:\n    \"\"\"\n    For a given primitive, creates all engineered features\n\n    Args:\n        primitive (PrimitiveBase)\n        feature_collection (FeatureCollection):\n            An indexed feature collection object for efficient querying of features\n\n    Returns:\n        List[LiteFeature]\n            List of engineered features\n\n    Examples:\n        .. 
code-block:: python\n\n            from featuretools.feature_discovery.feature_discovery import _features_from_primitive\n            from woodwork.column_schema import ColumnSchema\n\n            feature_groups = {\n                \"ANY\": [\"f1\", \"f2\", \"f3\"],\n                \"Double\": [\"f1\", \"f2\", \"f3\"],\n                \"numeric\": [\"f1\", \"f2\", \"f3\"],\n                \"Double,numeric\": [\"f1\", \"f2\", \"f3\"],\n            }\n\n            feature_sets = _features_from_primitive(AddNumeric(), feature_groups)\n\n            [\n                [\"f1\", \"f2\"],\n                [\"f1\", \"f3\"],\n                [\"f2\", \"f3\"]\n            ]\n    \"\"\"\n    assert isinstance(primitive, PrimitiveBase)\n\n    features: List[LiteFeature] = []\n    feature_sets = _get_matching_features(\n        feature_collection=feature_collection,\n        primitive=primitive,\n    )\n    for feature_set in feature_sets:\n        if primitive.number_output_features > 1:\n            related_features: Set[LiteFeature] = set()\n            for n in range(primitive.number_output_features):\n                feature = LiteFeature(\n                    primitive=primitive,\n                    base_features=feature_set,\n                    idx=n,\n                )\n\n                related_features.add(feature)\n\n            for f in related_features:\n                f.related_features = related_features - {f}\n                features.append(f)\n        else:\n            features.append(\n                LiteFeature(\n                    primitive=primitive,\n                    base_features=feature_set,\n                ),\n            )\n    return features\n\n\ndef schema_to_features(schema: TableSchema) -> List[LiteFeature]:\n    \"\"\"\n    ** EXPERIMENTAL **\n    Convert a Woodwork Schema object to a list of LiteFeatures.\n\n    Args:\n        schema (TableSchema):\n            Woodwork TableSchema object\n\n    Returns:\n        List[LiteFeature]\n\n  
  Examples:\n        .. code-block:: python\n\n            from featuretools.feature_discovery.feature_discovery import schema_to_features\n            from featuretools.primitives import Absolute, IsNull\n            import pandas as pd\n            import woodwork as ww\n\n            df = pd.DataFrame({\n                \"idx\": [0,1,2,3],\n                \"f1\": [\"A\", \"B\", \"C\", \"D\"],\n                \"f2\": [1.2, 2.3, 3.4, 4.5]\n            })\n\n            df.ww.init()\n\n            features = schema_to_features(df.ww.schema)\n\n    \"\"\"\n    features = []\n    for col_name, column_schema in schema.columns.items():\n        assert isinstance(column_schema, ColumnSchema)\n\n        logical_type = column_schema.logical_type\n        assert logical_type\n        assert issubclass(type(logical_type), LogicalType)\n\n        tags = column_schema.semantic_tags\n        assert isinstance(tags, set)\n\n        features.append(\n            LiteFeature(\n                name=col_name,\n                logical_type=type(logical_type),\n                tags=tags,\n            ),\n        )\n\n    return features\n\n\ndef _check_inputs(\n    input_features: Iterable[LiteFeature],\n    primitives: Union[List[Type[PrimitiveBase]], List[PrimitiveBase]],\n) -> Tuple[Iterable[LiteFeature], List[PrimitiveBase]]:\n    if not isinstance(input_features, Iterable):\n        raise ValueError(\"input_features must be an iterable of LiteFeature objects\")\n\n    for feature in input_features:\n        if not isinstance(feature, LiteFeature):\n            raise ValueError(\n                \"input_features must be an iterable of LiteFeature objects\",\n            )\n\n    if not isinstance(primitives, List):\n        raise ValueError(\n            \"primitives must be a list of Primitive classes or Primitive instances\",\n        )\n\n    primitive_instances: List[PrimitiveBase] = []\n    for primitive in primitives:\n        if inspect.isclass(primitive) and 
issubclass(primitive, PrimitiveBase):\n            primitive_instances.append(primitive())\n        elif isinstance(primitive, PrimitiveBase):\n            primitive_instances.append(primitive)\n        else:\n            raise ValueError(\n                \"primitives must be a list of Primitive classes or Primitive instances\",\n            )\n\n    return (input_features, primitive_instances)\n\n\ndef generate_features_from_primitives(\n    input_features: Iterable[LiteFeature],\n    primitives: Union[List[Type[PrimitiveBase]], List[PrimitiveBase]],\n) -> List[LiteFeature]:\n    \"\"\"\n    ** EXPERIMENTAL **\n    Calculates all Features for a given input of features and a list of primitives.\n\n    Args:\n        input_features (Iterable[LiteFeature]):\n            Iterable of input (origin) features\n        primitives (Union[List[Type[PrimitiveBase]], List[PrimitiveBase]]):\n            List of primitive classes or primitive instances\n\n    Returns:\n        List[LiteFeature]\n\n    Examples:\n        .. code-block:: python\n\n            from featuretools.feature_discovery.feature_discovery import generate_features_from_primitives\n            from featuretools.feature_discovery.feature_discovery import schema_to_features\n            from featuretools.primitives import Absolute, IsNull\n            import pandas as pd\n            import woodwork as ww\n\n            df = pd.DataFrame({\n                \"idx\": [0,1,2,3],\n                \"f1\": [\"A\", \"B\", \"C\", \"D\"],\n                \"f2\": [1.2, 2.3, 3.4, 4.5]\n            })\n\n            df.ww.init()\n            origin_features = schema_to_features(df.ww.schema)\n            features = generate_features_from_primitives(origin_features, [Absolute, IsNull])\n\n    \"\"\"\n\n    (input_features, primitives) = _check_inputs(input_features, primitives)\n\n    features = [x.copy() for x in input_features]\n\n    feature_collection = FeatureCollection(features=features)\n    feature_collection.reindex()\n\n    for primitive in primitives:\n        features_ = _features_from_primitive(\n            primitive=primitive,\n            feature_collection=feature_collection,\n        )\n        
features.extend(features_)\n\n    return features\n"
  },
  {
    "path": "featuretools/feature_discovery/type_defs.py",
    "content": "ANY = \"ANY\"\n"
  },
  {
    "path": "featuretools/feature_discovery/utils.py",
    "content": "import hashlib\nimport json\nfrom functools import lru_cache\nfrom typing import Any, Dict, Tuple\n\nfrom woodwork.column_schema import ColumnSchema\n\nfrom featuretools.feature_discovery.type_defs import ANY\nfrom featuretools.primitives.base.primitive_base import PrimitiveBase\nfrom featuretools.primitives.utils import (\n    get_all_logical_type_names,\n    get_all_primitives,\n    serialize_primitive,\n)\n\nprimitives_map = get_all_primitives()\nlogical_types_map = get_all_logical_type_names()\n\n\ndef column_schema_to_keys(column_schema: ColumnSchema) -> str:\n    \"\"\"\n    Generate a hashing key from a Columns Schema. For example:\n    - ColumnSchema(logical_type=Double) -> \"Double\"\n    - ColumnSchema(semantic_tags={\"index\"}) -> \"index\"\n    - ColumnSchema(logical_type=Double, semantic_tags={\"index\", \"other\"}) -> \"Double,index,other\"\n\n    Args:\n        column_schema (ColumnSchema):\n\n    Returns:\n        str: hashing key\n    \"\"\"\n    logical_type = column_schema.logical_type\n    tags = column_schema.semantic_tags\n    lt_key = None\n    if logical_type:\n        lt_key = type(logical_type).__name__\n\n    tags = sorted(tags)\n    if len(tags) > 0:\n        tag_key = \",\".join(tags)\n        return f\"{lt_key},{tag_key}\" if lt_key is not None else tag_key\n\n    elif lt_key is not None:\n        return lt_key\n    else:\n        return ANY\n\n\n@lru_cache(maxsize=None)\ndef hash_primitive(primitive: PrimitiveBase) -> Tuple[str, Dict[str, Any]]:\n    hash_msg = hashlib.sha256()\n    primitive_name = primitive.name\n    assert isinstance(primitive_name, str)\n    primitive_dict = serialize_primitive(primitive)\n    primitive_json = json.dumps(primitive_dict).encode(\"utf-8\")\n    hash_msg.update(primitive_json)\n    key = hash_msg.hexdigest()\n    return (key, primitive_dict)\n\n\ndef get_primitive_return_type(primitive: PrimitiveBase) -> ColumnSchema:\n    \"\"\"\n    Get Return type from a primitive\n\n    Args:\n    
    primitive (PrimitiveBase)\n\n    Returns:\n        ColumnSchema\n    \"\"\"\n    if primitive.return_type:\n        return primitive.return_type\n    return_type = primitive.input_types[0]\n    if isinstance(return_type, list):\n        return_type = return_type[0]\n    return return_type\n\n\ndef flatten_list(nested_list):\n    return [item for sublist in nested_list for item in sublist]\n"
  },
  {
    "path": "featuretools/primitives/__init__.py",
    "content": "# flake8: noqa\nimport inspect\nimport logging\nimport traceback\n\nimport pkg_resources\n\nfrom featuretools.primitives.standard import *\nfrom featuretools.primitives.utils import (\n    get_aggregation_primitives,\n    get_default_aggregation_primitives,\n    get_default_transform_primitives,\n    get_transform_primitives,\n    list_primitives,\n    summarize_primitives,\n)\n\n\ndef _load_primitives():\n    \"\"\"Load in a list of primitives registered by other libraries into Featuretools.\n\n    Example entry_points definition for a library using this entry point either in:\n\n        - setup.py:\n\n            setup(\n                entry_points={\n                    'featuretools_primitives': [\n                        'other_library = other_library',\n                    ],\n                },\n            )\n\n        - setup.cfg:\n\n            [options.entry_points]\n            featuretools_primitives =\n                other_library = other_library\n\n        - pyproject.toml:\n\n            [project.entry-points.\"featuretools_primitives\"]\n            other_library = \"other_library\"\n\n    where `other_library` is a top-level module containing all the primitives.\n    \"\"\"\n    logger = logging.getLogger(\"featuretools\")\n    base_primitives = AggregationPrimitive, TransformPrimitive  # noqa: F405\n\n    for entry_point in pkg_resources.iter_entry_points(\"featuretools_primitives\"):\n        try:\n            loaded = entry_point.load()\n        except Exception:\n            message = f'Featuretools failed to load \"{entry_point.name}\" primitives from \"{entry_point.module_name}\". 
'\n            message += \"For a full stack trace, set logging to debug.\"\n            logger.warning(message)\n            logger.debug(traceback.format_exc())\n            continue\n\n        for key in dir(loaded):\n            primitive = getattr(loaded, key, None)\n\n            if (\n                inspect.isclass(primitive)\n                and issubclass(primitive, base_primitives)\n                and primitive not in base_primitives\n            ):\n                name = primitive.__name__\n                scope = globals()\n\n                if name in scope:\n                    this_module, that_module = (\n                        primitive.__module__,\n                        scope[name].__module__,\n                    )\n                    message = f'While loading primitives via \"{entry_point.name}\" entry point, '\n                    message += (\n                        f'ignored primitive \"{name}\" from \"{this_module}\" because '\n                    )\n                    message += (\n                        f'a primitive with that name already exists in \"{that_module}\"'\n                    )\n                    logger.warning(message)\n                else:\n                    scope[name] = primitive\n\n\n_load_primitives()\n"
  },
  {
    "path": "featuretools/primitives/base/__init__.py",
    "content": "from featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive\nfrom featuretools.primitives.base.primitive_base import PrimitiveBase\nfrom featuretools.primitives.base.transform_primitive_base import TransformPrimitive\n"
  },
  {
    "path": "featuretools/primitives/base/aggregation_primitive_base.py",
    "content": "from featuretools.primitives.base.primitive_base import PrimitiveBase\n\n\nclass AggregationPrimitive(PrimitiveBase):\n    def generate_name(\n        self,\n        base_feature_names,\n        relationship_path_name,\n        parent_dataframe_name,\n        where_str,\n        use_prev_str,\n    ):\n        base_features_str = \", \".join(base_feature_names)\n        return \"%s(%s.%s%s%s%s)\" % (\n            self.name.upper(),\n            relationship_path_name,\n            base_features_str,\n            where_str,\n            use_prev_str,\n            self.get_args_string(),\n        )\n\n    def generate_names(\n        self,\n        base_feature_names,\n        relationship_path_name,\n        parent_dataframe_name,\n        where_str,\n        use_prev_str,\n    ):\n        n = self.number_output_features\n        base_name = self.generate_name(\n            base_feature_names,\n            relationship_path_name,\n            parent_dataframe_name,\n            where_str,\n            use_prev_str,\n        )\n        return [base_name + \"[%s]\" % i for i in range(n)]\n"
  },
  {
    "path": "featuretools/primitives/base/primitive_base.py",
    "content": "import os\nfrom inspect import signature\n\nimport numpy as np\nimport pandas as pd\n\nfrom featuretools import config\nfrom featuretools.utils.description_utils import convert_to_nth\n\n\nclass PrimitiveBase(object):\n    \"\"\"Base class for all primitives.\"\"\"\n\n    #: (str): Name of the primitive\n    name = None\n    #: (list): woodwork.ColumnSchema types of inputs\n    input_types = None\n    #: (woodwork.ColumnSchema): ColumnSchema type of return\n    return_type = None\n    #: Default value this feature returns if no data found. Defaults to np.nan\n    default_value = np.nan\n    #: (bool): True if feature needs to know what the current calculation time\n    # is (provided to computational backend as \"time_last\")\n    uses_calc_time = False\n    #: (int): Maximum number of features in the largest chain proceeding\n    # downward from this feature's base features.\n    max_stack_depth = None\n    #: (int): Number of columns in feature matrix associated with this feature\n    number_output_features = 1\n    # whitelist of primitives can have this primitive in input_types\n    base_of = None\n    # blacklist of primitives can have this primitive in input_types\n    base_of_exclude = None\n    # whitelist of primitives that can be in input_types\n    stack_on = None\n    # blacklist of primitives that can be in signature\n    stack_on_exclude = None\n    # determines if primitive can be in input_types for self\n    stack_on_self = True\n    # (bool) If True will only make one feature per unique set of base features\n    commutative = False\n    #: (str, list[str]): description template of the primitive. Input column\n    # descriptions are passed as positional arguments to the template. Slice\n    # number (if present) in \"nth\" form is passed to the template via the\n    # `nth_slice` keyword argument. 
Multi-output primitives can use a list to\n    # differentiate between the base description and a slice description.\n    description_template = None\n\n    def __init__(self):\n        pass\n\n    def __call__(self, *args, **kwargs):\n        series_args = [pd.Series(arg) for arg in args]\n        try:\n            return self._method(*series_args, **kwargs)\n        except AttributeError:\n            self._method = self.get_function()\n            return self._method(*series_args, **kwargs)\n\n    def __lt__(self, other):\n        return (self.name + self.get_args_string()) < (\n            other.name + other.get_args_string()\n        )\n\n    def generate_name(self):\n        raise NotImplementedError(\"Subclass must implement\")\n\n    def generate_names(self):\n        raise NotImplementedError(\"Subclass must implement\")\n\n    def get_function(self):\n        raise NotImplementedError(\"Subclass must implement\")\n\n    def get_filepath(self, filename):\n        return os.path.join(config.get(\"primitive_data_folder\"), filename)\n\n    def get_args_string(self):\n        strings = []\n        for name, value in self.get_arguments():\n            # format arg to string\n            string = \"{}={}\".format(name, str(value))\n            strings.append(string)\n\n        if len(strings) == 0:\n            return \"\"\n\n        string = \", \".join(strings)\n        string = \", \" + string\n        return string\n\n    def get_arguments(self):\n        values = []\n\n        args = signature(self.__class__).parameters.items()\n        for name, arg in args:\n            # assert that arg is attribute of primitive\n            error = '\"{}\" must be attribute of {}'\n            assert hasattr(self, name), error.format(name, self.__class__.__name__)\n\n            value = getattr(self, name)\n            # check if args are the same type\n            if isinstance(value, type(arg.default)):\n                # skip if default value\n                if 
arg.default == value:\n                    continue\n\n            values.append((name, value))\n\n        return values\n\n    def get_description(\n        self,\n        input_column_descriptions,\n        slice_num=None,\n        template_override=None,\n    ):\n        template = template_override or self.description_template\n        if template:\n            if isinstance(template, list):\n                if slice_num is not None:\n                    slice_index = slice_num + 1\n                    if slice_index < len(template):\n                        return template[slice_index].format(\n                            *input_column_descriptions,\n                            nth_slice=convert_to_nth(slice_index),\n                        )\n                    else:\n                        if len(template) > 2:\n                            raise IndexError(\"Slice out of range of template\")\n                        return template[1].format(\n                            *input_column_descriptions,\n                            nth_slice=convert_to_nth(slice_index),\n                        )\n                else:\n                    template = template[0]\n            return template.format(*input_column_descriptions)\n\n        # generic case:\n        name = self.name.upper() if self.name is not None else type(self).__name__\n        if slice_num is not None:\n            nth_slice = convert_to_nth(slice_num + 1)\n            description = \"the {} output from applying {} to {}\".format(\n                nth_slice,\n                name,\n                \", \".join(input_column_descriptions),\n            )\n        else:\n            description = \"the result of applying {} to {}\".format(\n                name,\n                \", \".join(input_column_descriptions),\n            )\n        return description\n\n    @staticmethod\n    def flatten_nested_input_types(input_types):\n        \"\"\"Flattens nested column schema inputs into a single 
list.\"\"\"\n        if isinstance(input_types[0], list):\n            input_types = [\n                sub_input for input_obj in input_types for sub_input in input_obj\n            ]\n        return input_types\n"
  },
  {
    "path": "featuretools/primitives/base/transform_primitive_base.py",
    "content": "from featuretools.primitives.base.primitive_base import PrimitiveBase\n\n\nclass TransformPrimitive(PrimitiveBase):\n    \"\"\"Feature for dataframe that is a based off one or more other features\n    in that dataframe.\"\"\"\n\n    # (bool) If True, feature function depends on all values of dataframe\n    #   (and will receive these values as input, regardless of specified instance ids)\n    uses_full_dataframe = False\n\n    def generate_name(self, base_feature_names):\n        return \"%s(%s%s)\" % (\n            self.name.upper(),\n            \", \".join(base_feature_names),\n            self.get_args_string(),\n        )\n\n    def generate_names(self, base_feature_names):\n        n = self.number_output_features\n        base_name = self.generate_name(base_feature_names)\n        return [base_name + \"[%s]\" % i for i in range(n)]\n"
  },
  {
    "path": "featuretools/primitives/options_utils.py",
    "content": "import logging\nimport warnings\nfrom itertools import permutations\n\nfrom featuretools import primitives\nfrom featuretools.feature_base import IdentityFeature\n\nlogger = logging.getLogger(\"featuretools\")\n\n\ndef _get_primitive_options():\n    # all possible option keys: function that verifies value type\n    return {\n        \"ignore_dataframes\": list_dataframe_check,\n        \"include_dataframes\": list_dataframe_check,\n        \"ignore_columns\": dict_to_list_column_check,\n        \"include_columns\": dict_to_list_column_check,\n        \"ignore_groupby_dataframes\": list_dataframe_check,\n        \"include_groupby_dataframes\": list_dataframe_check,\n        \"ignore_groupby_columns\": dict_to_list_column_check,\n        \"include_groupby_columns\": dict_to_list_column_check,\n    }\n\n\ndef dict_to_list_column_check(option, es):\n    if not (\n        isinstance(option, dict)\n        and all([isinstance(option_val, list) for option_val in option.values()])\n    ):\n        return False\n    else:\n        for dataframe, columns in option.items():\n            if dataframe not in es:\n                warnings.warn(\"Dataframe '%s' not in entityset\" % (dataframe))\n            else:\n                for invalid_col in [\n                    column for column in columns if column not in es[dataframe]\n                ]:\n                    warnings.warn(\n                        \"Column '%s' not in dataframe '%s'\" % (invalid_col, dataframe),\n                    )\n        return True\n\n\ndef list_dataframe_check(option, es):\n    if not isinstance(option, list):\n        return False\n    else:\n        for invalid_dataframe in [\n            dataframe for dataframe in option if dataframe not in es\n        ]:\n            warnings.warn(\"Dataframe '%s' not in entityset\" % (invalid_dataframe))\n        return True\n\n\ndef generate_all_primitive_options(\n    all_primitives,\n    primitive_options,\n    ignore_dataframes,\n    
ignore_columns,\n    es,\n):\n    dataframe_dict = {\n        dataframe.ww.name: [col for col in dataframe.columns]\n        for dataframe in es.dataframes\n    }\n\n    primitive_options = _init_primitive_options(primitive_options, dataframe_dict)\n    global_ignore_dataframes = ignore_dataframes\n    global_ignore_columns = ignore_columns.copy()\n    # for now, only use primitive names as option keys\n    for primitive in all_primitives:\n        if primitive in primitive_options and primitive.name in primitive_options:\n            msg = (\n                \"Options present for primitive instance and generic \"\n                \"primitive class (%s), primitive instance will not use generic \"\n                \"options\" % (primitive.name)\n            )\n            warnings.warn(msg)\n        if primitive in primitive_options or primitive.name in primitive_options:\n            options = primitive_options.get(\n                primitive,\n                primitive_options.get(primitive.name),\n            )\n            # Reconcile global options with individually-specified options\n            included_dataframes = set().union(\n                *[\n                    option.get(\"include_dataframes\", set()).union(\n                        option.get(\"include_columns\", {}).keys(),\n                    )\n                    for option in options\n                ]\n            )\n            global_ignore_dataframes = global_ignore_dataframes.difference(\n                included_dataframes,\n            )\n            for option in options:\n                # don't globally ignore a column if it's included for a primitive\n                if \"include_columns\" in option:\n                    for dataframe, include_cols in option[\"include_columns\"].items():\n                        global_ignore_columns[dataframe] = global_ignore_columns[\n                            dataframe\n                        ].difference(include_cols)\n                
option[\"ignore_dataframes\"] = option[\"ignore_dataframes\"].union(\n                    ignore_dataframes.difference(included_dataframes),\n                )\n            for dataframe, ignore_cols in ignore_columns.items():\n                # if already ignoring columns for this dataframe, add globals\n                for option in options:\n                    if dataframe in option[\"ignore_columns\"]:\n                        option[\"ignore_columns\"][dataframe] = option[\"ignore_columns\"][\n                            dataframe\n                        ].union(ignore_cols)\n                    # if no ignore_columns and dataframe is explicitly included, don't ignore the column\n                    elif dataframe in included_dataframes:\n                        continue\n                    # Otherwise, keep the global option\n                    else:\n                        option[\"ignore_columns\"][dataframe] = ignore_cols\n        else:\n            # no user specified options, just use global defaults\n            primitive_options[primitive] = [\n                {\n                    \"ignore_dataframes\": ignore_dataframes,\n                    \"ignore_columns\": ignore_columns,\n                },\n            ]\n    return primitive_options, global_ignore_dataframes, global_ignore_columns\n\n\ndef _init_primitive_options(primitive_options, es):\n    # Flatten all tuple keys, convert value lists into sets, check for\n    # conflicting keys\n    flattened_options = {}\n    for primitive_keys, options in primitive_options.items():\n        if not isinstance(primitive_keys, tuple):\n            primitive_keys = (primitive_keys,)\n        if isinstance(options, list):\n            for primitive_key in primitive_keys:\n                if isinstance(primitive_key, str):\n                    primitive = primitives.get_aggregation_primitives().get(\n                        primitive_key,\n                    ) or 
primitives.get_transform_primitives().get(primitive_key)\n                    if not primitive:\n                        msg = \"Unknown primitive with name '{}'\".format(primitive_key)\n                        raise ValueError(msg)\n                else:\n                    primitive = primitive_key\n                assert (\n                    len(primitive.input_types[0]) == len(options)\n                    if isinstance(primitive.input_types[0], list)\n                    else len(primitive.input_types) == len(options)\n                ), (\n                    \"Number of options does not match number of inputs for primitive %s\"\n                    % (primitive_key)\n                )\n            options = [\n                _init_option_dict(primitive_keys, option, es) for option in options\n            ]\n        else:\n            options = [_init_option_dict(primitive_keys, options, es)]\n\n        for primitive in primitive_keys:\n            if isinstance(primitive, type):\n                primitive = primitive.name\n\n            # if primitive is specified more than once, raise error\n            if primitive in flattened_options:\n                raise KeyError(\"Multiple options found for primitive %s\" % (primitive))\n\n            flattened_options[primitive] = options\n    return flattened_options\n\n\ndef _init_option_dict(key, option_dict, es):\n    initialized_option_dict = {}\n    primitive_options = _get_primitive_options()\n    # verify all keys are valid and match expected type, convert lists to sets\n    for option_key, option in option_dict.items():\n        if option_key not in primitive_options:\n            raise KeyError(\n                \"Unrecognized primitive option '%s' for %s\"\n                % (option_key, \",\".join(key)),\n            )\n        if not primitive_options[option_key](option, es):\n            raise TypeError(\n                \"Incorrect type formatting for '%s' for %s\"\n                % (option_key, 
\",\".join(key)),\n            )\n        if isinstance(option, list):\n            initialized_option_dict[option_key] = set(option)\n        elif isinstance(option, dict):\n            initialized_option_dict[option_key] = {\n                key: set(option[key]) for key in option\n            }\n    # initialize ignore_dataframes and ignore_columns to empty sets if not present\n    if \"ignore_columns\" not in initialized_option_dict:\n        initialized_option_dict[\"ignore_columns\"] = dict()\n    if \"ignore_dataframes\" not in initialized_option_dict:\n        initialized_option_dict[\"ignore_dataframes\"] = set()\n    return initialized_option_dict\n\n\ndef column_filter(f, options, groupby=False):\n    if groupby and not f.column_schema.semantic_tags.intersection(\n        {\"category\", \"foreign_key\"},\n    ):\n        return False\n    include_cols = \"include_groupby_columns\" if groupby else \"include_columns\"\n    ignore_cols = \"ignore_groupby_columns\" if groupby else \"ignore_columns\"\n    include_dataframes = (\n        \"include_groupby_dataframes\" if groupby else \"include_dataframes\"\n    )\n    ignore_dataframes = \"ignore_groupby_dataframes\" if groupby else \"ignore_dataframes\"\n\n    dependencies = f.get_dependencies(deep=True) + [f]\n    for base_f in dependencies:\n        if isinstance(base_f, IdentityFeature):\n            if (\n                include_cols in options\n                and base_f.dataframe_name in options[include_cols]\n            ):\n                if base_f.get_name() in options[include_cols][base_f.dataframe_name]:\n                    continue  # this is a valid feature, go to next\n                else:\n                    return False  # this is not an included feature\n            if ignore_cols in options and base_f.dataframe_name in options[ignore_cols]:\n                if base_f.get_name() in options[ignore_cols][base_f.dataframe_name]:\n                    return False  # ignore this feature\n      
  if include_dataframes in options:\n            return base_f.dataframe_name in options[include_dataframes]\n        elif (\n            ignore_dataframes in options\n            and base_f.dataframe_name in options[ignore_dataframes]\n        ):\n            return False  # ignore the dataframe\n    return True\n\n\ndef ignore_dataframe_for_primitive(options, dataframe, groupby=False):\n    # This logic handles whether given options ignore an dataframe or not\n    def should_ignore_dataframe(option):\n        if groupby:\n            if (\n                \"include_groupby_columns\" not in option\n                or dataframe.ww.name not in option[\"include_groupby_columns\"]\n            ):\n                if (\n                    \"include_groupby_dataframes\" in option\n                    and dataframe.ww.name not in option[\"include_groupby_dataframes\"]\n                ):\n                    return True\n                elif (\n                    \"ignore_groupby_dataframes\" in option\n                    and dataframe.ww.name in option[\"ignore_groupby_dataframes\"]\n                ):\n                    return True\n        if (\n            \"include_columns\" in option\n            and dataframe.ww.name in option[\"include_columns\"]\n        ):\n            return False\n        elif \"include_dataframes\" in option:\n            return dataframe.ww.name not in option[\"include_dataframes\"]\n        elif dataframe.ww.name in option[\"ignore_dataframes\"]:\n            return True\n        else:\n            return False\n\n    return any([should_ignore_dataframe(option) for option in options])\n\n\ndef filter_groupby_matches_by_options(groupby_matches, options):\n    return filter_matches_by_options(\n        [(groupby_match,) for groupby_match in groupby_matches],\n        options,\n        groupby=True,\n    )\n\n\ndef filter_matches_by_options(matches, options, groupby=False, commutative=False):\n    # If more than one option, than need to 
handle each for each input\n    if len(options) > 1:\n\n        def is_valid_match(match):\n            if all(\n                [\n                    column_filter(m, option, groupby)\n                    for m, option in zip(match, options)\n                ],\n            ):\n                return True\n            else:\n                return False\n\n    else:\n\n        def is_valid_match(match):\n            if all([column_filter(f, options[0], groupby) for f in match]):\n                return True\n            else:\n                return False\n\n    valid_matches = set()\n    for match in matches:\n        if is_valid_match(match):\n            valid_matches.add(match)\n        elif commutative:\n            for order in permutations(match):\n                if is_valid_match(order):\n                    valid_matches.add(order)\n                    break\n\n    return sorted(\n        valid_matches,\n        key=lambda features: ([feature.unique_name() for feature in features]),\n    )\n"
  },
  {
    "path": "featuretools/primitives/standard/__init__.py",
    "content": "# flake8: noqa\nfrom featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive\nfrom featuretools.primitives.base.transform_primitive_base import TransformPrimitive\nfrom featuretools.primitives.standard.aggregation import *\nfrom featuretools.primitives.standard.transform import *\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/__init__.py",
    "content": "from featuretools.primitives.standard.aggregation.all_primitive import All\nfrom featuretools.primitives.standard.aggregation.any_primitive import Any\nfrom featuretools.primitives.standard.aggregation.avg_time_between import AvgTimeBetween\nfrom featuretools.primitives.standard.aggregation.average_count_per_unique import (\n    AverageCountPerUnique,\n)\nfrom featuretools.primitives.standard.aggregation.count import Count\nfrom featuretools.primitives.standard.aggregation.count_above_mean import CountAboveMean\nfrom featuretools.primitives.standard.aggregation.count_below_mean import CountBelowMean\nfrom featuretools.primitives.standard.aggregation.count_greater_than import (\n    CountGreaterThan,\n)\nfrom featuretools.primitives.standard.aggregation.count_inside_nth_std import (\n    CountInsideNthSTD,\n)\nfrom featuretools.primitives.standard.aggregation.count_inside_range import (\n    CountInsideRange,\n)\nfrom featuretools.primitives.standard.aggregation.count_less_than import CountLessThan\nfrom featuretools.primitives.standard.aggregation.count_outside_nth_std import (\n    CountOutsideNthSTD,\n)\nfrom featuretools.primitives.standard.aggregation.count_outside_range import (\n    CountOutsideRange,\n)\nfrom featuretools.primitives.standard.aggregation.date_first_event import DateFirstEvent\nfrom featuretools.primitives.standard.aggregation.entropy import Entropy\nfrom featuretools.primitives.standard.aggregation.first import First\nfrom featuretools.primitives.standard.aggregation.first_last_time_delta import (\n    FirstLastTimeDelta,\n)\nfrom featuretools.primitives.standard.aggregation.kurtosis import Kurtosis\nfrom featuretools.primitives.standard.aggregation.is_unique import IsUnique\nfrom featuretools.primitives.standard.aggregation.last import Last\nfrom featuretools.primitives.standard.aggregation.max_primitive import Max\nfrom featuretools.primitives.standard.aggregation.max_consecutive_false import (\n    
MaxConsecutiveFalse,\n)\nfrom featuretools.primitives.standard.aggregation.max_consecutive_negatives import (\n    MaxConsecutiveNegatives,\n)\nfrom featuretools.primitives.standard.aggregation.max_consecutive_positives import (\n    MaxConsecutivePositives,\n)\nfrom featuretools.primitives.standard.aggregation.max_consecutive_true import (\n    MaxConsecutiveTrue,\n)\nfrom featuretools.primitives.standard.aggregation.max_consecutive_zeros import (\n    MaxConsecutiveZeros,\n)\nfrom featuretools.primitives.standard.aggregation.mean import Mean\nfrom featuretools.primitives.standard.aggregation.median import Median\nfrom featuretools.primitives.standard.aggregation.max_count import MaxCount\nfrom featuretools.primitives.standard.aggregation.median_count import MedianCount\nfrom featuretools.primitives.standard.aggregation.max_min_delta import MaxMinDelta\nfrom featuretools.primitives.standard.aggregation.min_count import MinCount\nfrom featuretools.primitives.standard.aggregation.min_primitive import Min\nfrom featuretools.primitives.standard.aggregation.mode import Mode\nfrom featuretools.primitives.standard.aggregation.n_unique_days import NUniqueDays\nfrom featuretools.primitives.standard.aggregation.n_unique_days_of_calendar_year import (\n    NUniqueDaysOfCalendarYear,\n)\nfrom featuretools.primitives.standard.aggregation.n_unique_days_of_month import (\n    NUniqueDaysOfMonth,\n)\nfrom featuretools.primitives.standard.aggregation.has_no_duplicates import (\n    HasNoDuplicates,\n)\nfrom featuretools.primitives.standard.aggregation.is_monotonically_decreasing import (\n    IsMonotonicallyDecreasing,\n)\nfrom featuretools.primitives.standard.aggregation.is_monotonically_increasing import (\n    IsMonotonicallyIncreasing,\n)\nfrom featuretools.primitives.standard.aggregation.n_unique_months import NUniqueMonths\nfrom featuretools.primitives.standard.aggregation.n_unique_weeks import NUniqueWeeks\nfrom featuretools.primitives.standard.aggregation.n_most_common 
import NMostCommon\nfrom featuretools.primitives.standard.aggregation.n_most_common_frequency import (\n    NMostCommonFrequency,\n)\nfrom featuretools.primitives.standard.aggregation.num_true import NumTrue\nfrom featuretools.primitives.standard.aggregation.num_peaks import NumPeaks\nfrom featuretools.primitives.standard.aggregation.num_zero_crossings import (\n    NumZeroCrossings,\n)\nfrom featuretools.primitives.standard.aggregation.num_true_since_last_false import (\n    NumTrueSinceLastFalse,\n)\nfrom featuretools.primitives.standard.aggregation.num_false_since_last_true import (\n    NumFalseSinceLastTrue,\n)\nfrom featuretools.primitives.standard.aggregation.num_consecutive_greater_mean import (\n    NumConsecutiveGreaterMean,\n)\nfrom featuretools.primitives.standard.aggregation.num_consecutive_less_mean import (\n    NumConsecutiveLessMean,\n)\nfrom featuretools.primitives.standard.aggregation.num_unique import NumUnique\nfrom featuretools.primitives.standard.aggregation.percent_unique import PercentUnique\nfrom featuretools.primitives.standard.aggregation.percent_true import PercentTrue\nfrom featuretools.primitives.standard.aggregation.skew import Skew\nfrom featuretools.primitives.standard.aggregation.std import Std\nfrom featuretools.primitives.standard.aggregation.sum_primitive import Sum\nfrom featuretools.primitives.standard.aggregation.time_since_first import TimeSinceFirst\nfrom featuretools.primitives.standard.aggregation.time_since_last import TimeSinceLast\nfrom featuretools.primitives.standard.aggregation.time_since_last_true import (\n    TimeSinceLastTrue,\n)\nfrom featuretools.primitives.standard.aggregation.time_since_last_min import (\n    TimeSinceLastMin,\n)\nfrom featuretools.primitives.standard.aggregation.time_since_last_max import (\n    TimeSinceLastMax,\n)\nfrom featuretools.primitives.standard.aggregation.time_since_last_false import (\n    TimeSinceLastFalse,\n)\nfrom featuretools.primitives.standard.aggregation.trend import 
Trend\nfrom featuretools.primitives.standard.aggregation.variance import Variance\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/all_primitive.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Boolean, BooleanNullable\n\nfrom featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive\n\n\nclass All(AggregationPrimitive):\n    \"\"\"Calculates if all values are 'True' in a list.\n\n    Description:\n        Given a list of booleans, return `True` if all\n        of the values are `True`.\n\n    Examples:\n        >>> all = All()\n        >>> all([False, False, False, True])\n        False\n    \"\"\"\n\n    name = \"all\"\n    input_types = [\n        [ColumnSchema(logical_type=Boolean)],\n        [ColumnSchema(logical_type=BooleanNullable)],\n    ]\n    return_type = ColumnSchema(logical_type=Boolean)\n    stack_on_self = False\n    description_template = \"whether all of {} are true\"\n\n    def get_function(self):\n        return np.all\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/any_primitive.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Boolean, BooleanNullable\n\nfrom featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive\n\n\nclass Any(AggregationPrimitive):\n    \"\"\"Determines if any value is 'True' in a list.\n\n    Description:\n        Given a list of booleans, return `True` if one or\n        more of the values are `True`.\n\n    Examples:\n        >>> any = Any()\n        >>> any([False, False, False, True])\n        True\n    \"\"\"\n\n    name = \"any\"\n    input_types = [\n        [ColumnSchema(logical_type=Boolean)],\n        [ColumnSchema(logical_type=BooleanNullable)],\n    ]\n    return_type = ColumnSchema(logical_type=Boolean)\n    stack_on_self = False\n    description_template = \"whether any of {} are true\"\n\n    def get_function(self):\n        return np.any\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/average_count_per_unique.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Double\n\nfrom featuretools.primitives.base import AggregationPrimitive\n\n\nclass AverageCountPerUnique(AggregationPrimitive):\n    \"\"\"Determines the average count across all unique value.\n\n    Args:\n        skipna (bool): Determines if to use NA/null values.\n            Defaults to True to skip NA/null.\n\n    Examples:\n        Determine the average count values for all unique items\n        in the input\n        >>> input = [1, 1, 2, 2, 3, 4, 5, 6, 7, 8]\n        >>> avg_count_per_unique = AverageCountPerUnique()\n        >>> avg_count_per_unique(input)\n        1.25\n\n        Determine the average count values for all unique items\n        in the input with nan values ignored\n        >>> input = [1, 1, 2, 2, 3, 4, 5, None, 6, 7, 8]\n        >>> avg_count_per_unique = AverageCountPerUnique()\n        >>> avg_count_per_unique(input)\n        1.25\n\n        Determine the average count values for all unique items\n        in the input with nan values included\n        >>> input = [1, 2, 2, 3, 4, 5, None, 6, 7, 8, 9]\n        >>> avg_count_per_unique_skipna_false = AverageCountPerUnique(skipna=False)\n        >>> avg_count_per_unique_skipna_false(input)\n        1.1\n    \"\"\"\n\n    name = \"average_count_per_unique\"\n    input_types = [ColumnSchema(semantic_tags={\"category\"})]\n    return_type = ColumnSchema(logical_type=Double, semantic_tags={\"numeric\"})\n    default_value = 0\n\n    def __init__(self, skipna=True):\n        self.skipna = skipna\n\n    def get_function(self):\n        def average_count_per_unique(x):\n            return x.value_counts(\n                dropna=self.skipna,\n            ).mean(skipna=self.skipna)\n\n        return average_count_per_unique\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/avg_time_between.py",
    "content": "from datetime import datetime\n\nimport numpy as np\nimport pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime, Double\n\nfrom featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive\nfrom featuretools.utils import convert_time_units\n\n\nclass AvgTimeBetween(AggregationPrimitive):\n    \"\"\"Computes the average number of seconds between consecutive events.\n\n    Description:\n        Given a list of datetimes, return the average time (default in seconds)\n        elapsed between consecutive events. If there are fewer\n        than 2 non-null values, return `NaN`.\n\n    Args:\n        unit (str): Defines the unit of time.\n            Defaults to seconds. Acceptable values:\n            years, months, days, hours, minutes, seconds, milliseconds, nanoseconds\n\n    Examples:\n        >>> from datetime import datetime\n        >>> avg_time_between = AvgTimeBetween()\n        >>> times = [datetime(2010, 1, 1, 11, 45, 0),\n        ...          datetime(2010, 1, 1, 11, 55, 15),\n        ...          
datetime(2010, 1, 1, 11, 57, 30)]\n        >>> avg_time_between(times)\n        375.0\n        >>> avg_time_between = AvgTimeBetween(unit=\"minutes\")\n        >>> avg_time_between(times)\n        6.25\n    \"\"\"\n\n    name = \"avg_time_between\"\n    input_types = [ColumnSchema(logical_type=Datetime, semantic_tags={\"time_index\"})]\n    return_type = ColumnSchema(logical_type=Double, semantic_tags={\"numeric\"})\n    description_template = \"the average time between each of {}\"\n\n    def __init__(self, unit=\"seconds\"):\n        self.unit = unit.lower()\n\n    def get_function(self):\n        def pd_avg_time_between(x):\n            \"\"\"Assumes time scales are closer to order\n            of seconds than to nanoseconds\n            if times are much closer to nanoseconds\n            we could get some floating point errors\n\n            this can be fixed with another function\n            that calculates the mean before converting\n            to seconds\n            \"\"\"\n            x = x.dropna()\n            if x.shape[0] < 2:\n                return np.nan\n            if isinstance(x.iloc[0], (pd.Timestamp, datetime)):\n                x = x.view(\"int64\")\n                # use len(x)-1 because we care about difference\n                # between values, len(x)-1 = len(diff(x))\n\n            avg = (x.max() - x.min()) / (len(x) - 1)\n            avg = avg * 1e-9\n\n            # long form:\n            # diff_in_ns = x.diff().iloc[1:].astype('int64')\n            # diff_in_seconds = diff_in_ns * 1e-9\n            # avg = diff_in_seconds.mean()\n            return convert_time_units(avg, self.unit)\n\n        return pd_avg_time_between\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/count.py",
    "content": "import pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import IntegerNullable\n\nfrom featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive\n\n\nclass Count(AggregationPrimitive):\n    \"\"\"Determines the total number of values, excluding `NaN`.\n\n    Examples:\n        >>> count = Count()\n        >>> count([1, 2, 3, 4, 5, None])\n        5\n    \"\"\"\n\n    name = \"count\"\n    input_types = [ColumnSchema(semantic_tags={\"index\"})]\n    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={\"numeric\"})\n    stack_on_self = False\n    default_value = 0\n    description_template = \"the number\"\n\n    def get_function(self):\n        return pd.Series.count\n\n    def generate_name(\n        self,\n        base_feature_names,\n        relationship_path_name,\n        parent_dataframe_name,\n        where_str,\n        use_prev_str,\n    ):\n        return \"COUNT(%s%s%s)\" % (relationship_path_name, where_str, use_prev_str)\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/count_above_mean.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import IntegerNullable\n\nfrom featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive\n\n\nclass CountAboveMean(AggregationPrimitive):\n    \"\"\"Calculates the number of values that are above the mean.\n\n    Args:\n        skipna (bool): Determines if to use NA/null values. Defaults to\n            True to skip NA/null.\n\n    Examples:\n        >>> count_above_mean = CountAboveMean()\n        >>> count_above_mean([1, 2, 3, 4, 5])\n        2\n\n        The way NaNs are treated can be controlled.\n\n        >>> count_above_mean_skipna = CountAboveMean(skipna=False)\n        >>> count_above_mean_skipna([1, 2, 3, 4, 5, None])\n        nan\n    \"\"\"\n\n    name = \"count_above_mean\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={\"numeric\"})\n    stack_on_self = False\n\n    def __init__(self, skipna=True):\n        self.skipna = skipna\n\n    def get_function(self):\n        def count_above_mean(x):\n            mean = x.mean(skipna=self.skipna)\n            if np.isnan(mean):\n                return np.nan\n            return len(x[x > mean])\n\n        return count_above_mean\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/count_below_mean.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import IntegerNullable\n\nfrom featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive\n\n\nclass CountBelowMean(AggregationPrimitive):\n    \"\"\"Determines the number of values that are below the mean.\n\n    Args:\n        skipna (bool): Determines if to use NA/null values. Defaults to\n            True to skip NA/null.\n\n    Examples:\n        >>> count_below_mean = CountBelowMean()\n        >>> count_below_mean([1, 2, 3, 4, 10])\n        3\n\n        The way NaNs are treated can be controlled.\n\n        >>> count_below_mean_skipna = CountBelowMean(skipna=False)\n        >>> count_below_mean_skipna([1, 2, 3, 4, 5, None])\n        nan\n    \"\"\"\n\n    name = \"count_below_mean\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={\"numeric\"})\n    stack_on_self = False\n\n    def __init__(self, skipna=True):\n        self.skipna = skipna\n\n    def get_function(self):\n        def count_below_mean(x):\n            mean = x.mean(skipna=self.skipna)\n            if np.isnan(mean):\n                return np.nan\n            return len(x[x < mean])\n\n        return count_below_mean\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/count_greater_than.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Integer\n\nfrom featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive\n\n\nclass CountGreaterThan(AggregationPrimitive):\n    \"\"\"Determines the number of values greater than a controllable threshold.\n\n    Args:\n        threshold (float): The threshold to use when counting the number\n            of values greater than. Defaults to 10.\n\n    Examples:\n        >>> count_greater_than = CountGreaterThan(threshold=3)\n        >>> count_greater_than([1, 2, 3, 4, 5])\n        2\n    \"\"\"\n\n    name = \"count_greater_than\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(logical_type=Integer, semantic_tags={\"numeric\"})\n    stack_on_self = False\n    default_value = 0\n\n    def __init__(self, threshold=10):\n        self.threshold = threshold\n\n    def get_function(self):\n        def count_greater_than(x):\n            return x[x > self.threshold].count()\n\n        return count_greater_than\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/count_inside_nth_std.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Integer\n\nfrom featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive\n\n\nclass CountInsideNthSTD(AggregationPrimitive):\n    \"\"\"Determines the count of observations that lie inside\n        the first N standard deviations (inclusive).\n\n    Args:\n        n (float): Number of standard deviations. Default is 1\n\n    Examples:\n        >>> count_inside_nth_std = CountInsideNthSTD(n=1.5)\n        >>> count_inside_nth_std([1, 10, 15, 20, 100])\n        4\n    \"\"\"\n\n    name = \"count_inside_nth_std\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(logical_type=Integer, semantic_tags={\"numeric\"})\n    stack_on_self = False\n    default_value = 0\n\n    def __init__(self, n=1):\n        if n < 0:\n            raise ValueError(\"n must be a positive number\")\n\n        self.n = n\n\n    def get_function(self):\n        def count_inside_nth_std(x):\n            cond = np.abs(x - np.mean(x)) <= np.std(x) * self.n\n            return cond.sum()\n\n        return count_inside_nth_std\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/count_inside_range.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import IntegerNullable\n\nfrom featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive\n\n\nclass CountInsideRange(AggregationPrimitive):\n    \"\"\"Determines the number of values that fall within a certain range.\n\n    Args:\n        lower (float): Lower boundary of range (inclusive). Default is 0.\n        upper (float): Upper boundary of range (inclusive). Default is 1.\n        skipna (bool): If this is False any value in x is NaN then\n            the result will be NaN. If True, `nan` values are skipped.\n            Default is True.\n\n    Examples:\n        >>> count_inside_range = CountInsideRange(lower=1.5, upper=3.6)\n        >>> count_inside_range([1, 2, 3, 4, 5])\n        2\n\n        The way NaNs are treated can be controlled.\n\n        >>> count_inside_range_skipna = CountInsideRange(skipna=False)\n        >>> count_inside_range_skipna([1, 2, 3, 4, 5, None])\n        nan\n    \"\"\"\n\n    name = \"count_inside_range\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={\"numeric\"})\n    stack_on_self = False\n    default_value = 0\n\n    def __init__(self, lower=0, upper=1, skipna=True):\n        self.lower = lower\n        self.upper = upper\n        self.skipna = skipna\n\n    def get_function(self):\n        def count_inside_range(x):\n            if not self.skipna and x.isnull().values.any():\n                return np.nan\n            cond = (self.lower <= x) & (x <= self.upper)\n            return cond.sum()\n\n        return count_inside_range\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/count_less_than.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Integer\n\nfrom featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive\n\n\nclass CountLessThan(AggregationPrimitive):\n    \"\"\"Determines the number of values less than a controllable threshold.\n\n    Args:\n        threshold (float): The threshold to use when counting the number\n            of values less than. Defaults to 10.\n\n    Examples:\n        >>> count_less_than = CountLessThan(threshold=3.5)\n        >>> count_less_than([1, 2, 3, 4, 5])\n        3\n    \"\"\"\n\n    name = \"count_less_than\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(logical_type=Integer, semantic_tags={\"numeric\"})\n    stack_on_self = False\n    default_value = 0\n\n    def __init__(self, threshold=10):\n        self.threshold = threshold\n\n    def get_function(self):\n        def count_less_than(x):\n            return x[x < self.threshold].count()\n\n        return count_less_than\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/count_outside_nth_std.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Integer\n\nfrom featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive\n\n\nclass CountOutsideNthSTD(AggregationPrimitive):\n    \"\"\"Determines the number of observations that lie outside\n        the first N standard deviations.\n\n    Args:\n        n (float): Number of standard deviations. Default is 1\n\n    Examples:\n        >>> count_outside_nth_std = CountOutsideNthSTD(n=1.5)\n        >>> count_outside_nth_std([1, 10, 15, 20, 100])\n        1\n    \"\"\"\n\n    name = \"count_outside_nth_std\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(logical_type=Integer, semantic_tags={\"numeric\"})\n    stack_on_self = False\n    default_value = 0\n\n    def __init__(self, n=1):\n        if n < 0:\n            raise ValueError(\"n must be a positive number\")\n\n        self.n = n\n\n    def get_function(self):\n        def count_outside_nth_std(x):\n            cond = np.abs(x - np.mean(x)) > np.std(x) * self.n\n            return cond.sum()\n\n        return count_outside_nth_std\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/count_outside_range.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import IntegerNullable\n\nfrom featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive\n\n\nclass CountOutsideRange(AggregationPrimitive):\n    \"\"\"Determines the number of values that fall outside a certain range.\n\n    Args:\n        lower (float): Lower boundary of range (exclusive). Default is 0.\n        upper (float): Upper boundary of range (exclusive). Default is 1.\n        skipna (bool): Determines if to use NA/null values. Defaults to\n            True to skip NA/null.\n\n    Examples:\n        >>> count_outside_range = CountOutsideRange(lower=1.5, upper=3.6)\n        >>> count_outside_range([1, 2, 3, 4, 5])\n        3\n\n        The way NaNs are treated can be controlled.\n\n        >>> count_outside_range_skipna = CountOutsideRange(skipna=False)\n        >>> count_outside_range_skipna([1, 2, 3, 4, 5, None])\n        nan\n    \"\"\"\n\n    name = \"count_outside_range\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={\"numeric\"})\n    stack_on_self = False\n    default_value = 0\n\n    def __init__(self, lower=0, upper=1, skipna=True):\n        self.lower = lower\n        self.upper = upper\n        self.skipna = skipna\n\n    def get_function(self):\n        def count_outside_range(x):\n            if not self.skipna and x.isnull().values.any():\n                return np.nan\n            cond = (x < self.lower) | (x > self.upper)\n            return cond.sum()\n\n        return count_outside_range\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/date_first_event.py",
    "content": "from pandas import NaT\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime\n\nfrom featuretools.primitives.base import AggregationPrimitive\n\n\nclass DateFirstEvent(AggregationPrimitive):\n    \"\"\"Determines the first datetime from a list of datetimes.\n\n    Examples:\n        >>> from datetime import datetime\n        >>> date_first_event = DateFirstEvent()\n        >>> date_first_event([\n        ...     datetime(2011, 4, 9, 10, 30, 10),\n        ...     datetime(2011, 4, 9, 10, 30, 20),\n        ...     datetime(2011, 4, 9, 10, 30, 30)])\n        Timestamp('2011-04-09 10:30:10')\n    \"\"\"\n\n    name = \"date_first_event\"\n    input_types = [ColumnSchema(logical_type=Datetime, semantic_tags={\"time_index\"})]\n    return_type = ColumnSchema(logical_type=Datetime)\n    stack_on_self = False\n    default_value = 0\n\n    def get_function(self):\n        def date_first_event(x):\n            x = x.dropna()\n            if x.empty:\n                return NaT\n            return x.iat[0]\n\n        return date_first_event\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/entropy.py",
    "content": "from scipy import stats\nfrom woodwork.column_schema import ColumnSchema\n\nfrom featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive\n\n\nclass Entropy(AggregationPrimitive):\n    \"\"\"Calculates the entropy for a categorical column\n\n    Description:\n        Given a list of observations from a categorical\n        column return the entropy of the distribution.\n        NaN values can be treated as a category or\n        dropped.\n\n    Args:\n        dropna (bool): Whether to consider NaN values as a separate category\n            Defaults to False.\n        base (float): The logarithmic base to use\n            Defaults to e (natural logarithm)\n\n    Examples:\n        >>> pd_entropy = Entropy()\n        >>> pd_entropy([1, 2, 3, 4])\n        1.3862943611198906\n    \"\"\"\n\n    name = \"entropy\"\n    input_types = [ColumnSchema(semantic_tags={\"category\"})]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n    stack_on_self = False\n    description_template = \"the entropy of {}\"\n\n    def __init__(self, dropna=False, base=None):\n        self.dropna = dropna\n        self.base = base\n\n    def get_function(self):\n        def pd_entropy(s):\n            distribution = s.value_counts(normalize=True, dropna=self.dropna)\n            if distribution.dtype == \"Float64\":\n                distribution = distribution.astype(\"float64\")\n            return stats.entropy(distribution.to_numpy(), base=self.base)\n\n        return pd_entropy\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/first.py",
    "content": "from woodwork.column_schema import ColumnSchema\n\nfrom featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive\n\n\nclass First(AggregationPrimitive):\n    \"\"\"Determines the first value in a list.\n\n    Examples:\n        >>> first = First()\n        >>> first([1, 2, 3, 4, 5, None])\n        1.0\n    \"\"\"\n\n    name = \"first\"\n    input_types = [ColumnSchema()]\n    return_type = None\n    stack_on_self = False\n    description_template = \"the first instance of {}\"\n\n    def get_function(self):\n        def pd_first(x):\n            return x.iloc[0]\n\n        return pd_first\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/first_last_time_delta.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime, Double\n\nfrom featuretools.primitives.base import AggregationPrimitive\n\n\nclass FirstLastTimeDelta(AggregationPrimitive):\n    \"\"\"Determines the time between the first and last time value\n        in seconds.\n\n    Examples:\n        >>> from datetime import datetime\n        >>> first_last_time_delta = FirstLastTimeDelta()\n        >>> first_last_time_delta([\n        ...     datetime(2011, 4, 9, 10, 30, 0),\n        ...     datetime(2011, 4, 9, 10, 30, 15),\n        ...     datetime(2011, 4, 9, 10, 30, 35)])\n        35.0\n    \"\"\"\n\n    name = \"first_last_time_delta\"\n    input_types = [ColumnSchema(logical_type=Datetime, semantic_tags={\"time_index\"})]\n    return_type = ColumnSchema(logical_type=Double, semantic_tags={\"numeric\"})\n    uses_calc_time = False\n    stack_on_self = False\n    default_value = 0\n\n    def get_function(self):\n        def first_last_time_delta(datetime_col):\n            datetime_col = datetime_col.dropna()\n            if datetime_col.empty:\n                return np.nan\n            delta = datetime_col.iloc[-1] - datetime_col.iloc[0]\n            return delta.total_seconds()\n\n        return first_last_time_delta\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/has_no_duplicates.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import BooleanNullable\n\nfrom featuretools.primitives.base import AggregationPrimitive\n\n\nclass HasNoDuplicates(AggregationPrimitive):\n    \"\"\"Determines if there are duplicates in the input.\n\n    Args:\n        skipna (bool): Determines if to use NA/null values.\n            Defaults to True to skip NA/null.\n\n    Examples:\n        >>> has_no_duplicates = HasNoDuplicates()\n        >>> has_no_duplicates([1, 1, 2])\n        False\n        >>> has_no_duplicates([1, 2, 3])\n        True\n\n        NaNs are skipped by default.\n\n        >>> has_no_duplicates([1, 2, 3, None, None])\n        True\n\n        However, the way NaNs are treated can be controlled.\n\n        >>> has_no_duplicates_skipna = HasNoDuplicates(skipna=False)\n        >>> has_no_duplicates_skipna([1, 2, 3, None, None])\n        False\n        >>> has_no_duplicates_skipna([1, 2, 3, None])\n        True\n    \"\"\"\n\n    name = \"has_no_duplicates\"\n    input_types = [\n        [ColumnSchema(semantic_tags={\"category\"})],\n        [ColumnSchema(semantic_tags={\"numeric\"})],\n    ]\n    return_type = ColumnSchema(logical_type=BooleanNullable)\n    stack_on_self = False\n    default_value = True\n\n    def __init__(self, skipna=True):\n        self.skipna = skipna\n\n    def get_function(self):\n        def has_no_duplicates(data):\n            if self.skipna:\n                data = data.dropna()\n            return not data.duplicated().any()\n\n        return has_no_duplicates\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/is_monotonically_decreasing.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import BooleanNullable\n\nfrom featuretools.primitives.base import AggregationPrimitive\n\n\nclass IsMonotonicallyDecreasing(AggregationPrimitive):\n    \"\"\"Determines if a series is monotonically decreasing.\n\n    Description:\n        Given a list of numeric values, return True if the\n        values are strictly decreasing. If the series contains\n        `NaN` values, they will be skipped.\n\n    Examples:\n        >>> is_monotonically_decreasing = IsMonotonicallyDecreasing()\n        >>> is_monotonically_decreasing([9, 5, 3, 1])\n        True\n    \"\"\"\n\n    name = \"is_monotonically_decreasing\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(logical_type=BooleanNullable)\n    stack_on_self = False\n    default_value = False\n\n    def get_function(self):\n        def is_monotonically_decreasing(x):\n            return x.dropna().is_monotonic_decreasing\n\n        return is_monotonically_decreasing\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/is_monotonically_increasing.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import BooleanNullable\n\nfrom featuretools.primitives.base import AggregationPrimitive\n\n\nclass IsMonotonicallyIncreasing(AggregationPrimitive):\n    \"\"\"Determines if a series is monotonically increasing.\n\n    Description:\n        Given a list of numeric values, return True if the\n        values are strictly increasing. If the series contains\n        `NaN` values, they will be skipped.\n\n    Examples:\n        >>> is_monotonically_increasing = IsMonotonicallyIncreasing()\n        >>> is_monotonically_increasing([1, 3, 5, 9])\n        True\n    \"\"\"\n\n    name = \"is_monotonically_increasing\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(logical_type=BooleanNullable)\n    stack_on_self = False\n    default_value = False\n\n    def get_function(self):\n        def is_monotonically_increasing(x):\n            return x.dropna().is_monotonic_increasing\n\n        return is_monotonically_increasing\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/is_unique.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import BooleanNullable\n\nfrom featuretools.primitives.base import AggregationPrimitive\n\n\nclass IsUnique(AggregationPrimitive):\n    \"\"\"Determines whether or not a series of discrete is all unique.\n\n    Description:\n        Given a series of discrete values, return True if each\n        value in the series is unique. If any value is repeated,\n        return False.\n\n    Examples:\n        >>> is_unique = IsUnique()\n        >>> is_unique(['red', 'blue', 'green', 'yellow'])\n        True\n\n        If the series is not unique, return False\n\n        >>> is_unique = IsUnique()\n        >>> is_unique(['red', 'blue', 'green', 'blue'])\n        False\n    \"\"\"\n\n    name = \"is_unique\"\n    input_types = [ColumnSchema(semantic_tags={\"category\"})]\n    return_type = ColumnSchema(logical_type=BooleanNullable)\n    stack_on_self = False\n    default_value = False\n\n    def get_function(self):\n        def is_unique(x):\n            return x.is_unique\n\n        return is_unique\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/kurtosis.py",
    "content": "from scipy.stats import kurtosis\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Double, Integer\n\nfrom featuretools.primitives.base import AggregationPrimitive\n\n\nclass Kurtosis(AggregationPrimitive):\n    \"\"\"Calculates the kurtosis for a list of numbers\n\n    Args:\n        fisher (bool): Optional. If True, Fisher's definition is used\n            (normal ==> 0.0). If False, Pearson's definition is used\n            (normal ==> 3.0). Default is True.\n        bias (bool): Optional. If False, then the calculations are\n            corrected for statistical bias. Default is True.\n        nan_policy (str): Optional. Defines how to handle when\n            input contains Nan. Possible values include\n            `['propagate', 'raise', 'omit']`. 'propagate'\n            returns Nan, 'raise' throws an error, 'omit'\n            performs the calculations ignoring Nan values.\n            Default is 'propagate'.\n\n    Examples:\n        >>> kurtosis = Kurtosis()\n        >>> kurtosis([1, 2, 3, 4, 5])\n        -1.3\n\n        You can use Pearson's definition by setting the 'fisher' argument to False\n\n        >>> kurtosis_fisher = Kurtosis(fisher=False)\n        >>> kurtosis_fisher([1, 2, 3, 4, 5])\n        1.7\n\n        You can correct for statistical bias by setting the 'bias' argument to False\n\n        >>> kurtosis_bias = Kurtosis(bias=False)\n        >>> kurtosis_bias([1, 2, 3, 4, 5])\n        -1.2000000000000004\n\n        You can specifiy how to handle NaN values in the input with the 'nan_policy'\n        argument\n\n        >>> kurtosis_nan_policy = Kurtosis(nan_policy='omit')\n        >>> kurtosis_nan_policy([1, 2, None, 3, 4, 5])\n        -1.3\n    \"\"\"\n\n    name = \"kurtosis\"\n    input_types = [\n        [ColumnSchema(logical_type=Integer, semantic_tags={\"numeric\"})],\n        [ColumnSchema(logical_type=Double, semantic_tags={\"numeric\"})],\n    ]\n    return_type = 
ColumnSchema(logical_type=Double, semantic_tags={\"numeric\"})\n    stack_on_self = False\n    default_value = 0\n\n    def __init__(self, fisher=True, bias=True, nan_policy=\"propagate\"):\n        if nan_policy not in [\"propagate\", \"raise\", \"omit\"]:\n            raise ValueError(\"Invalid nan_policy\")\n        self.fisher = fisher\n        self.bias = bias\n        self.nan_policy = nan_policy\n\n    def get_function(self):\n        def kurtosis_func(x):\n            return kurtosis(\n                x,\n                axis=0,\n                fisher=self.fisher,\n                bias=self.bias,\n                nan_policy=self.nan_policy,\n            )\n\n        return kurtosis_func\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/last.py",
    "content": "from woodwork.column_schema import ColumnSchema\n\nfrom featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive\n\n\nclass Last(AggregationPrimitive):\n    \"\"\"Determines the last value in a list.\n\n    Examples:\n        >>> last = Last()\n        >>> last([1, 2, 3, 4, 5, None])\n        nan\n    \"\"\"\n\n    name = \"last\"\n    input_types = [ColumnSchema()]\n    return_type = None\n    stack_on_self = False\n    description_template = \"the last instance of {}\"\n\n    def get_function(self):\n        def pd_last(x):\n            return x.iloc[-1]\n\n        return pd_last\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/max_consecutive_false.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Boolean, Integer\n\nfrom featuretools.primitives.base import AggregationPrimitive\n\n\nclass MaxConsecutiveFalse(AggregationPrimitive):\n    \"\"\"Determines the maximum number of consecutive False values in the input\n\n    Examples:\n        >>> max_consecutive_false = MaxConsecutiveFalse()\n        >>> max_consecutive_false([True, False, False, True, True, False])\n        2\n    \"\"\"\n\n    name = \"max_consecutive_false\"\n    input_types = [ColumnSchema(logical_type=Boolean)]\n    return_type = ColumnSchema(logical_type=Integer, semantic_tags={\"numeric\"})\n    stack_on_self = False\n    default_value = 0\n\n    def get_function(self):\n        def max_consecutive_false(x):\n            # invert the input array to work properly with the computation\n            x[x.notnull()] = ~(x[x.notnull()].astype(bool))\n            # find the locations where the value changes from the previous value\n            not_equal = x != x.shift()\n            # Use cumulative sum to determine where consecutive values occur. When the\n            # sum changes, consecutive False values are present, when the cumulative\n            # sum remains unchnaged, consecutive True values are present.\n            not_equal_sum = not_equal.cumsum()\n            # group the input by the cumulative sum values and use cumulative count\n            # to count the number of consecutive values. Add 1 to account for the cumulative\n            # sum starting at zero where the first True occurs\n            consecutive = x.groupby(not_equal_sum).cumcount() + 1\n            # multiply by the inverted input to keep only the counts that correspond to\n            # false values\n            consecutive_false = consecutive * x\n            # return the max of all the consecutive false values\n            return consecutive_false.max()\n\n        return max_consecutive_false\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/max_consecutive_negatives.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Double, Integer\n\nfrom featuretools.primitives.base import AggregationPrimitive\n\n\nclass MaxConsecutiveNegatives(AggregationPrimitive):\n    \"\"\"Determines the maximum number of consecutive negative values in the input\n\n    Args:\n        skipna (bool): Ignore any `NaN` values in the input. Default is True.\n\n    Examples:\n        >>> max_consecutive_negatives = MaxConsecutiveNegatives()\n        >>> max_consecutive_negatives([1.0, -1.4, -2.4, -5.4, 2.9, -4.3])\n        3\n\n        `NaN` values can be ignored with the `skipna` parameter\n\n        >>> max_consecutive_negatives_skipna = MaxConsecutiveNegatives(skipna=False)\n        >>> max_consecutive_negatives_skipna([1.0, 1.4, -2.4, None, -2.9, -4.3])\n        2\n    \"\"\"\n\n    name = \"max_consecutive_negatives\"\n    input_types = [\n        [ColumnSchema(logical_type=Integer)],\n        [ColumnSchema(logical_type=Double)],\n    ]\n    return_type = ColumnSchema(logical_type=Integer, semantic_tags={\"numeric\"})\n    stack_on_self = False\n    default_value = 0\n\n    def __init__(self, skipna=True):\n        self.skipna = skipna\n\n    def get_function(self):\n        def max_consecutive_negatives(x):\n            if self.skipna:\n                x = x.dropna()\n            # convert the numeric values to booleans for processing\n            x[x.notnull()] = x[x.notnull()].lt(0)\n            # find the locations where the value changes from the previous value\n            not_equal = x != x.shift()\n            # Use cumulative sum to determine where consecutive values occur. 
The\n            # sum increases each time the value changes and remains unchanged\n            # across a run of identical values.\n            not_equal_sum = not_equal.cumsum()\n            # group the input by the cumulative sum values and use cumulative count\n            # to count the number of consecutive values. Add 1 to account for the cumulative\n            # sum starting at zero where the first negative occurs\n            consecutive = x.groupby(not_equal_sum).cumcount() + 1\n            # multiply by the boolean input to keep only the counts that correspond to\n            # negative values\n            consecutive_neg = consecutive * x\n            # return the max of all the consecutive negative values\n            return consecutive_neg.max()\n\n        return max_consecutive_negatives\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/max_consecutive_positives.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Double, Integer\n\nfrom featuretools.primitives.base import AggregationPrimitive\n\n\nclass MaxConsecutivePositives(AggregationPrimitive):\n    \"\"\"Determines the maximum number of consecutive positive values in the input\n\n    Args:\n        skipna (bool): Ignore any `NaN` values in the input. Default is True.\n\n    Examples:\n        >>> max_consecutive_positives = MaxConsecutivePositives()\n        >>> max_consecutive_positives([1.0, -1.4, 2.4, 5.4, 2.9, -4.3])\n        3\n\n        `NaN` values can be ignored with the `skipna` parameter\n\n        >>> max_consecutive_positives_skipna = MaxConsecutivePositives(skipna=False)\n        >>> max_consecutive_positives_skipna([1.0, -1.4, 2.4, None, 2.9, 4.3])\n        2\n    \"\"\"\n\n    name = \"max_consecutive_positives\"\n    input_types = [\n        [ColumnSchema(logical_type=Integer)],\n        [ColumnSchema(logical_type=Double)],\n    ]\n    return_type = ColumnSchema(logical_type=Integer, semantic_tags={\"numeric\"})\n    stack_on_self = False\n    default_value = 0\n\n    def __init__(self, skipna=True):\n        self.skipna = skipna\n\n    def get_function(self):\n        def max_consecutive_positives(x):\n            if self.skipna:\n                x = x.dropna()\n            # convert the numeric values to booleans for processing\n            x[x.notnull()] = x[x.notnull()].gt(0)\n            # find the locations where the value changes from the previous value\n            not_equal = x != x.shift()\n            # Use cumulative sum to determine where consecutive values occur. 
The\n            # sum increases each time the value changes and remains unchanged\n            # across a run of identical values.\n            not_equal_sum = not_equal.cumsum()\n            # group the input by the cumulative sum values and use cumulative count\n            # to count the number of consecutive values. Add 1 to account for the cumulative\n            # sum starting at zero where the first positive occurs\n            consecutive = x.groupby(not_equal_sum).cumcount() + 1\n            # multiply by the boolean input to keep only the counts that correspond to\n            # positive values\n            consecutive_pos = consecutive * x\n            # return the max of all the consecutive positive values\n            return consecutive_pos.max()\n\n        return max_consecutive_positives\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/max_consecutive_true.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Boolean, Integer\n\nfrom featuretools.primitives.base import AggregationPrimitive\n\n\nclass MaxConsecutiveTrue(AggregationPrimitive):\n    \"\"\"Determines the maximum number of consecutive True values in the input\n\n    Examples:\n        >>> max_consecutive_true = MaxConsecutiveTrue()\n        >>> max_consecutive_true([True, False, True, True, True, False])\n        3\n    \"\"\"\n\n    name = \"max_consecutive_true\"\n    input_types = [ColumnSchema(logical_type=Boolean)]\n    return_type = ColumnSchema(logical_type=Integer, semantic_tags={\"numeric\"})\n    stack_on_self = False\n    default_value = 0\n\n    def get_function(self):\n        def max_consecutive_true(x):\n            # find the locations where the value changes from the previous value\n            not_equal = x != x.shift()\n            # use cumulative sum to determine where consecutive values occur. When the\n            # sum changes, consecutive False values are present, when the cumulative\n            # sum remains unchnaged, consecutive True values are present.\n            not_equal_sum = not_equal.cumsum()\n            # group the input by the cumulative sum values and use cumulative count\n            # to count the number of consecutive values. Add 1 to account for the cumulative\n            # sum starting at zero where the first True occurs\n            consecutive = x.groupby(not_equal_sum).cumcount() + 1\n            # multiply by the original input to keep only the counts that correspond to\n            # true values\n            consecutive_true = consecutive * x\n            # return the max of all the consecutive true values\n            return consecutive_true.max()\n\n        return max_consecutive_true\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/max_consecutive_zeros.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Double, Integer\n\nfrom featuretools.primitives.base import AggregationPrimitive\n\n\nclass MaxConsecutiveZeros(AggregationPrimitive):\n    \"\"\"Determines the maximum number of consecutive zero values in the input\n\n    Args:\n        skipna (bool): Ignore any `NaN` values in the input. Default is True.\n\n    Examples:\n        >>> max_consecutive_zeros = MaxConsecutiveZeros()\n        >>> max_consecutive_zeros([1.0, -1.4, 0, 0.0, 0, -4.3])\n        3\n\n        `NaN` values can be ignored with the `skipna` parameter\n\n        >>> max_consecutive_zeros_skipna = MaxConsecutiveZeros(skipna=False)\n        >>> max_consecutive_zeros_skipna([1.0, -1.4, 0, None, 0.0, -4.3])\n        1\n    \"\"\"\n\n    name = \"max_consecutive_zeros\"\n    input_types = [\n        [ColumnSchema(logical_type=Integer)],\n        [ColumnSchema(logical_type=Double)],\n    ]\n    return_type = ColumnSchema(logical_type=Integer, semantic_tags={\"numeric\"})\n    stack_on_self = False\n    default_value = 0\n\n    def __init__(self, skipna=True):\n        self.skipna = skipna\n\n    def get_function(self):\n        def max_consecutive_zeros(x):\n            if self.skipna:\n                x = x.dropna()\n            # convert the numeric values to booleans for processing\n            x[x.notnull()] = x[x.notnull()].eq(0)\n            # find the locations where the value changes from the previous value\n            not_equal = x != x.shift()\n            # Use cumulative sum to determine where consecutive values occur. When the\n            # sum changes, consecutive non-zero values are present, when the cumulative\n            # sum remains unchnaged, consecutive zero values are present.\n            not_equal_sum = not_equal.cumsum()\n            # group the input by the cumulative sum values and use cumulative count\n            # to count the number of consecutive values. 
Add 1 to account for the cumulative\n            # sum starting at zero where the first zero occurs\n            consecutive = x.groupby(not_equal_sum).cumcount() + 1\n            # multiply by the boolean input to keep only the counts that correspond to\n            # zero values\n            consecutive_zero = consecutive * x\n            # return the max of all the consecutive zero values\n            return consecutive_zero.max()\n\n        return max_consecutive_zeros\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/max_count.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\n\nfrom featuretools.primitives.base import AggregationPrimitive\n\n\nclass MaxCount(AggregationPrimitive):\n    \"\"\"Calculates the number of occurrences of the max value in a list\n\n    Args:\n        skipna (bool): Determines if to use NA/null values. Defaults to\n            True to skip NA/null. If skipna is False, and there are NaN\n            values in the array, the max will be NaN regardless of\n            the other values, and NaN will be returned.\n\n    Examples:\n        >>> max_count = MaxCount()\n        >>> max_count([1, 2, 5, 1, 5, 3, 5])\n        3\n\n        You can optionally specify how to handle NaN values\n\n        >>> max_count_skipna = MaxCount(skipna=False)\n        >>> max_count_skipna([1, 2, 5, 1, 5, 3, None])\n        nan\n    \"\"\"\n\n    name = \"max_count\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n\n    def __init__(self, skipna=True):\n        self.skipna = skipna\n\n    def get_function(self):\n        def max_count(x):\n            xmax = x.max(skipna=self.skipna)\n            if np.isnan(xmax):\n                return np.nan\n            return x.eq(xmax).sum()\n\n        return max_count\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/max_min_delta.py",
    "content": "from woodwork.column_schema import ColumnSchema\n\nfrom featuretools.primitives.base import AggregationPrimitive\n\n\nclass MaxMinDelta(AggregationPrimitive):\n    \"\"\"Determines the difference between the max and min value.\n\n    Args:\n        skipna (bool): Determines if to use NA/null values.\n            Defaults to True to skip NA/null.\n\n    Examples:\n        >>> max_min_delta = MaxMinDelta()\n        >>> max_min_delta([7, 2, 5, 3, 10])\n        8\n\n        You can optionally specify how to handle NaN values\n\n        >>> max_min_delta_skipna = MaxMinDelta(skipna=False)\n        >>> max_min_delta_skipna([7, 2, None, 3, 10])\n        nan\n    \"\"\"\n\n    name = \"max_min_delta\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n    stack_on_self = False\n    default_value = 0\n\n    def __init__(self, skipna=True):\n        self.skipna = skipna\n\n    def get_function(self):\n        def max_min_delta(x):\n            max_val = x.max(skipna=self.skipna)\n            min_val = x.min(skipna=self.skipna)\n            return max_val - min_val\n\n        return max_min_delta\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/max_primitive.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\n\nfrom featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive\n\n\nclass Max(AggregationPrimitive):\n    \"\"\"Calculates the highest value, ignoring `NaN` values.\n\n    Examples:\n        >>> max = Max()\n        >>> max([1, 2, 3, 4, 5, None])\n        5.0\n    \"\"\"\n\n    name = \"max\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n    stack_on_self = False\n    description_template = \"the maximum of {}\"\n\n    def get_function(self):\n        return np.max\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/mean.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\n\nfrom featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive\n\n\nclass Mean(AggregationPrimitive):\n    \"\"\"Computes the average for a list of values.\n\n    Args:\n        skipna (bool): Determines if to use NA/null values. Defaults to\n            True to skip NA/null.\n\n    Examples:\n        >>> mean = Mean()\n        >>> mean([1, 2, 3, 4, 5, None])\n        3.0\n\n        We can also control the way `NaN` values are handled.\n\n        >>> mean = Mean(skipna=False)\n        >>> mean([1, 2, 3, 4, 5, None])\n        nan\n    \"\"\"\n\n    name = \"mean\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n    description_template = \"the average of {}\"\n\n    def __init__(self, skipna=True):\n        self.skipna = skipna\n\n    def get_function(self):\n        if self.skipna:\n            # np.mean of series is functionally nanmean\n            return np.mean\n\n        def mean(series):\n            return np.mean(series.values)\n\n        return mean\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/median.py",
    "content": "import pandas as pd\nfrom woodwork.column_schema import ColumnSchema\n\nfrom featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive\n\n\nclass Median(AggregationPrimitive):\n    \"\"\"Determines the middlemost number in a list of values.\n\n    Examples:\n        >>> median = Median()\n        >>> median([5, 3, 2, 1, 4])\n        3.0\n\n        `NaN` values are ignored.\n\n        >>> median([5, 3, 2, 1, 4, None])\n        3.0\n    \"\"\"\n\n    name = \"median\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n    description_template = \"the median of {}\"\n\n    def get_function(self):\n        return pd.Series.median\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/median_count.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import IntegerNullable\n\nfrom featuretools.primitives.base import AggregationPrimitive\n\n\nclass MedianCount(AggregationPrimitive):\n    \"\"\"Calculates the number of occurrences of the median value in a list\n\n    Args:\n        skipna (bool): Determines if to use NA/null values. Defaults to\n            True to skip NA/null. If skipna is False, and there are NaN\n            values in the array, the median will be NaN, regardless of\n            the other values.\n\n    Examples:\n        >>> median_count = MedianCount()\n        >>> median_count([1, 2, 3, 1, 5, 3, 5])\n        2\n\n        You can optionally specify how to handle NaN values\n\n        >>> median_count_skipna = MedianCount(skipna=False)\n        >>> median_count_skipna([1, 2, 3, 1, 5, 3, None])\n        nan\n    \"\"\"\n\n    name = \"median_count\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={\"numeric\"})\n    stack_on_self = False\n    default_value = 0\n\n    def __init__(self, skipna=True):\n        self.skipna = skipna\n\n    def get_function(self):\n        def median_count(x):\n            median = x.median(skipna=self.skipna)\n            if np.isnan(median):\n                return np.nan\n            return x.eq(median).sum()\n\n        return median_count\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/min_count.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import IntegerNullable\n\nfrom featuretools.primitives.base import AggregationPrimitive\n\n\nclass MinCount(AggregationPrimitive):\n    \"\"\"Calculates the number of occurrences of the min value in a list\n\n    Args:\n        skipna (bool): Determines if to use NA/null values. Defaults to\n            True to skip NA/null. If skipna is False, and there are NaN\n            values in the array, the min will be NaN regardless of\n            the other values, and NaN will be returned.\n\n    Examples:\n        >>> min_count = MinCount()\n        >>> min_count([1, 2, 5, 1, 5, 3, 5])\n        2\n\n        You can optionally specify how to handle NaN values\n\n        >>> min_count_skipna = MinCount(skipna=False)\n        >>> min_count_skipna([1, 2, 5, 1, 5, 3, None])\n        nan\n    \"\"\"\n\n    name = \"min_count\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={\"numeric\"})\n\n    def __init__(self, skipna=True):\n        self.skipna = skipna\n\n    def get_function(self):\n        def min_count(x):\n            xmin = x.min(skipna=self.skipna)\n            if np.isnan(xmin):\n                return np.nan\n            return x.eq(xmin).sum()\n\n        return min_count\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/min_primitive.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\n\nfrom featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive\n\n\nclass Min(AggregationPrimitive):\n    \"\"\"Calculates the smallest value, ignoring `NaN` values.\n\n    Examples:\n        >>> min = Min()\n        >>> min([1, 2, 3, 4, 5, None])\n        1.0\n    \"\"\"\n\n    name = \"min\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n    stack_on_self = False\n    description_template = \"the minimum of {}\"\n\n    def get_function(self):\n        return np.min\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/mode.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\n\nfrom featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive\n\n\nclass Mode(AggregationPrimitive):\n    \"\"\"Determines the most commonly repeated value.\n\n    Description:\n        Given a list of values, return the value with the\n        highest number of occurences. If list is\n        empty, return `NaN`.\n\n    Examples:\n        >>> mode = Mode()\n        >>> mode(['red', 'blue', 'green', 'blue'])\n        'blue'\n    \"\"\"\n\n    name = \"mode\"\n    input_types = [ColumnSchema(semantic_tags={\"category\"})]\n    return_type = None\n    description_template = \"the most frequently occurring value of {}\"\n\n    def get_function(self):\n        def pd_mode(s):\n            return s.mode().get(0, np.nan)\n\n        return pd_mode\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/n_most_common.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\n\nfrom featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive\n\n\nclass NMostCommon(AggregationPrimitive):\n    \"\"\"Determines the `n` most common elements.\n\n    Description:\n        Given a list of values, return the `n` values\n        which appear the most frequently. If there are\n        fewer than `n` unique values, the output will be\n        filled with `NaN`.\n\n    Args:\n        n (int): defines \"n\" in \"n most common.\" Defaults\n            to 3.\n\n    Examples:\n        >>> n_most_common = NMostCommon(n=2)\n        >>> x = ['orange', 'apple', 'orange', 'apple', 'orange', 'grapefruit']\n        >>> n_most_common(x).tolist()\n        ['orange', 'apple']\n    \"\"\"\n\n    name = \"n_most_common\"\n    input_types = [ColumnSchema(semantic_tags={\"category\"})]\n    return_type = None\n\n    def __init__(self, n=3):\n        self.n = n\n        self.number_output_features = n\n        self.description_template = [\n            \"the {} most common values of {{}}\".format(n),\n            \"the most common value of {}\",\n            *[\"the {nth_slice} most common value of {}\"] * (n - 1),\n        ]\n\n    def get_function(self):\n        def n_most_common(x):\n            # Counts of 0 remain in value_counts output if dtype is category\n            # so we need to remove them\n            counts = x.value_counts()\n            counts = counts[counts > 0]\n            array = np.array(counts.index[: self.n])\n            if len(array) < self.n:\n                filler = np.full(self.n - len(array), np.nan)\n                array = np.append(array, filler)\n            return array\n\n        return n_most_common\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/n_most_common_frequency.py",
    "content": "import numpy as np\nimport pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Categorical\n\nfrom featuretools.primitives.base import AggregationPrimitive\n\n\nclass NMostCommonFrequency(AggregationPrimitive):\n    \"\"\"Determines the frequency of the n most common items.\n\n    Args:\n        n (int): defines \"n\" in \"n most common\". Defaults to 3.\n        skipna (bool): Determines if to use NA/null values.\n            Defaults to True to skip NA/null.\n\n    Description:\n        Given a list, find the n most common items, and return a series\n        showing the frequency of each item. If the list has less than n unique\n        values, the resulting series will be padded with nan.\n\n    Examples:\n        >>> n_most_common_frequency = NMostCommonFrequency()\n        >>> n_most_common_frequency([1, 1, 1, 2, 2, 3, 4, 4]).to_list()\n        [3, 2, 2]\n\n        We can increase n to include more items.\n\n        >>> n_most_common_frequency = NMostCommonFrequency(4)\n        >>> n_most_common_frequency([1, 1, 1, 2, 2, 3, 4, 4]).to_list()\n        [3, 2, 2, 1]\n\n        NaNs are skipped by default.\n\n        >>> n_most_common_frequency = NMostCommonFrequency(3)\n        >>> n_most_common_frequency([1, 1, 1, 2, 2, 3, 4, 4, None, None, None]).to_list()\n        [3, 2, 2]\n\n        However, the way NaNs are treated can be controlled.\n\n        >>> n_most_common_frequency = NMostCommonFrequency(3, skipna=False)\n        >>> n_most_common_frequency([1, 1, 1, 2, 2, 3, 4, 4, None, None, None]).to_list()\n        [3, 3, 2]\n    \"\"\"\n\n    name = \"n_most_common_frequency\"\n    input_types = [ColumnSchema(semantic_tags={\"category\"})]\n    return_type = ColumnSchema(logical_type=Categorical, semantic_tags={\"category\"})\n\n    def __init__(self, n=3, skipna=True):\n        self.n = n\n        self.number_output_features = n\n        self.skipna = skipna\n\n    def get_function(self):\n        def 
n_most_common_frequency(data, n=self.n):\n            frequencies = data.value_counts(dropna=self.skipna)\n            n_most_common = frequencies.iloc[0:n]\n            nan_add = n - frequencies.shape[0]\n            if nan_add > 0:\n                n_most_common = pd.concat(\n                    [n_most_common, pd.Series([np.nan] * nan_add)],\n                )\n            return n_most_common\n\n        return n_most_common_frequency\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/n_unique_days.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime, Integer\n\nfrom featuretools.primitives.base import AggregationPrimitive\n\n\nclass NUniqueDays(AggregationPrimitive):\n    \"\"\"Determines the number of unique days.\n\n    Description:\n        Given a list of datetimes, return the number of unique days.\n        The same day in two different years is treated as different. So\n        Feb 21, 2017 is different than Feb 21, 2019, even though they are\n        both the 21st of February.\n\n    Examples:\n        >>> from datetime import datetime\n        >>> n_unique_days = NUniqueDays()\n        >>> times = [datetime(2019, 2, 1),\n        ...          datetime(2019, 2, 1),\n        ...          datetime(2018, 2, 1),\n        ...          datetime(2019, 1, 1)]\n        >>> n_unique_days(times)\n        3\n    \"\"\"\n\n    name = \"n_unique_days\"\n    input_types = [ColumnSchema(logical_type=Datetime)]\n    return_type = ColumnSchema(logical_type=Integer, semantic_tags={\"numeric\"})\n    stack_on_self = False\n    default_value = 0\n\n    def get_function(self):\n        def n_unique_days(x):\n            return x.dt.floor(\"D\").nunique()\n\n        return n_unique_days\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/n_unique_days_of_calendar_year.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime, Integer\n\nfrom featuretools.primitives.base import AggregationPrimitive\n\n\nclass NUniqueDaysOfCalendarYear(AggregationPrimitive):\n    \"\"\"Determines the number of unique calendar days.\n\n    Description:\n        Given a list of datetimes, return the number of unique calendar\n        days. The same date in two different years is counted as one. So\n        Feb 21, 2017 is not unique from Feb 21, 2019.\n\n    Examples:\n        >>> from datetime import datetime\n        >>> n_unique_days_of_calendar_year = NUniqueDaysOfCalendarYear()\n        >>> times = [datetime(2019, 2, 1),\n        ...          datetime(2019, 2, 1),\n        ...          datetime(2018, 2, 1),\n        ...          datetime(2019, 1, 1)]\n        >>> n_unique_days_of_calendar_year(times)\n        2\n    \"\"\"\n\n    name = \"n_unique_days_of_calendar_year\"\n    input_types = [ColumnSchema(logical_type=Datetime)]\n    return_type = ColumnSchema(logical_type=Integer, semantic_tags={\"numeric\"})\n    stack_on_self = False\n    default_value = 0\n\n    def get_function(self):\n        def n_unique_days_of_calendar_year(x):\n            return x.dropna().dt.strftime(\"%m-%d\").nunique()\n\n        return n_unique_days_of_calendar_year\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/n_unique_days_of_month.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime, Integer\n\nfrom featuretools.primitives.base import AggregationPrimitive\n\n\nclass NUniqueDaysOfMonth(AggregationPrimitive):\n    \"\"\"Determines the number of unique days of month.\n\n    Description:\n        Given a list of datetimes, return the number of unique days\n        of month. The maximum value is 31. 2018-01-01 and 2018-02-01\n        will be counted as 1 unique day. 2019-01-01 and 2018-01-01\n        will also be counted as 1.\n\n    Examples:\n        >>> from datetime import datetime\n        >>> n_unique_days_of_month = NUniqueDaysOfMonth()\n        >>> times = [datetime(2019, 1, 1),\n        ...          datetime(2019, 2, 1),\n        ...          datetime(2018, 2, 1),\n        ...          datetime(2019, 1, 2),\n        ...          datetime(2019, 1, 3)]\n        >>> n_unique_days_of_month(times)\n        3\n    \"\"\"\n\n    name = \"n_unique_days_of_month\"\n    input_types = [ColumnSchema(logical_type=Datetime)]\n    return_type = ColumnSchema(logical_type=Integer, semantic_tags={\"numeric\"})\n    stack_on_self = False\n    default_value = 0\n\n    def get_function(self):\n        def n_unique_days_of_month(x):\n            return x.dropna().dt.day.nunique()\n\n        return n_unique_days_of_month\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/n_unique_months.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime, Integer\n\nfrom featuretools.primitives.base import AggregationPrimitive\n\n\nclass NUniqueMonths(AggregationPrimitive):\n    \"\"\"Determines the number of unique months.\n\n    Description:\n        Given a list of datetimes, return the number of unique months.\n        NUniqueMonths counts absolute month, not month of year, so the\n        same month in two different years is treated as different. (i.e.\n        Feb 2017 is different than Feb 2019.)\n\n    Examples:\n        >>> from datetime import datetime\n        >>> n_unique_months = NUniqueMonths()\n        >>> times = [datetime(2019, 1, 1),\n        ...          datetime(2019, 1, 2),\n        ...          datetime(2019, 1, 3),\n        ...          datetime(2019, 2, 1),\n        ...          datetime(2018, 2, 1)]\n        >>> n_unique_months(times)\n        3\n    \"\"\"\n\n    name = \"n_unique_months\"\n    input_types = [ColumnSchema(logical_type=Datetime)]\n    return_type = ColumnSchema(logical_type=Integer, semantic_tags={\"numeric\"})\n    stack_on_self = False\n    default_value = 0\n\n    def get_function(self):\n        def n_unique_months(x):\n            return x.dt.to_period(\"M\").nunique()\n\n        return n_unique_months\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/n_unique_weeks.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime, Integer\n\nfrom featuretools.primitives.base import AggregationPrimitive\n\n\nclass NUniqueWeeks(AggregationPrimitive):\n    \"\"\"Determines the number of unique weeks.\n\n    Description:\n        Given a list of datetimes, return the number of unique\n        weeks (Monday-Sunday). NUniqueWeeks counts by absolute\n        week, not week of year, so the first week of 2018 and\n        the first week of 2019 count as two unique values.\n\n    Examples:\n        >>> from datetime import datetime\n        >>> n_unique_weeks = NUniqueWeeks()\n        >>> times = [datetime(2018, 2, 2),\n        ...          datetime(2019, 1, 1),\n        ...          datetime(2019, 2, 1),\n        ...          datetime(2019, 2, 1),\n        ...          datetime(2019, 2, 3),\n        ...          datetime(2019, 2, 21)]\n        >>> n_unique_weeks(times)\n        4\n    \"\"\"\n\n    name = \"n_unique_weeks\"\n    input_types = [ColumnSchema(logical_type=Datetime)]\n    return_type = ColumnSchema(logical_type=Integer, semantic_tags={\"numeric\"})\n    stack_on_self = False\n    default_value = 0\n\n    def get_function(self):\n        def n_unique_weeks(x):\n            return x.dt.to_period(\"W\").nunique()\n\n        return n_unique_weeks\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/num_consecutive_greater_mean.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import IntegerNullable\n\nfrom featuretools.primitives.base import AggregationPrimitive\n\n\nclass NumConsecutiveGreaterMean(AggregationPrimitive):\n    \"\"\"Determines the length of the longest subsequence above the mean.\n\n    Description:\n        Given a list of numbers, find the longest subsequence of numbers\n        larger than the mean of the entire sequence. Return the length\n        of the longest subsequence.\n\n    Args:\n        skipna (bool): If this is False and any value in x is `NaN`, then\n            the result will be `NaN`. If True, `NaN` values are skipped.\n            Default is True.\n\n    Examples:\n        >>> num_consecutive_greater_mean = NumConsecutiveGreaterMean()\n        >>> num_consecutive_greater_mean([1, 2, 3, 4, 5, 6])\n        3.0\n\n        We can also control the way `NaN` values are handled.\n\n        >>> num_consecutive_greater_mean = NumConsecutiveGreaterMean(skipna=False)\n        >>> num_consecutive_greater_mean([1, 2, 3, 4, 5, 6, None])\n        nan\n    \"\"\"\n\n    name = \"num_consecutive_greater_mean\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={\"numeric\"})\n    stack_on_self = False\n    default_value = 0\n\n    def __init__(self, skipna=True):\n        self.skipna = skipna\n\n    def get_function(self):\n        def num_consecutive_greater_mean(x):\n            # check for NaN cases\n            if x.isnull().all():\n                return np.nan\n            if not self.skipna and x.isnull().values.any():\n                return np.nan\n            x_mean = x.mean()\n\n            # In some cases, the mean of x may be NaN\n            #   (such as when x has both inf and -inf values)\n            if np.isnan(x.mean()):\n                return np.nan\n\n            # Find indices of points at or below 
mean\n            x = x.dropna().reset_index(drop=True)\n            below_mean_indices = x[x <= x_mean].index.to_series()\n\n            # If none of x is below the mean, return the length of x\n            if below_mean_indices.empty:\n                return len(x)\n\n            # Pad index with start/end values, in case the longest\n            #   sequence occurs at the beginning or end of x\n            below_mean_indices[-1] = -1\n            below_mean_indices[len(x)] = len(x)\n            below_mean_indices = below_mean_indices.sort_index()\n\n            # Calculate gaps between points below mean\n            below_mean_indices_shifted = below_mean_indices.shift(1)\n            diffs = below_mean_indices - below_mean_indices_shifted\n\n            # Take biggest gap, and subtract 1 to get result\n            max_gap = (diffs).max() - 1\n            return max_gap\n\n        return num_consecutive_greater_mean\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/num_consecutive_less_mean.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import IntegerNullable\n\nfrom featuretools.primitives.base import AggregationPrimitive\n\n\nclass NumConsecutiveLessMean(AggregationPrimitive):\n    \"\"\"Determines the length of the longest subsequence below the mean.\n\n    Description:\n        Given a list of numbers, find the longest subsequence of numbers\n        smaller than the mean of the entire sequence. Return the length\n        of the longest subsequence.\n\n    Args:\n        skipna (bool): If this is False and any value in x is `NaN`, then\n            the result will be `NaN`. If True, `NaN` values are skipped.\n            Default is True.\n\n    Examples:\n        >>> num_consecutive_less_mean = NumConsecutiveLessMean()\n        >>> num_consecutive_less_mean([1, 2, 3, 4, 5, 6])\n        3.0\n\n        We can also control the way `NaN` values are handled.\n\n        >>> num_consecutive_less_mean = NumConsecutiveLessMean(skipna=False)\n        >>> num_consecutive_less_mean([1, 2, 3, 4, 5, 6, None])\n        nan\n    \"\"\"\n\n    name = \"num_consecutive_less_mean\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={\"numeric\"})\n    stack_on_self = False\n    default_value = 0\n\n    def __init__(self, skipna=True):\n        self.skipna = skipna\n\n    def get_function(self):\n        def num_consecutive_less_mean(x):\n            # check for NaN cases\n            if x.isnull().all():\n                return np.nan\n            if not self.skipna and x.isnull().values.any():\n                return np.nan\n            x_mean = x.mean()\n\n            # In some cases, the mean of x may be NaN\n            #   (such as when x has both inf and -inf values)\n            if np.isnan(x.mean()):\n                return np.nan\n\n            # Find indices of points at or above mean\n            x = 
x.dropna().reset_index(drop=True)\n            above_mean_indices = x[x >= x_mean].index.to_series()\n\n            # If none of x is above the mean, return the length of x\n            if above_mean_indices.empty:\n                return len(x)\n\n            # Pad index with start/end values, in case the longest\n            #   sequence occurs at the beginning or end of x\n            above_mean_indices[-1] = -1\n            above_mean_indices[len(x)] = len(x)\n            above_mean_indices = above_mean_indices.sort_index()\n\n            # Calculate gaps between points above mean\n            above_mean_indices_shifted = above_mean_indices.shift(1)\n            diffs = above_mean_indices - above_mean_indices_shifted\n\n            # Take biggest gap, and subtract 1 to get result\n            max_gap = (diffs).max() - 1\n            return max_gap\n\n        return num_consecutive_less_mean\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/num_false_since_last_true.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Boolean, IntegerNullable\n\nfrom featuretools.primitives.base import AggregationPrimitive\n\n\nclass NumFalseSinceLastTrue(AggregationPrimitive):\n    \"\"\"Calculates the number of `False` values since the last `True` value.\n\n    Description:\n        From a series of Booleans, find the last record with a `True` value.\n        Return the count of `False` values between that record and the end of\n        the series. Return nan if no values are `True`. Any nan values in the\n        input are ignored. A `True` value in the last row will result in a\n        count of 0.  Inputs are converted too booleans before calculating\n        the result.\n\n    Examples:\n        >>> num_false_since_last_true = NumFalseSinceLastTrue()\n        >>> num_false_since_last_true([True, False, True, False, False])\n        2\n    \"\"\"\n\n    name = \"num_false_since_last_true\"\n    input_types = [ColumnSchema(logical_type=Boolean)]\n    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={\"numeric\"})\n    stack_on_self = False\n    default_value = 0\n\n    def get_function(self):\n        def num_false_since_last_true(x):\n            if x.empty:\n                return np.nan\n            x = x.dropna().astype(bool)\n            true_indices = x[x]\n            if true_indices.empty:\n                return np.nan\n            last_true_index = true_indices.index[-1]\n            x_slice = x.loc[last_true_index:]\n            return np.invert(x_slice).sum()\n\n        return num_false_since_last_true\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/num_peaks.py",
    "content": "import pandas as pd\nfrom scipy.signal import find_peaks\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Integer\n\nfrom featuretools.primitives.base import AggregationPrimitive\n\n\nclass NumPeaks(AggregationPrimitive):\n    \"\"\"Determines the number of peaks in a list of numbers.\n\n    Description:\n        Given a list of numbers, count the number of local\n        maxima. Uses the find_peaks function from scipy.signal.\n\n    Examples:\n        >>> num_peaks = NumPeaks()\n        >>> num_peaks([-5, 0, 10, 0, 10, -5, -4, -5, 10, 0])\n        4\n    \"\"\"\n\n    name = \"num_peaks\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(logical_type=Integer, semantic_tags={\"numeric\"})\n    stack_on_self = False\n    default_value = 0\n\n    def get_function(self):\n        def num_peaks(x):\n            if x.dtype == \"Int64\":\n                x = x.astype(\"float64\")\n            peaks = find_peaks(x)[0]\n            return len(peaks[~pd.isna(peaks)])\n\n        return num_peaks\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/num_true.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Boolean, BooleanNullable, IntegerNullable\n\nfrom featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive\n\n\nclass NumTrue(AggregationPrimitive):\n    \"\"\"Counts the number of `True` values.\n\n    Description:\n        Given a list of booleans, return the number\n        of `True` values. Ignores 'NaN'.\n\n    Examples:\n        >>> num_true = NumTrue()\n        >>> num_true([True, False, True, True, None])\n        3\n    \"\"\"\n\n    name = \"num_true\"\n    input_types = [\n        [ColumnSchema(logical_type=Boolean)],\n        [ColumnSchema(logical_type=BooleanNullable)],\n    ]\n    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={\"numeric\"})\n    default_value = 0\n    stack_on = []\n    stack_on_exclude = []\n    description_template = \"the number of times {} is true\"\n\n    def get_function(self):\n        return np.sum\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/num_true_since_last_false.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Boolean, IntegerNullable\n\nfrom featuretools.primitives.base import AggregationPrimitive\n\n\nclass NumTrueSinceLastFalse(AggregationPrimitive):\n    \"\"\"Calculates the number of `True` values since the last `False` value.\n\n    Description:\n        From a series of Booleans, find the last record with a `False` value.\n        Return the count of `True` values between that record and the end of\n        the series. Return nan if no values are `False`. Any nan values in the\n        input are ignored. A `False` value in the last row will result in a\n        count of 0.\n\n    Examples:\n        >>> num_true_since_last_false = NumTrueSinceLastFalse()\n        >>> num_true_since_last_false([False, True, False, True, True])\n        2\n    \"\"\"\n\n    name = \"num_true_since_last_false\"\n    input_types = [ColumnSchema(logical_type=Boolean)]\n    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={\"numeric\"})\n    stack_on_self = False\n    default_value = 0\n\n    def get_function(self):\n        def num_true_since_last_false(x):\n            x = x.dropna().astype(bool)\n            false_indices = x[~x]\n            if false_indices.empty:\n                return np.nan\n            last_false_index = false_indices.index[-1]\n            x_slice = x.loc[last_false_index:]\n            return x_slice.sum()\n\n        return num_true_since_last_false\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/num_unique.py",
    "content": "import pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import IntegerNullable\n\nfrom featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive\n\n\nclass NumUnique(AggregationPrimitive):\n    \"\"\"Determines the number of distinct values, ignoring `NaN` values.\n\n    Args:\n        use_string_for_pd_calc (bool): Determines if the string 'nunique' or the function\n            pd.Series.nunique is used for making the primitive calculation. Put in place to\n            account for the bug https://github.com/pandas-dev/pandas/issues/57317.\n            Defaults to using the string.\n\n    Examples:\n        >>> num_unique = NumUnique(use_string_for_pd_calc=False)\n        >>> num_unique(['red', 'blue', 'green', 'yellow'])\n        4\n\n        `NaN` values will be ignored.\n\n        >>> num_unique(['red', 'blue', 'green', 'yellow', None])\n        4\n    \"\"\"\n\n    name = \"num_unique\"\n    input_types = [ColumnSchema(semantic_tags={\"category\"})]\n    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={\"numeric\"})\n    stack_on_self = False\n    description_template = \"the number of unique elements in {}\"\n\n    def __init__(self, use_string_for_pd_calc=True):\n        self.use_string_for_pd_calc = use_string_for_pd_calc\n\n    def get_function(self):\n        if self.use_string_for_pd_calc:\n            return \"nunique\"\n        return pd.Series.nunique\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/num_zero_crossings.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Integer\n\nfrom featuretools.primitives.base import AggregationPrimitive\n\n\nclass NumZeroCrossings(AggregationPrimitive):\n    \"\"\"Determines the number of times a list crosses 0.\n\n    Description:\n        Given a list of numbers, return the number of times the value\n        crosses 0. It is the number of times the value goes from a\n        positive number to a negative number, or a negative number to\n        a positive number. NaN values are ignored.\n\n    Examples:\n        >>> num_zero_crossings = NumZeroCrossings()\n        >>> num_zero_crossings([1, -1, 2, -2, 3, -3])\n        5\n    \"\"\"\n\n    name = \"num_zero_crossings\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(logical_type=Integer, semantic_tags={\"numeric\"})\n\n    def get_function(self):\n        def num_zero_crossings(x):\n            cleaned = x[(x != 0) & (x == x)]\n            signs = np.sign(cleaned)\n            difference = np.diff(signs)\n            crossings = np.where(difference)[0]\n            return len(crossings)\n\n        return num_zero_crossings\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/percent_true.py",
    "content": "import pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Boolean, BooleanNullable, Double\n\nfrom featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive\n\n\nclass PercentTrue(AggregationPrimitive):\n    \"\"\"Determines the percent of `True` values.\n\n    Description:\n        Given a list of booleans, return the percent\n        of values which are `True` as a decimal.\n        `NaN` values are treated as `False`,\n        adding to the denominator.\n\n    Examples:\n        >>> percent_true = PercentTrue()\n        >>> percent_true([True, False, True, True, None])\n        0.6\n    \"\"\"\n\n    name = \"percent_true\"\n    input_types = [\n        [ColumnSchema(logical_type=BooleanNullable)],\n        [ColumnSchema(logical_type=Boolean)],\n    ]\n    return_type = ColumnSchema(logical_type=Double, semantic_tags={\"numeric\"})\n    stack_on = []\n    stack_on_exclude = []\n    default_value = pd.NA\n    description_template = \"the percentage of true values in {}\"\n\n    def get_function(self):\n        def percent_true(s):\n            return s.fillna(False).mean()\n\n        return percent_true\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/percent_unique.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Double\n\nfrom featuretools.primitives.base import AggregationPrimitive\n\n\nclass PercentUnique(AggregationPrimitive):\n    \"\"\"Determines the percent of unique values.\n\n    Description:\n        Given a list of values, determine what percent of the\n        list is made up of unique values.  Multiple `NaN` values\n        are treated as one unique value.\n\n    Args:\n        skipna (bool): Determines whether to ignore `NaN` values.\n            Defaults to True.\n\n    Examples:\n        >>> percent_unique = PercentUnique()\n        >>> percent_unique([1, 1, 2, 2, 3, 4, 5, 6, 7, 8])\n        0.8\n\n        We can control whether or not `NaN` values are ignored.\n\n        >>> percent_unique = PercentUnique()\n        >>> percent_unique([1, 1, 2, None])\n        0.5\n        >>> percent_unique_skipna = PercentUnique(skipna=False)\n        >>> percent_unique_skipna([1, 1, 2, None])\n        0.75\n    \"\"\"\n\n    name = \"percent_unique\"\n    input_types = [ColumnSchema(semantic_tags={\"category\"})]\n    return_type = ColumnSchema(logical_type=Double, semantic_tags={\"numeric\"})\n    default_value = 0\n\n    def __init__(self, skipna=True):\n        self.skipna = skipna\n\n    def get_function(self):\n        def percent_unique(x):\n            return x.nunique(dropna=self.skipna) / (x.shape[0] * 1.0)\n\n        return percent_unique\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/skew.py",
    "content": "import pandas as pd\nfrom woodwork.column_schema import ColumnSchema\n\nfrom featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive\n\n\nclass Skew(AggregationPrimitive):\n    \"\"\"Computes the extent to which a distribution differs from a normal distribution.\n\n    Description:\n        For normally distributed data, the skewness should be about 0.\n        A skewness value > 0 means that there is more weight in the\n        left tail of the distribution.\n\n    Examples:\n        >>> skew = Skew()\n        >>> skew([1, 10, 30, None])\n        1.0437603722639681\n    \"\"\"\n\n    name = \"skew\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n    stack_on = []\n    stack_on_self = False\n    description_template = \"the skewness of {}\"\n\n    def get_function(self):\n        return pd.Series.skew\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/std.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\n\nfrom featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive\n\n\nclass Std(AggregationPrimitive):\n    \"\"\"Computes the dispersion relative to the mean value, ignoring `NaN`.\n\n    Examples:\n        >>> std = Std()\n        >>> round(std([1, 2, 3, 4, 5, None]), 3)\n        1.414\n    \"\"\"\n\n    name = \"std\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n    stack_on_self = False\n    description_template = \"the standard deviation of {}\"\n\n    def get_function(self):\n        return np.std\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/sum_primitive.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\n\nfrom featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive\nfrom featuretools.primitives.standard.aggregation.count import Count\n\n\nclass Sum(AggregationPrimitive):\n    \"\"\"Calculates the total addition, ignoring `NaN`.\n\n    Examples:\n        >>> sum = Sum()\n        >>> sum([1, 2, 3, 4, 5, None])\n        15.0\n    \"\"\"\n\n    name = \"sum\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n    stack_on_self = False\n    stack_on_exclude = [Count]\n    default_value = 0\n    description_template = \"the sum of {}\"\n\n    def get_function(self):\n        return np.sum\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/time_since_first.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime, Double\n\nfrom featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive\nfrom featuretools.utils import convert_time_units\n\n\nclass TimeSinceFirst(AggregationPrimitive):\n    \"\"\"Calculates the time elapsed since the first datetime (in seconds).\n\n    Description:\n        Given a list of datetimes, calculate the\n        time elapsed since the first datetime (in\n        seconds). Uses the instance's cutoff time.\n\n    Args:\n        unit (str): Defines the unit of time to count from.\n            Defaults to seconds. Acceptable values:\n            years, months, days, hours, minutes, seconds, milliseconds, nanoseconds\n\n    Examples:\n        >>> from datetime import datetime\n        >>> time_since_first = TimeSinceFirst()\n        >>> cutoff_time = datetime(2010, 1, 1, 12, 0, 0)\n        >>> times = [datetime(2010, 1, 1, 11, 45, 0),\n        ...          datetime(2010, 1, 1, 11, 55, 15),\n        ...          datetime(2010, 1, 1, 11, 57, 30)]\n        >>> time_since_first(times, time=cutoff_time)\n        900.0\n\n        >>> from datetime import datetime\n        >>> time_since_first = TimeSinceFirst(unit = \"minutes\")\n        >>> cutoff_time = datetime(2010, 1, 1, 12, 0, 0)\n        >>> times = [datetime(2010, 1, 1, 11, 45, 0),\n        ...          datetime(2010, 1, 1, 11, 55, 15),\n        ...          
datetime(2010, 1, 1, 11, 57, 30)]\n        >>> time_since_first(times, time=cutoff_time)\n        15.0\n\n    \"\"\"\n\n    name = \"time_since_first\"\n    input_types = [ColumnSchema(logical_type=Datetime, semantic_tags={\"time_index\"})]\n    return_type = ColumnSchema(logical_type=Double, semantic_tags={\"numeric\"})\n    uses_calc_time = True\n    description_template = \"the time since the first {}\"\n\n    def __init__(self, unit=\"seconds\"):\n        self.unit = unit.lower()\n\n    def get_function(self):\n        def time_since_first(values, time=None):\n            time_since = time - values.iloc[0]\n            return convert_time_units(time_since.total_seconds(), self.unit)\n\n        return time_since_first\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/time_since_last.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime, Double\n\nfrom featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive\nfrom featuretools.utils import convert_time_units\n\n\nclass TimeSinceLast(AggregationPrimitive):\n    \"\"\"Calculates the time elapsed since the last datetime (default in seconds).\n\n    Description:\n        Given a list of datetimes, calculate the\n        time elapsed since the last datetime (default in\n        seconds). Uses the instance's cutoff time.\n\n    Args:\n        unit (str): Defines the unit of time to count from.\n            Defaults to seconds. Acceptable values:\n            years, months, days, hours, minutes, seconds, milliseconds, nanoseconds\n\n    Examples:\n        >>> from datetime import datetime\n        >>> time_since_last = TimeSinceLast()\n        >>> cutoff_time = datetime(2010, 1, 1, 12, 0, 0)\n        >>> times = [datetime(2010, 1, 1, 11, 45, 0),\n        ...          datetime(2010, 1, 1, 11, 55, 15),\n        ...          datetime(2010, 1, 1, 11, 57, 30)]\n        >>> time_since_last(times, time=cutoff_time)\n        150.0\n\n        >>> from datetime import datetime\n        >>> time_since_last = TimeSinceLast(unit = \"minutes\")\n        >>> cutoff_time = datetime(2010, 1, 1, 12, 0, 0)\n        >>> times = [datetime(2010, 1, 1, 11, 45, 0),\n        ...          datetime(2010, 1, 1, 11, 55, 15),\n        ...          
datetime(2010, 1, 1, 11, 57, 30)]\n        >>> time_since_last(times, time=cutoff_time)\n        2.5\n\n    \"\"\"\n\n    name = \"time_since_last\"\n    input_types = [ColumnSchema(logical_type=Datetime, semantic_tags={\"time_index\"})]\n    return_type = ColumnSchema(logical_type=Double, semantic_tags={\"numeric\"})\n    uses_calc_time = True\n    description_template = \"the time since the last {}\"\n\n    def __init__(self, unit=\"seconds\"):\n        self.unit = unit.lower()\n\n    def get_function(self):\n        def time_since_last(values, time=None):\n            time_since = time - values.iloc[-1]\n            return convert_time_units(time_since.total_seconds(), self.unit)\n\n        return time_since_last\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/time_since_last_false.py",
    "content": "import numpy as np\nimport pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Boolean, BooleanNullable, Datetime, Double\n\nfrom featuretools.primitives.base import AggregationPrimitive\n\n\nclass TimeSinceLastFalse(AggregationPrimitive):\n    \"\"\"Calculates the time since the last `False` value.\n\n    Description:\n        Using a series of Datetimes and a series of Booleans, find the last\n        record with a `False` value. Return the seconds elapsed between that record\n        and the instance's cutoff time. Return nan if no values are `False`.\n\n    Examples:\n        >>> from datetime import datetime\n        >>> time_since_last_false = TimeSinceLastFalse()\n        >>> cutoff_time = datetime(2010, 1, 1, 12, 0, 0)\n        >>> times = [datetime(2010, 1, 1, 11, 45, 0),\n        ...          datetime(2010, 1, 1, 11, 55, 15),\n        ...          datetime(2010, 1, 1, 11, 57, 30)]\n        >>> booleans = [True, False, True]\n        >>> time_since_last_false(times, booleans, time=cutoff_time)\n        285.0\n    \"\"\"\n\n    name = \"time_since_last_false\"\n    input_types = [\n        [\n            ColumnSchema(logical_type=Datetime, semantic_tags={\"time_index\"}),\n            ColumnSchema(logical_type=Boolean),\n        ],\n        [\n            ColumnSchema(logical_type=Datetime, semantic_tags={\"time_index\"}),\n            ColumnSchema(logical_type=BooleanNullable),\n        ],\n    ]\n    return_type = ColumnSchema(logical_type=Double, semantic_tags={\"numeric\"})\n    uses_calc_time = True\n    stack_on_self = False\n    default_value = 0\n\n    def get_function(self):\n        def time_since_last_false(datetime_col, bool_col, time=None):\n            df = pd.DataFrame(\n                {\n                    \"datetime\": datetime_col,\n                    \"bool\": bool_col,\n                },\n            ).dropna()\n            if df.empty:\n                return np.nan\n     
       false_indices = df[~df[\"bool\"]]\n            if false_indices.empty:\n                return np.nan\n            last_false_index = false_indices.index[-1]\n            time_since = time - datetime_col.loc[last_false_index]\n            return time_since.total_seconds()\n\n        return time_since_last_false\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/time_since_last_max.py",
    "content": "import numpy as np\nimport pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime, Double\n\nfrom featuretools.primitives.base import AggregationPrimitive\n\n\nclass TimeSinceLastMax(AggregationPrimitive):\n    \"\"\"Calculates the time since the maximum value occurred.\n\n    Description:\n        Given a list of numbers, and a corresponding index of\n        datetimes, find the time of the maximum value, and return\n        the time elapsed since it occured. This calculation is done\n        using an instance id's cutoff time.\n\n        If multiple values equal the maximum, use the first occuring\n        maximum.\n\n    Examples:\n        >>> from datetime import datetime\n        >>> time_since_last_max = TimeSinceLastMax()\n        >>> cutoff_time = datetime(2010, 1, 1, 12, 0, 0)\n        >>> times = [datetime(2010, 1, 1, 11, 45, 0),\n        ...          datetime(2010, 1, 1, 11, 55, 15),\n        ...          datetime(2010, 1, 1, 11, 57, 30)]\n        >>> time_since_last_max(times, [1, 3, 2], time=cutoff_time)\n        285.0\n    \"\"\"\n\n    name = \"time_since_last_max\"\n    input_types = [\n        ColumnSchema(logical_type=Datetime, semantic_tags={\"time_index\"}),\n        ColumnSchema(semantic_tags={\"numeric\"}),\n    ]\n    return_type = ColumnSchema(logical_type=Double, semantic_tags={\"numeric\"})\n    uses_calc_time = True\n    stack_on_self = False\n    default_value = 0\n\n    def get_function(self):\n        def time_since_last_max(datetime_col, numeric_col, time=None):\n            df = pd.DataFrame(\n                {\n                    \"datetime\": datetime_col,\n                    \"numeric\": numeric_col,\n                },\n            ).dropna()\n            if df.empty:\n                return np.nan\n            max_row = df.loc[df[\"numeric\"].idxmax()]\n            max_time = max_row[\"datetime\"]\n            time_since = time - max_time\n            return 
time_since.total_seconds()\n\n        return time_since_last_max\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/time_since_last_min.py",
    "content": "import numpy as np\nimport pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime, Double\n\nfrom featuretools.primitives.base import AggregationPrimitive\n\n\nclass TimeSinceLastMin(AggregationPrimitive):\n    \"\"\"Calculates the time since the minimum value occurred.\n\n    Description:\n        Given a list of numbers, and a corresponding index of\n        datetimes, find the time of the minimum value, and return\n        the time elapsed since it occured. This calculation is done\n        using an instance id's cutoff time.\n\n        If multiple values equal the minimum, use the first occuring\n        minimum.\n\n    Examples:\n        >>> from datetime import datetime\n        >>> time_since_last_min = TimeSinceLastMin()\n        >>> cutoff_time = datetime(2010, 1, 1, 12, 0, 0)\n        >>> times = [datetime(2010, 1, 1, 11, 45, 0),\n        ...          datetime(2010, 1, 1, 11, 55, 15),\n        ...          datetime(2010, 1, 1, 11, 57, 30)]\n        >>> time_since_last_min(times, [1, 3, 2], time=cutoff_time)\n        900.0\n    \"\"\"\n\n    name = \"time_since_last_min\"\n    input_types = [\n        ColumnSchema(logical_type=Datetime, semantic_tags={\"time_index\"}),\n        ColumnSchema(semantic_tags={\"numeric\"}),\n    ]\n    return_type = ColumnSchema(logical_type=Double, semantic_tags={\"numeric\"})\n    uses_calc_time = True\n    stack_on_self = False\n    default_value = 0\n\n    def get_function(self):\n        def time_since_last_min(datetime_col, numeric_col, time=None):\n            df = pd.DataFrame(\n                {\n                    \"datetime\": datetime_col,\n                    \"numeric\": numeric_col,\n                },\n            ).dropna()\n            if df.empty:\n                return np.nan\n            min_row = df.loc[df[\"numeric\"].idxmin()]\n            min_time = min_row[\"datetime\"]\n            time_since = time - min_time\n            return 
time_since.total_seconds()\n\n        return time_since_last_min\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/time_since_last_true.py",
    "content": "import numpy as np\nimport pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Boolean, BooleanNullable, Datetime, Double\n\nfrom featuretools.primitives.base import AggregationPrimitive\n\n\nclass TimeSinceLastTrue(AggregationPrimitive):\n    \"\"\"Calculates the time since the last `True` value.\n\n    Description:\n        Using a series of Datetimes and a series of Booleans, find the last\n        record with a `True` value. Return the seconds elapsed between that record\n        and the instance's cutoff time. Return nan if no values are `True`.\n\n    Examples:\n        >>> from datetime import datetime\n        >>> time_since_last_true = TimeSinceLastTrue()\n        >>> cutoff_time = datetime(2010, 1, 1, 12, 0, 0)\n        >>> times = [datetime(2010, 1, 1, 11, 45, 0),\n        ...          datetime(2010, 1, 1, 11, 55, 15),\n        ...          datetime(2010, 1, 1, 11, 57, 30)]\n        >>> booleans = [True, True, False]\n        >>> time_since_last_true(times, booleans, time=cutoff_time)\n        285.0\n    \"\"\"\n\n    name = \"time_since_last_true\"\n    input_types = [\n        [\n            ColumnSchema(logical_type=Datetime, semantic_tags={\"time_index\"}),\n            ColumnSchema(logical_type=Boolean),\n        ],\n        [\n            ColumnSchema(logical_type=Datetime, semantic_tags={\"time_index\"}),\n            ColumnSchema(logical_type=BooleanNullable),\n        ],\n    ]\n    return_type = ColumnSchema(logical_type=Double, semantic_tags={\"numeric\"})\n    uses_calc_time = True\n    stack_on_self = False\n    default_value = 0\n\n    def get_function(self):\n        def time_since_last_true(datetime_col, bool_col, time=None):\n            df = pd.DataFrame(\n                {\n                    \"datetime\": datetime_col,\n                    \"bool\": bool_col,\n                },\n            ).dropna()\n            if df.empty:\n                return np.nan\n            
true_indices = df[df[\"bool\"]]\n            if true_indices.empty:\n                return np.nan\n            last_true_index = true_indices.index[-1]\n            time_since = time - datetime_col.loc[last_true_index]\n            return time_since.total_seconds()\n\n        return time_since_last_true\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/trend.py",
    "content": "import pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime\n\nfrom featuretools.primitives.base.aggregation_primitive_base import AggregationPrimitive\nfrom featuretools.utils import calculate_trend\n\n\nclass Trend(AggregationPrimitive):\n    \"\"\"Calculates the trend of a column over time.\n\n    Description:\n        Given a list of values and a corresponding list of\n        datetimes, calculate the slope of the linear trend\n        of values.\n\n    Examples:\n        >>> from datetime import datetime\n        >>> trend = Trend()\n        >>> times = [datetime(2010, 1, 1, 11, 45, 0),\n        ...          datetime(2010, 1, 1, 11, 55, 15),\n        ...          datetime(2010, 1, 1, 11, 57, 30),\n        ...          datetime(2010, 1, 1, 11, 12),\n        ...          datetime(2010, 1, 1, 11, 12, 15)]\n        >>> round(trend([1, 2, 3, 4, 5], times), 3)\n        -0.053\n    \"\"\"\n\n    name = \"trend\"\n    input_types = [\n        ColumnSchema(semantic_tags={\"numeric\"}),\n        ColumnSchema(logical_type=Datetime, semantic_tags={\"time_index\"}),\n    ]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n    description_template = \"the linear trend of {} over time\"\n\n    def get_function(self):\n        def pd_trend(y, x):\n            return calculate_trend(pd.Series(data=y.values, index=x.values))\n\n        return pd_trend\n"
  },
  {
    "path": "featuretools/primitives/standard/aggregation/variance.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Double\n\nfrom featuretools.primitives.base import AggregationPrimitive\n\n\nclass Variance(AggregationPrimitive):\n    \"\"\"Calculates the variance of a list of numbers.\n\n    Description:\n        Given a list of numbers, return the variance,\n        using numpy's built-in variance function. Nan\n        values in a series will be ignored. Return nan\n        when the series is empty or entirely null.\n\n    Examples:\n        >>> variance = Variance()\n        >>> variance([0, 3, 4, 3])\n        2.25\n\n        Null values in a series will be ignored.\n\n        >>> variance = Variance()\n        >>> variance([0, 3, 4, 3, None])\n        2.25\n    \"\"\"\n\n    name = \"variance\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(logical_type=Double, semantic_tags={\"numeric\"})\n    stack_on_self = False\n    default_value = np.nan\n\n    def get_function(self):\n        return np.var\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/__init__.py",
    "content": "# flake8: noqa\nfrom featuretools.primitives.standard.transform.absolute_diff import AbsoluteDiff\nfrom featuretools.primitives.standard.transform.binary import *\nfrom featuretools.primitives.standard.transform.cumulative import *\nfrom featuretools.primitives.standard.transform.datetime import *\nfrom featuretools.primitives.standard.transform.email import *\nfrom featuretools.primitives.standard.transform.exponential import *\nfrom featuretools.primitives.standard.transform.file_extension import FileExtension\nfrom featuretools.primitives.standard.transform.full_name_to_first_name import (\n    FullNameToFirstName,\n)\nfrom featuretools.primitives.standard.transform.full_name_to_last_name import (\n    FullNameToLastName,\n)\nfrom featuretools.primitives.standard.transform.full_name_to_title import (\n    FullNameToTitle,\n)\nfrom featuretools.primitives.standard.transform.nth_week_of_month import NthWeekOfMonth\nfrom featuretools.primitives.standard.transform.is_in import IsIn\nfrom featuretools.primitives.standard.transform.is_null import IsNull\nfrom featuretools.primitives.standard.transform.latlong import *\nfrom featuretools.primitives.standard.transform.natural_language import *\nfrom featuretools.primitives.standard.transform.not_primitive import Not\nfrom featuretools.primitives.standard.transform.numeric import *\nfrom featuretools.primitives.standard.transform.percent_change import PercentChange\nfrom featuretools.primitives.standard.transform.postal import *\nfrom featuretools.primitives.standard.transform.savgol_filter import SavgolFilter\nfrom featuretools.primitives.standard.transform.time_series import *\nfrom featuretools.primitives.standard.transform.url import *\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/absolute_diff.py",
    "content": "from woodwork.column_schema import ColumnSchema\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass AbsoluteDiff(TransformPrimitive):\n    \"\"\"Calculates the absolute difference from the previous element\n       in a list of numbers.\n\n    Description:\n        The absolute difference from the previous element is computed for\n        all elements in the input. The first item in the output will always\n        be nan, since there is no previous element for the first element.\n        Elements in the input containing nan will be filled using either a\n        forward-fill or backward-fill method, specified by the method argument.\n\n    Args:\n        method (str): Method to use for filling nan values in reindexed\n            Series. Possible values are ['pad', 'ffill', 'backfill', 'bfill'].\n            Default is 'ffill'.\n\n            `pad / ffill`: propagate last valid observation forward\n                to fill gap\n\n            `backfill / bfill`: propagate next valid observation backward\n                to fill gap\n\n        limit (int): The max number of consecutive NaN values in a gap that\n            can be filled. 
Default is None.\n\n    Examples:\n        >>> absolute_diff = AbsoluteDiff()\n        >>> absolute_diff([2, 5, 15, 3]).tolist()\n        [nan, 3.0, 10.0, 12.0]\n\n        Forward filling of input elements using the 'ffill' argument\n\n        >>> absolute_diff_ffill = AbsoluteDiff(method=\"ffill\")\n        >>> absolute_diff_ffill([None, 5, 10, 20, None, 10, None]).tolist()\n        [nan, nan, 5.0, 10.0, 0.0, 10.0, 0.0]\n\n        Backward filling of input elements using the 'bfill' argument\n\n        >>> absolute_diff_bfill = AbsoluteDiff(method=\"bfill\")\n        >>> absolute_diff_bfill([None, 5, 10, 20, None, 10, None]).tolist()\n        [nan, 0.0, 5.0, 10.0, 10.0, 0.0, nan]\n\n        The number of nan values that are filled can be limited\n\n        >>> absolute_diff_limitfill = AbsoluteDiff(limit=2)\n        >>> absolute_diff_limitfill([2, None, None, None, 3, 1]).tolist()\n        [nan, 0.0, 0.0, nan, nan, 2.0]\n\n    \"\"\"\n\n    name = \"absolute_diff\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n\n    def __init__(self, method=\"ffill\", limit=None):\n        if method not in [\"backfill\", \"bfill\", \"pad\", \"ffill\"]:\n            raise ValueError(\"Invalid method\")\n        self.method = method\n        self.limit = limit\n\n    def get_function(self):\n        def absolute_diff(data):\n            return data.fillna(method=self.method, limit=self.limit).diff().abs()\n\n        return absolute_diff\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/binary/__init__.py",
    "content": "from featuretools.primitives.standard.transform.binary.add_numeric import AddNumeric\nfrom featuretools.primitives.standard.transform.binary.add_numeric_scalar import (\n    AddNumericScalar,\n)\nfrom featuretools.primitives.standard.transform.binary.and_primitive import And\nfrom featuretools.primitives.standard.transform.binary.divide_by_feature import (\n    DivideByFeature,\n)\nfrom featuretools.primitives.standard.transform.binary.divide_numeric import (\n    DivideNumeric,\n)\nfrom featuretools.primitives.standard.transform.binary.divide_numeric_scalar import (\n    DivideNumericScalar,\n)\nfrom featuretools.primitives.standard.transform.binary.equal import Equal\nfrom featuretools.primitives.standard.transform.binary.equal_scalar import EqualScalar\nfrom featuretools.primitives.standard.transform.binary.greater_than import GreaterThan\nfrom featuretools.primitives.standard.transform.binary.greater_than_equal_to import (\n    GreaterThanEqualTo,\n)\nfrom featuretools.primitives.standard.transform.binary.greater_than_equal_to_scalar import (\n    GreaterThanEqualToScalar,\n)\nfrom featuretools.primitives.standard.transform.binary.greater_than_scalar import (\n    GreaterThanScalar,\n)\nfrom featuretools.primitives.standard.transform.binary.less_than import LessThan\nfrom featuretools.primitives.standard.transform.binary.less_than_equal_to import (\n    LessThanEqualTo,\n)\nfrom featuretools.primitives.standard.transform.binary.less_than_equal_to_scalar import (\n    LessThanEqualToScalar,\n)\nfrom featuretools.primitives.standard.transform.binary.less_than_scalar import (\n    LessThanScalar,\n)\nfrom featuretools.primitives.standard.transform.binary.modulo_by_feature import (\n    ModuloByFeature,\n)\nfrom featuretools.primitives.standard.transform.binary.modulo_numeric import (\n    ModuloNumeric,\n)\nfrom featuretools.primitives.standard.transform.binary.modulo_numeric_scalar import (\n    ModuloNumericScalar,\n)\nfrom 
featuretools.primitives.standard.transform.binary.multiply_boolean import (\n    MultiplyBoolean,\n)\nfrom featuretools.primitives.standard.transform.binary.multiply_numeric import (\n    MultiplyNumeric,\n)\nfrom featuretools.primitives.standard.transform.binary.multiply_numeric_boolean import (\n    MultiplyNumericBoolean,\n)\nfrom featuretools.primitives.standard.transform.binary.multiply_numeric_scalar import (\n    MultiplyNumericScalar,\n)\nfrom featuretools.primitives.standard.transform.binary.not_equal import NotEqual\nfrom featuretools.primitives.standard.transform.binary.not_equal_scalar import (\n    NotEqualScalar,\n)\nfrom featuretools.primitives.standard.transform.binary.or_primitive import Or\nfrom featuretools.primitives.standard.transform.binary.scalar_subtract_numeric_feature import (\n    ScalarSubtractNumericFeature,\n)\nfrom featuretools.primitives.standard.transform.binary.subtract_numeric import (\n    SubtractNumeric,\n)\nfrom featuretools.primitives.standard.transform.binary.subtract_numeric_scalar import (\n    SubtractNumericScalar,\n)\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/binary/add_numeric.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\n\nfrom featuretools.primitives.base.transform_primitive_base import TransformPrimitive\n\n\nclass AddNumeric(TransformPrimitive):\n    \"\"\"Performs element-wise addition of two lists.\n\n    Description:\n        Given a list of values X and a list of values\n        Y, determine the sum of each value in X with its\n        corresponding value in Y.\n\n    Examples:\n        >>> add_numeric = AddNumeric()\n        >>> add_numeric([2, 1, 2], [1, 2, 2]).tolist()\n        [3, 3, 4]\n    \"\"\"\n\n    name = \"add_numeric\"\n    input_types = [\n        ColumnSchema(semantic_tags={\"numeric\"}),\n        ColumnSchema(semantic_tags={\"numeric\"}),\n    ]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n    commutative = True\n\n    description_template = \"the sum of {} and {}\"\n\n    def get_function(self):\n        return np.add\n\n    def generate_name(self, base_feature_names):\n        return \"%s + %s\" % (base_feature_names[0], base_feature_names[1])\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/binary/add_numeric_scalar.py",
    "content": "from woodwork.column_schema import ColumnSchema\n\nfrom featuretools.primitives.base.transform_primitive_base import TransformPrimitive\n\n\nclass AddNumericScalar(TransformPrimitive):\n    \"\"\"Adds a scalar to each value in the list.\n\n    Description:\n        Given a list of numeric values and a scalar, add\n        the given scalar to each value in the list.\n\n    Examples:\n        >>> add_numeric_scalar = AddNumericScalar(value=2)\n        >>> add_numeric_scalar([3, 1, 2]).tolist()\n        [5, 3, 4]\n    \"\"\"\n\n    name = \"add_numeric_scalar\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n\n    def __init__(self, value=0):\n        self.value = value\n        self.description_template = \"the sum of {{}} and {}\".format(self.value)\n\n    def get_function(self):\n        def add_scalar(vals):\n            return vals + self.value\n\n        return add_scalar\n\n    def generate_name(self, base_feature_names):\n        return \"%s + %s\" % (base_feature_names[0], str(self.value))\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/binary/and_primitive.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Boolean, BooleanNullable\n\nfrom featuretools.primitives.base.transform_primitive_base import TransformPrimitive\n\n\nclass And(TransformPrimitive):\n    \"\"\"Performs element-wise logical AND of two lists.\n\n    Description:\n        Given a list of booleans X and a list of booleans Y,\n        determine whether each value in X is `True`, and\n        whether its corresponding value in Y is also `True`.\n\n    Examples:\n        >>> _and = And()\n        >>> _and([False, True, False], [True, True, False]).tolist()\n        [False, True, False]\n    \"\"\"\n\n    name = \"and\"\n    input_types = [\n        [\n            ColumnSchema(logical_type=BooleanNullable),\n            ColumnSchema(logical_type=BooleanNullable),\n        ],\n        [ColumnSchema(logical_type=Boolean), ColumnSchema(logical_type=Boolean)],\n        [\n            ColumnSchema(logical_type=Boolean),\n            ColumnSchema(logical_type=BooleanNullable),\n        ],\n        [\n            ColumnSchema(logical_type=BooleanNullable),\n            ColumnSchema(logical_type=Boolean),\n        ],\n    ]\n    return_type = ColumnSchema(logical_type=BooleanNullable)\n    commutative = True\n\n    description_template = \"whether {} and {} are true\"\n\n    def get_function(self):\n        return np.logical_and\n\n    def generate_name(self, base_feature_names):\n        return \"AND(%s, %s)\" % (base_feature_names[0], base_feature_names[1])\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/binary/divide_by_feature.py",
    "content": "from woodwork.column_schema import ColumnSchema\n\nfrom featuretools.primitives.base.transform_primitive_base import TransformPrimitive\n\n\nclass DivideByFeature(TransformPrimitive):\n    \"\"\"Divides a scalar by each value in the list.\n\n    Description:\n        Given a list of numeric values and a scalar, divide\n        the scalar by each value and return the list of\n        quotients.\n\n    Examples:\n        >>> divide_by_feature = DivideByFeature(value=2)\n        >>> divide_by_feature([4, 1, 2]).tolist()\n        [0.5, 2.0, 1.0]\n    \"\"\"\n\n    name = \"divide_by_feature\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n\n    def __init__(self, value=1):\n        self.value = value\n        self.description_template = \"the result of {} divided by {{}}\".format(\n            self.value,\n        )\n\n    def get_function(self):\n        def divide_by_feature(vals):\n            return self.value / vals\n\n        return divide_by_feature\n\n    def generate_name(self, base_feature_names):\n        return \"%s / %s\" % (str(self.value), base_feature_names[0])\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/binary/divide_numeric.py",
    "content": "from woodwork.column_schema import ColumnSchema\n\nfrom featuretools.primitives.base.transform_primitive_base import TransformPrimitive\n\n\nclass DivideNumeric(TransformPrimitive):\n    \"\"\"Performs element-wise division of two lists.\n\n    Description:\n        Given a list of values X and a list of values\n        Y, determine the quotient of each value in X\n        divided by its corresponding value in Y.\n\n    Args:\n        commutative (bool): determines if Deep Feature Synthesis should\n            generate both x / y and y / x, or just one. If True, there is\n            no guarantee which of the two will be generated. Defaults to False.\n\n    Examples:\n        >>> divide_numeric = DivideNumeric()\n        >>> divide_numeric([2.0, 1.0, 2.0], [1.0, 2.0, 2.0]).tolist()\n        [2.0, 0.5, 1.0]\n    \"\"\"\n\n    name = \"divide_numeric\"\n    input_types = [\n        ColumnSchema(semantic_tags={\"numeric\"}),\n        ColumnSchema(semantic_tags={\"numeric\"}),\n    ]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n\n    description_template = \"the result of {} divided by {}\"\n\n    def __init__(self, commutative=False):\n        self.commutative = commutative\n\n    def get_function(self):\n        def divide_numeric(val1, val2):\n            return val1 / val2\n\n        return divide_numeric\n\n    def generate_name(self, base_feature_names):\n        return \"%s / %s\" % (base_feature_names[0], base_feature_names[1])\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/binary/divide_numeric_scalar.py",
    "content": "from woodwork.column_schema import ColumnSchema\n\nfrom featuretools.primitives.base.transform_primitive_base import TransformPrimitive\n\n\nclass DivideNumericScalar(TransformPrimitive):\n    \"\"\"Divides each element in the list by a scalar.\n\n    Description:\n        Given a list of numeric values and a scalar, divide\n        each value in the list by the scalar.\n\n    Examples:\n        >>> divide_numeric_scalar = DivideNumericScalar(value=2)\n        >>> divide_numeric_scalar([3, 1, 2]).tolist()\n        [1.5, 0.5, 1.0]\n    \"\"\"\n\n    name = \"divide_numeric_scalar\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n\n    def __init__(self, value=1):\n        self.value = value\n        self.description_template = \"the result of {{}} divided by {}\".format(\n            self.value,\n        )\n\n    def get_function(self):\n        def divide_scalar(vals):\n            return vals / self.value\n\n        return divide_scalar\n\n    def generate_name(self, base_feature_names):\n        return \"%s / %s\" % (base_feature_names[0], str(self.value))\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/binary/equal.py",
    "content": "import pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import BooleanNullable\n\nfrom featuretools.primitives.base.transform_primitive_base import TransformPrimitive\n\n\nclass Equal(TransformPrimitive):\n    \"\"\"Determines if values in one list are equal to another list.\n\n    Description:\n        Given a list of values X and a list of values Y, determine\n        whether each value in X is equal to each corresponding value\n        in Y.\n\n    Examples:\n        >>> equal = Equal()\n        >>> equal([2, 1, 2], [1, 2, 2]).tolist()\n        [False, False, True]\n    \"\"\"\n\n    name = \"equal\"\n    input_types = [ColumnSchema(), ColumnSchema()]\n    return_type = ColumnSchema(logical_type=BooleanNullable)\n    commutative = True\n\n    description_template = \"whether {} equals {}\"\n\n    def get_function(self):\n        def equal(x_vals, y_vals):\n            if isinstance(x_vals.dtype, pd.CategoricalDtype) and isinstance(\n                y_vals.dtype,\n                pd.CategoricalDtype,\n            ):\n                categories = set(x_vals.cat.categories).union(\n                    set(y_vals.cat.categories),\n                )\n                x_vals = x_vals.cat.add_categories(\n                    categories.difference(set(x_vals.cat.categories)),\n                )\n                y_vals = y_vals.cat.add_categories(\n                    categories.difference(set(y_vals.cat.categories)),\n                )\n            return x_vals.eq(y_vals)\n\n        return equal\n\n    def generate_name(self, base_feature_names):\n        return \"%s = %s\" % (base_feature_names[0], base_feature_names[1])\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/binary/equal_scalar.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import BooleanNullable\n\nfrom featuretools.primitives.base.transform_primitive_base import TransformPrimitive\n\n\nclass EqualScalar(TransformPrimitive):\n    \"\"\"Determines if values in a list are equal to a given scalar.\n\n    Description:\n        Given a list of values and a constant scalar, determine\n        whether each of the values is equal to the scalar.\n\n    Examples:\n        >>> equal_scalar = EqualScalar(value=2)\n        >>> equal_scalar([3, 1, 2]).tolist()\n        [False, False, True]\n    \"\"\"\n\n    name = \"equal_scalar\"\n    input_types = [ColumnSchema()]\n    return_type = ColumnSchema(logical_type=BooleanNullable)\n\n    def __init__(self, value=None):\n        self.value = value\n        self.description_template = \"whether {{}} equals {}\".format(self.value)\n\n    def get_function(self):\n        def equal_scalar(vals):\n            return vals == self.value\n\n        return equal_scalar\n\n    def generate_name(self, base_feature_names):\n        return \"%s = %s\" % (base_feature_names[0], str(self.value))\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/binary/greater_than.py",
    "content": "import numpy as np\nimport pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import BooleanNullable, Datetime, Ordinal\n\nfrom featuretools.primitives.base.transform_primitive_base import TransformPrimitive\n\n\nclass GreaterThan(TransformPrimitive):\n    \"\"\"Determines if values in one list are greater than another list.\n\n    Description:\n        Given a list of values X and a list of values Y, determine\n        whether each value in X is greater than each corresponding\n        value in Y. Equal pairs will return `False`.\n\n    Examples:\n        >>> greater_than = GreaterThan()\n        >>> greater_than([2, 1, 2], [1, 2, 2]).tolist()\n        [True, False, False]\n    \"\"\"\n\n    name = \"greater_than\"\n    input_types = [\n        [\n            ColumnSchema(semantic_tags={\"numeric\"}),\n            ColumnSchema(semantic_tags={\"numeric\"}),\n        ],\n        [ColumnSchema(logical_type=Datetime), ColumnSchema(logical_type=Datetime)],\n        [ColumnSchema(logical_type=Ordinal), ColumnSchema(logical_type=Ordinal)],\n    ]\n    return_type = ColumnSchema(logical_type=BooleanNullable)\n    description_template = \"whether {} is greater than {}\"\n\n    def get_function(self):\n        def greater_than(val1, val2):\n            val1_is_categorical = isinstance(val1.dtype, pd.CategoricalDtype)\n            val2_is_categorical = isinstance(val2.dtype, pd.CategoricalDtype)\n            if val1_is_categorical and val2_is_categorical:\n                if not all(val1.cat.categories == val2.cat.categories):\n                    return val1.where(pd.isnull, np.nan)\n            elif val1_is_categorical or val2_is_categorical:\n                # This can happen because CFM does not set proper dtypes for intermediate\n                # features, so some agg features that should be Ordinal don't yet have correct type.\n                return val1.where(pd.isnull, np.nan)\n            return val1 > 
val2\n\n        return greater_than\n\n    def generate_name(self, base_feature_names):\n        return \"%s > %s\" % (base_feature_names[0], base_feature_names[1])\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/binary/greater_than_equal_to.py",
    "content": "import numpy as np\nimport pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import BooleanNullable, Datetime, Ordinal\n\nfrom featuretools.primitives.base.transform_primitive_base import TransformPrimitive\n\n\nclass GreaterThanEqualTo(TransformPrimitive):\n    \"\"\"Determines if values in one list are greater than or equal to another list.\n\n    Description:\n        Given a list of values X and a list of values Y, determine\n        whether each value in X is greater than or equal to each\n        corresponding value in Y. Equal pairs will return `True`.\n\n    Examples:\n        >>> greater_than_equal_to = GreaterThanEqualTo()\n        >>> greater_than_equal_to([2, 1, 2], [1, 2, 2]).tolist()\n        [True, False, True]\n    \"\"\"\n\n    name = \"greater_than_equal_to\"\n    input_types = [\n        [\n            ColumnSchema(semantic_tags={\"numeric\"}),\n            ColumnSchema(semantic_tags={\"numeric\"}),\n        ],\n        [ColumnSchema(logical_type=Datetime), ColumnSchema(logical_type=Datetime)],\n        [ColumnSchema(logical_type=Ordinal), ColumnSchema(logical_type=Ordinal)],\n    ]\n    return_type = ColumnSchema(logical_type=BooleanNullable)\n\n    description_template = \"whether {} is greater than or equal to {}\"\n\n    def get_function(self):\n        def greater_than_equal(val1, val2):\n            val1_is_categorical = isinstance(val1.dtype, pd.CategoricalDtype)\n            val2_is_categorical = isinstance(val2.dtype, pd.CategoricalDtype)\n            if val1_is_categorical and val2_is_categorical:\n                if not all(val1.cat.categories == val2.cat.categories):\n                    return val1.where(pd.isnull, np.nan)\n            elif val1_is_categorical or val2_is_categorical:\n                # This can happen because CFM does not set proper dtypes for intermediate\n                # features, so some agg features that should be Ordinal don't yet have correct type.\n    
            return val1.where(pd.isnull, np.nan)\n            return val1 >= val2\n\n        return greater_than_equal\n\n    def generate_name(self, base_feature_names):\n        return \"%s >= %s\" % (base_feature_names[0], base_feature_names[1])\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/binary/greater_than_equal_to_scalar.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import BooleanNullable\n\nfrom featuretools.primitives.base.transform_primitive_base import TransformPrimitive\n\n\nclass GreaterThanEqualToScalar(TransformPrimitive):\n    \"\"\"Determines if values are greater than or equal to a given scalar.\n\n    Description:\n        Given a list of values and a constant scalar, determine\n        whether each of the values is greater than or equal to the\n        scalar. If a value is equal to the scalar, return `True`.\n\n    Examples:\n        >>> greater_than_equal_to_scalar = GreaterThanEqualToScalar(value=2)\n        >>> greater_than_equal_to_scalar([3, 1, 2]).tolist()\n        [True, False, True]\n    \"\"\"\n\n    name = \"greater_than_equal_to_scalar\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(logical_type=BooleanNullable)\n\n    def __init__(self, value=0):\n        self.value = value\n        self.description_template = (\n            \"whether {{}} is greater than or equal to {}\".format(self.value)\n        )\n\n    def get_function(self):\n        def greater_than_equal_to_scalar(vals):\n            return vals >= self.value\n\n        return greater_than_equal_to_scalar\n\n    def generate_name(self, base_feature_names):\n        return \"%s >= %s\" % (base_feature_names[0], str(self.value))\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/binary/greater_than_scalar.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import BooleanNullable\n\nfrom featuretools.primitives.base.transform_primitive_base import TransformPrimitive\n\n\nclass GreaterThanScalar(TransformPrimitive):\n    \"\"\"Determines if values are greater than a given scalar.\n\n    Description:\n        Given a list of values and a constant scalar, determine\n        whether each of the values is greater than the scalar.\n        If a value is equal to the scalar, return `False`.\n\n    Examples:\n        >>> greater_than_scalar = GreaterThanScalar(value=2)\n        >>> greater_than_scalar([3, 1, 2]).tolist()\n        [True, False, False]\n    \"\"\"\n\n    name = \"greater_than_scalar\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(logical_type=BooleanNullable)\n\n    def __init__(self, value=0):\n        self.value = value\n        self.description_template = \"whether {{}} is greater than {}\".format(self.value)\n\n    def get_function(self):\n        def greater_than_scalar(vals):\n            return vals > self.value\n\n        return greater_than_scalar\n\n    def generate_name(self, base_feature_names):\n        return \"%s > %s\" % (base_feature_names[0], str(self.value))\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/binary/less_than.py",
    "content": "import numpy as np\nimport pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import BooleanNullable, Datetime, Ordinal\n\nfrom featuretools.primitives.base.transform_primitive_base import TransformPrimitive\n\n\nclass LessThan(TransformPrimitive):\n    \"\"\"Determines if values in one list are less than another list.\n\n    Description:\n        Given a list of values X and a list of values Y, determine\n        whether each value in X is less than each corresponding value\n        in Y. Equal pairs will return `False`.\n\n    Examples:\n        >>> less_than = LessThan()\n        >>> less_than([2, 1, 2], [1, 2, 2]).tolist()\n        [False, True, False]\n    \"\"\"\n\n    name = \"less_than\"\n    input_types = [\n        [\n            ColumnSchema(semantic_tags={\"numeric\"}),\n            ColumnSchema(semantic_tags={\"numeric\"}),\n        ],\n        [ColumnSchema(logical_type=Datetime), ColumnSchema(logical_type=Datetime)],\n        [ColumnSchema(logical_type=Ordinal), ColumnSchema(logical_type=Ordinal)],\n    ]\n    return_type = ColumnSchema(logical_type=BooleanNullable)\n\n    description_template = \"whether {} is less than {}\"\n\n    def get_function(self):\n        def less_than(val1, val2):\n            val1_is_categorical = isinstance(val1.dtype, pd.CategoricalDtype)\n            val2_is_categorical = isinstance(val2.dtype, pd.CategoricalDtype)\n            if val1_is_categorical and val2_is_categorical:\n                if not all(val1.cat.categories == val2.cat.categories):\n                    return val1.where(pd.isnull, np.nan)\n            elif val1_is_categorical or val2_is_categorical:\n                # This can happen because CFM does not set proper dtypes for intermediate\n                # features, so some agg features that should be Ordinal don't yet have correct type.\n                return val1.where(pd.isnull, np.nan)\n            return val1 < val2\n\n        return 
less_than\n\n    def generate_name(self, base_feature_names):\n        return \"%s < %s\" % (base_feature_names[0], base_feature_names[1])\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/binary/less_than_equal_to.py",
    "content": "import numpy as np\nimport pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import BooleanNullable, Datetime, Ordinal\n\nfrom featuretools.primitives.base.transform_primitive_base import TransformPrimitive\n\n\nclass LessThanEqualTo(TransformPrimitive):\n    \"\"\"Determines if values in one list are less than or equal to another list.\n\n    Description:\n        Given a list of values X and a list of values Y, determine\n        whether each value in X is less than or equal to each\n        corresponding value in Y. Equal pairs will return `True`.\n\n    Examples:\n        >>> less_than_equal_to = LessThanEqualTo()\n        >>> less_than_equal_to([2, 1, 2], [1, 2, 2]).tolist()\n        [False, True, True]\n    \"\"\"\n\n    name = \"less_than_equal_to\"\n    input_types = [\n        [\n            ColumnSchema(semantic_tags={\"numeric\"}),\n            ColumnSchema(semantic_tags={\"numeric\"}),\n        ],\n        [ColumnSchema(logical_type=Datetime), ColumnSchema(logical_type=Datetime)],\n        [ColumnSchema(logical_type=Ordinal), ColumnSchema(logical_type=Ordinal)],\n    ]\n    return_type = ColumnSchema(logical_type=BooleanNullable)\n\n    description_template = \"whether {} is less than or equal to {}\"\n\n    def get_function(self):\n        def less_than_equal(val1, val2):\n            val1_is_categorical = isinstance(val1.dtype, pd.CategoricalDtype)\n            val2_is_categorical = isinstance(val2.dtype, pd.CategoricalDtype)\n            if val1_is_categorical and val2_is_categorical:\n                if not all(val1.cat.categories == val2.cat.categories):\n                    return val1.where(pd.isnull, np.nan)\n            elif val1_is_categorical or val2_is_categorical:\n                # This can happen because CFM does not set proper dtypes for intermediate\n                # features, so some agg features that should be Ordinal don't yet have correct type.\n                return 
val1.where(pd.isnull, np.nan)\n            return val1 <= val2\n\n        return less_than_equal\n\n    def generate_name(self, base_feature_names):\n        return \"%s <= %s\" % (base_feature_names[0], base_feature_names[1])\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/binary/less_than_equal_to_scalar.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import BooleanNullable\n\nfrom featuretools.primitives.base.transform_primitive_base import TransformPrimitive\n\n\nclass LessThanEqualToScalar(TransformPrimitive):\n    \"\"\"Determines if values are less than or equal to a given scalar.\n\n    Description:\n        Given a list of values and a constant scalar, determine\n        whether each of the values is less than or equal to the\n        scalar. If a value is equal to the scalar, return `True`.\n\n    Examples:\n        >>> less_than_equal_to_scalar = LessThanEqualToScalar(value=2)\n        >>> less_than_equal_to_scalar([3, 1, 2]).tolist()\n        [False, True, True]\n    \"\"\"\n\n    name = \"less_than_equal_to_scalar\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(logical_type=BooleanNullable)\n\n    def __init__(self, value=0):\n        self.value = value\n        self.description_template = \"whether {{}} is less than or equal to {}\".format(\n            self.value,\n        )\n\n    def get_function(self):\n        def less_than_equal_to_scalar(vals):\n            return vals <= self.value\n\n        return less_than_equal_to_scalar\n\n    def generate_name(self, base_feature_names):\n        return \"%s <= %s\" % (base_feature_names[0], str(self.value))\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/binary/less_than_scalar.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import BooleanNullable\n\nfrom featuretools.primitives.base.transform_primitive_base import TransformPrimitive\n\n\nclass LessThanScalar(TransformPrimitive):\n    \"\"\"Determines if values are less than a given scalar.\n\n    Description:\n        Given a list of values and a constant scalar, determine\n        whether each of the values is less than the scalar.\n        If a value is equal to the scalar, return `False`.\n\n    Examples:\n        >>> less_than_scalar = LessThanScalar(value=2)\n        >>> less_than_scalar([3, 1, 2]).tolist()\n        [False, True, False]\n    \"\"\"\n\n    name = \"less_than_scalar\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(logical_type=BooleanNullable)\n\n    def __init__(self, value=0):\n        self.value = value\n        self.description_template = \"whether {{}} is less than {}\".format(self.value)\n\n    def get_function(self):\n        def less_than_scalar(vals):\n            return vals < self.value\n\n        return less_than_scalar\n\n    def generate_name(self, base_feature_names):\n        return \"%s < %s\" % (base_feature_names[0], str(self.value))\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/binary/modulo_by_feature.py",
    "content": "from woodwork.column_schema import ColumnSchema\n\nfrom featuretools.primitives.base.transform_primitive_base import TransformPrimitive\n\n\nclass ModuloByFeature(TransformPrimitive):\n    \"\"\"Computes the modulo of a scalar by each element in a list.\n\n    Description:\n        Given a list of numeric values and a scalar, return the\n        modulo, or remainder of the scalar after being divided\n        by each value.\n\n    Examples:\n        >>> modulo_by_feature = ModuloByFeature(value=2)\n        >>> modulo_by_feature([4, 1, 2]).tolist()\n        [2, 0, 0]\n    \"\"\"\n\n    name = \"modulo_by_feature\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n\n    def __init__(self, value=1):\n        self.value = value\n        self.description_template = \"the remainder after dividing {} by {{}}\".format(\n            self.value,\n        )\n\n    def get_function(self):\n        def modulo_by_feature(vals):\n            return self.value % vals\n\n        return modulo_by_feature\n\n    def generate_name(self, base_feature_names):\n        return \"%s %% %s\" % (str(self.value), base_feature_names[0])\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/binary/modulo_numeric.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\n\nfrom featuretools.primitives.base.transform_primitive_base import TransformPrimitive\n\n\nclass ModuloNumeric(TransformPrimitive):\n    \"\"\"Performs element-wise modulo of two lists.\n\n    Description:\n        Given a list of values X and a list of values Y,\n        determine the modulo, or remainder of each value in\n        X after it's divided by its corresponding value in Y.\n\n    Examples:\n        >>> modulo_numeric = ModuloNumeric()\n        >>> modulo_numeric([2, 1, 5], [1, 2, 2]).tolist()\n        [0, 1, 1]\n    \"\"\"\n\n    name = \"modulo_numeric\"\n    input_types = [\n        ColumnSchema(semantic_tags={\"numeric\"}),\n        ColumnSchema(semantic_tags={\"numeric\"}),\n    ]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n\n    description_template = \"the remainder after dividing {} by {}\"\n\n    def get_function(self):\n        return np.mod\n\n    def generate_name(self, base_feature_names):\n        return \"%s %% %s\" % (base_feature_names[0], base_feature_names[1])\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/binary/modulo_numeric_scalar.py",
    "content": "from woodwork.column_schema import ColumnSchema\n\nfrom featuretools.primitives.base.transform_primitive_base import TransformPrimitive\n\n\nclass ModuloNumericScalar(TransformPrimitive):\n    \"\"\"Computes the modulo of each element in the list by a given scalar.\n\n    Description:\n        Given a list of numeric values and a scalar, return\n        the modulo, or remainder of each value after being\n        divided by the scalar.\n\n    Examples:\n        >>> modulo_numeric_scalar = ModuloNumericScalar(value=2)\n        >>> modulo_numeric_scalar([3, 1, 2]).tolist()\n        [1, 1, 0]\n    \"\"\"\n\n    name = \"modulo_numeric_scalar\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n\n    def __init__(self, value=1):\n        self.value = value\n        self.description_template = \"the remainder after dividing {{}} by {}\".format(\n            self.value,\n        )\n\n    def get_function(self):\n        def modulo_scalar(vals):\n            return vals % self.value\n\n        return modulo_scalar\n\n    def generate_name(self, base_feature_names):\n        return \"%s %% %s\" % (base_feature_names[0], str(self.value))\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/binary/multiply_boolean.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Boolean, BooleanNullable\n\nfrom featuretools.primitives.base.transform_primitive_base import TransformPrimitive\n\n\nclass MultiplyBoolean(TransformPrimitive):\n    \"\"\"Performs element-wise multiplication of two lists of boolean values.\n\n    Description:\n        Given a list of boolean values X and a list of boolean\n        values Y, determine the product of each value in X\n        with its corresponding value in Y.\n\n    Examples:\n        >>> multiply_boolean = MultiplyBoolean()\n        >>> multiply_boolean([True, True, False], [True, False, True]).tolist()\n        [True, False, False]\n    \"\"\"\n\n    name = \"multiply_boolean\"\n    input_types = [\n        [\n            ColumnSchema(logical_type=BooleanNullable),\n            ColumnSchema(logical_type=BooleanNullable),\n        ],\n        [ColumnSchema(logical_type=Boolean), ColumnSchema(logical_type=Boolean)],\n        [\n            ColumnSchema(logical_type=Boolean),\n            ColumnSchema(logical_type=BooleanNullable),\n        ],\n        [\n            ColumnSchema(logical_type=BooleanNullable),\n            ColumnSchema(logical_type=Boolean),\n        ],\n    ]\n    return_type = ColumnSchema(logical_type=BooleanNullable)\n    commutative = True\n    description_template = \"the product of {} and {}\"\n\n    def get_function(self):\n        return np.bitwise_and\n\n    def generate_name(self, base_feature_names):\n        return \"%s * %s\" % (base_feature_names[0], base_feature_names[1])\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/binary/multiply_numeric.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\n\nfrom featuretools.primitives.base.transform_primitive_base import TransformPrimitive\n\n\nclass MultiplyNumeric(TransformPrimitive):\n    \"\"\"Performs element-wise multiplication of two lists.\n\n    Description:\n        Given a list of values X and a list of values\n        Y, determine the product of each value in X\n        with its corresponding value in Y.\n\n    Examples:\n        >>> multiply_numeric = MultiplyNumeric()\n        >>> multiply_numeric([2, 1, 2], [1, 2, 2]).tolist()\n        [2, 2, 4]\n    \"\"\"\n\n    name = \"multiply_numeric\"\n    input_types = [\n        ColumnSchema(semantic_tags={\"numeric\"}),\n        ColumnSchema(semantic_tags={\"numeric\"}),\n    ]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n    commutative = True\n\n    description_template = \"the product of {} and {}\"\n\n    def get_function(self):\n        return np.multiply\n\n    def generate_name(self, base_feature_names):\n        return \"%s * %s\" % (base_feature_names[0], base_feature_names[1])\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/binary/multiply_numeric_boolean.py",
    "content": "import pandas.api.types as pdtypes\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Boolean, BooleanNullable\n\nfrom featuretools.primitives.base.transform_primitive_base import TransformPrimitive\n\n\nclass MultiplyNumericBoolean(TransformPrimitive):\n    \"\"\"Performs element-wise multiplication of a numeric list with a boolean list.\n\n    Description:\n        Given a list of numeric values X and a list of\n        boolean values Y, return the values in X where\n        the corresponding value in Y is True.\n\n    Examples:\n        >>> import pandas as pd\n        >>> multiply_numeric_boolean = MultiplyNumericBoolean()\n        >>> multiply_numeric_boolean([2, 1, 2], [True, True, False]).tolist()\n        [2, 1, 0]\n        >>> multiply_numeric_boolean([2, None, None], [True, True, False]).astype(\"float64\").tolist()\n        [2.0, nan, nan]\n        >>> multiply_numeric_boolean([2, 1, 2], pd.Series([True, True, pd.NA], dtype=\"boolean\")).tolist()\n        [2, 1, <NA>]\n    \"\"\"\n\n    name = \"multiply_numeric_boolean\"\n    input_types = [\n        [\n            ColumnSchema(semantic_tags={\"numeric\"}),\n            ColumnSchema(logical_type=Boolean),\n        ],\n        [\n            ColumnSchema(semantic_tags={\"numeric\"}),\n            ColumnSchema(logical_type=BooleanNullable),\n        ],\n        [\n            ColumnSchema(logical_type=Boolean),\n            ColumnSchema(semantic_tags={\"numeric\"}),\n        ],\n        [\n            ColumnSchema(logical_type=BooleanNullable),\n            ColumnSchema(semantic_tags={\"numeric\"}),\n        ],\n    ]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n    commutative = True\n    description_template = \"the product of {} and {}\"\n\n    def get_function(self):\n        def multiply_numeric_boolean(ser1, ser2):\n            if pdtypes.is_bool_dtype(ser1):\n                bools = ser1\n                vals = ser2\n            
else:\n                bools = ser2\n                vals = ser1\n            result = vals * bools.astype(\"Int64\")\n            return result\n\n        return multiply_numeric_boolean\n\n    def generate_name(self, base_feature_names):\n        return \"%s * %s\" % (base_feature_names[0], base_feature_names[1])\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/binary/multiply_numeric_scalar.py",
    "content": "from woodwork.column_schema import ColumnSchema\n\nfrom featuretools.primitives.base.transform_primitive_base import TransformPrimitive\n\n\nclass MultiplyNumericScalar(TransformPrimitive):\n    \"\"\"Multiplies each element in the list by a scalar.\n\n    Description:\n        Given a list of numeric values and a scalar, multiply\n        each value in the list by the scalar.\n\n    Examples:\n        >>> multiply_numeric_scalar = MultiplyNumericScalar(value=2)\n        >>> multiply_numeric_scalar([3, 1, 2]).tolist()\n        [6, 2, 4]\n    \"\"\"\n\n    name = \"multiply_numeric_scalar\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n\n    def __init__(self, value=1):\n        self.value = value\n        self.description_template = \"the product of {{}} and {}\".format(self.value)\n\n    def get_function(self):\n        def multiply_scalar(vals):\n            return vals * self.value\n\n        return multiply_scalar\n\n    def generate_name(self, base_feature_names):\n        return \"%s * %s\" % (base_feature_names[0], str(self.value))\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/binary/not_equal.py",
    "content": "import pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import BooleanNullable\n\nfrom featuretools.primitives.base.transform_primitive_base import TransformPrimitive\n\n\nclass NotEqual(TransformPrimitive):\n    \"\"\"Determines if values in one list are not equal to another list.\n\n    Description:\n        Given a list of values X and a list of values Y, determine\n        whether each value in X is not equal to each corresponding\n        value in Y.\n\n    Examples:\n        >>> not_equal = NotEqual()\n        >>> not_equal([2, 1, 2], [1, 2, 2]).tolist()\n        [True, True, False]\n    \"\"\"\n\n    name = \"not_equal\"\n    input_types = [ColumnSchema(), ColumnSchema()]\n    return_type = ColumnSchema(logical_type=BooleanNullable)\n    commutative = True\n    description_template = \"whether {} does not equal {}\"\n\n    def get_function(self):\n        def not_equal(x_vals, y_vals):\n            if isinstance(x_vals.dtype, pd.CategoricalDtype) and isinstance(\n                y_vals.dtype,\n                pd.CategoricalDtype,\n            ):\n                categories = set(x_vals.cat.categories).union(\n                    set(y_vals.cat.categories),\n                )\n                x_vals = x_vals.cat.add_categories(\n                    categories.difference(set(x_vals.cat.categories)),\n                )\n                y_vals = y_vals.cat.add_categories(\n                    categories.difference(set(y_vals.cat.categories)),\n                )\n            return x_vals.ne(y_vals)\n\n        return not_equal\n\n    def generate_name(self, base_feature_names):\n        return \"%s != %s\" % (base_feature_names[0], base_feature_names[1])\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/binary/not_equal_scalar.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import BooleanNullable\n\nfrom featuretools.primitives.base.transform_primitive_base import TransformPrimitive\n\n\nclass NotEqualScalar(TransformPrimitive):\n    \"\"\"Determines if values in a list are not equal to a given scalar.\n\n    Description:\n        Given a list of values and a constant scalar, determine\n        whether each of the values is not equal to the scalar.\n\n    Examples:\n        >>> not_equal_scalar = NotEqualScalar(value=2)\n        >>> not_equal_scalar([3, 1, 2]).tolist()\n        [True, True, False]\n    \"\"\"\n\n    name = \"not_equal_scalar\"\n    input_types = [ColumnSchema()]\n    return_type = ColumnSchema(logical_type=BooleanNullable)\n\n    def __init__(self, value=None):\n        self.value = value\n        self.description_template = \"whether {{}} does not equal {}\".format(self.value)\n\n    def get_function(self):\n        def not_equal_scalar(vals):\n            return vals != self.value\n\n        return not_equal_scalar\n\n    def generate_name(self, base_feature_names):\n        return \"%s != %s\" % (base_feature_names[0], str(self.value))\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/binary/or_primitive.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Boolean, BooleanNullable\n\nfrom featuretools.primitives.base.transform_primitive_base import TransformPrimitive\n\n\nclass Or(TransformPrimitive):\n    \"\"\"Performs element-wise logical OR of two lists.\n\n    Description:\n        Given a list of booleans X and a list of booleans Y,\n        determine whether each value in X is `True`, or\n        whether its corresponding value in Y is `True`.\n\n    Examples:\n        >>> _or = Or()\n        >>> _or([False, True, False], [True, True, False]).tolist()\n        [True, True, False]\n    \"\"\"\n\n    name = \"or\"\n    input_types = [\n        [\n            ColumnSchema(logical_type=BooleanNullable),\n            ColumnSchema(logical_type=BooleanNullable),\n        ],\n        [ColumnSchema(logical_type=Boolean), ColumnSchema(logical_type=Boolean)],\n        [\n            ColumnSchema(logical_type=Boolean),\n            ColumnSchema(logical_type=BooleanNullable),\n        ],\n        [\n            ColumnSchema(logical_type=BooleanNullable),\n            ColumnSchema(logical_type=Boolean),\n        ],\n    ]\n    return_type = ColumnSchema(logical_type=BooleanNullable)\n    commutative = True\n\n    description_template = \"whether {} is true or {} is true\"\n\n    def get_function(self):\n        return np.logical_or\n\n    def generate_name(self, base_feature_names):\n        return \"OR(%s, %s)\" % (base_feature_names[0], base_feature_names[1])\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/binary/scalar_subtract_numeric_feature.py",
    "content": "from woodwork.column_schema import ColumnSchema\n\nfrom featuretools.primitives.base.transform_primitive_base import TransformPrimitive\n\n\nclass ScalarSubtractNumericFeature(TransformPrimitive):\n    \"\"\"Subtracts each value in the list from a given scalar.\n\n    Description:\n        Given a list of numeric values and a scalar, subtract\n        the each value from the scalar and return the list of\n        differences.\n\n    Examples:\n        >>> scalar_subtract_numeric_feature = ScalarSubtractNumericFeature(value=2)\n        >>> scalar_subtract_numeric_feature([3, 1, 2]).tolist()\n        [-1, 1, 0]\n    \"\"\"\n\n    name = \"scalar_subtract_numeric_feature\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n\n    def __init__(self, value=0):\n        self.value = value\n        self.description_template = \"the result {} minus {{}}\".format(self.value)\n\n    def get_function(self):\n        def scalar_subtract_numeric_feature(vals):\n            return self.value - vals\n\n        return scalar_subtract_numeric_feature\n\n    def generate_name(self, base_feature_names):\n        return \"%s - %s\" % (str(self.value), base_feature_names[0])\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/binary/subtract_numeric.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\n\nfrom featuretools.primitives.base.transform_primitive_base import TransformPrimitive\n\n\nclass SubtractNumeric(TransformPrimitive):\n    \"\"\"Performs element-wise subtraction of two lists.\n\n    Description:\n        Given a list of values X and a list of values\n        Y, determine the difference of each value\n        in X from its corresponding value in Y.\n\n    Args:\n        commutative (bool): determines if Deep Feature Synthesis should\n            generate both x - y and y - x, or just one. If True, there is no\n            guarantee which of the two will be generated. Defaults to True.\n\n    Notes:\n        commutative is True by default since False would result in 2 perfectly\n        correlated series.\n\n    Examples:\n        >>> subtract_numeric = SubtractNumeric()\n        >>> subtract_numeric([2, 1, 2], [1, 2, 2]).tolist()\n        [1, -1, 0]\n    \"\"\"\n\n    name = \"subtract_numeric\"\n    input_types = [\n        ColumnSchema(semantic_tags={\"numeric\"}),\n        ColumnSchema(semantic_tags={\"numeric\"}),\n    ]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n    description_template = \"the result of {} minus {}\"\n    commutative = True\n\n    def __init__(self, commutative=True):\n        self.commutative = commutative\n\n    def get_function(self):\n        return np.subtract\n\n    def generate_name(self, base_feature_names):\n        return \"%s - %s\" % (base_feature_names[0], base_feature_names[1])\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/binary/subtract_numeric_scalar.py",
    "content": "from woodwork.column_schema import ColumnSchema\n\nfrom featuretools.primitives.base.transform_primitive_base import TransformPrimitive\n\n\nclass SubtractNumericScalar(TransformPrimitive):\n    \"\"\"Subtracts a scalar from each element in the list.\n\n    Description:\n        Given a list of numeric values and a scalar, subtract\n        the given scalar from each value in the list.\n\n    Examples:\n        >>> subtract_numeric_scalar = SubtractNumericScalar(value=2)\n        >>> subtract_numeric_scalar([3, 1, 2]).tolist()\n        [1, -1, 0]\n    \"\"\"\n\n    name = \"subtract_numeric_scalar\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n\n    def __init__(self, value=0):\n        self.value = value\n        self.description_template = \"the result of {{}} minus {}\".format(self.value)\n\n    def get_function(self):\n        def subtract_scalar(vals):\n            return vals - self.value\n\n        return subtract_scalar\n\n    def generate_name(self, base_feature_names):\n        return \"%s - %s\" % (base_feature_names[0], str(self.value))\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/cumulative/__init__.py",
    "content": "from featuretools.primitives.standard.transform.cumulative.cum_count import CumCount\nfrom featuretools.primitives.standard.transform.cumulative.cum_max import CumMax\nfrom featuretools.primitives.standard.transform.cumulative.cum_mean import CumMean\nfrom featuretools.primitives.standard.transform.cumulative.cum_min import CumMin\nfrom featuretools.primitives.standard.transform.cumulative.cum_sum import CumSum\nfrom featuretools.primitives.standard.transform.cumulative.cumulative_time_since_last_false import (\n    CumulativeTimeSinceLastFalse,\n)\nfrom featuretools.primitives.standard.transform.cumulative.cumulative_time_since_last_true import (\n    CumulativeTimeSinceLastTrue,\n)\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/cumulative/cum_count.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import IntegerNullable\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass CumCount(TransformPrimitive):\n    \"\"\"Calculates the cumulative count.\n\n    Description:\n        Given a list of values, return the cumulative count\n        (or running count). There is no set window, so the\n        count at each point is calculated over all prior\n        values. `NaN` values are counted.\n\n    Examples:\n        >>> cum_count = CumCount()\n        >>> cum_count([1, 2, 3, 4, None, 5]).tolist()\n        [1, 2, 3, 4, 5, 6]\n    \"\"\"\n\n    name = \"cum_count\"\n    input_types = [\n        [ColumnSchema(semantic_tags={\"foreign_key\"})],\n        [ColumnSchema(semantic_tags={\"category\"})],\n    ]\n    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={\"numeric\"})\n    uses_full_dataframe = True\n    description_template = \"the cumulative count of {}\"\n\n    def get_function(self):\n        def cum_count(values):\n            return np.arange(1, len(values) + 1)\n\n        return cum_count\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/cumulative/cum_max.py",
    "content": "from woodwork.column_schema import ColumnSchema\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass CumMax(TransformPrimitive):\n    \"\"\"Calculates the cumulative maximum.\n\n    Description:\n        Given a list of values, return the cumulative max\n        (or running max). There is no set window, so the max\n        at each point is calculated over all prior values.\n        `NaN` values will return `NaN`, but in the window of a\n        cumulative calculation, they're ignored.\n\n    Examples:\n        >>> cum_max = CumMax()\n        >>> cum_max([1, 2, 3, 4, None, 5]).tolist()\n        [1.0, 2.0, 3.0, 4.0, nan, 5.0]\n    \"\"\"\n\n    name = \"cum_max\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n    uses_full_dataframe = True\n    description_template = \"the cumulative maximum of {}\"\n\n    def get_function(self):\n        def cum_max(values):\n            return values.cummax()\n\n        return cum_max\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/cumulative/cum_mean.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass CumMean(TransformPrimitive):\n    \"\"\"Calculates the cumulative mean.\n\n    Description:\n        Given a list of values, return the cumulative mean\n        (or running mean). There is no set window, so the\n        mean at each point is calculated over all prior values.\n        `NaN` values will return `NaN`, but in the window of a\n        cumulative calculation, they're treated as 0.\n\n    Examples:\n        >>> cum_mean = CumMean()\n        >>> cum_mean([1, 2, 3, 4, None, 5]).tolist()\n        [1.0, 1.5, 2.0, 2.5, nan, 2.5]\n    \"\"\"\n\n    name = \"cum_mean\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n    uses_full_dataframe = True\n    description_template = \"the cumulative mean of {}\"\n\n    def get_function(self):\n        def cum_mean(values):\n            return values.cumsum() / np.arange(1, len(values) + 1)\n\n        return cum_mean\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/cumulative/cum_min.py",
    "content": "from woodwork.column_schema import ColumnSchema\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass CumMin(TransformPrimitive):\n    \"\"\"Calculates the cumulative minimum.\n\n    Description:\n        Given a list of values, return the cumulative min\n        (or running min). There is no set window, so the min\n        at each point is calculated over all prior values.\n        `NaN` values will return `NaN`, but in the window of a\n        cumulative calculation, they're ignored.\n\n    Examples:\n        >>> cum_min = CumMin()\n        >>> cum_min([1, 2, -3, 4, None, 5]).tolist()\n        [1.0, 1.0, -3.0, -3.0, nan, -3.0]\n    \"\"\"\n\n    name = \"cum_min\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n    uses_full_dataframe = True\n    description_template = \"the cumulative minimum of {}\"\n\n    def get_function(self):\n        def cum_min(values):\n            return values.cummin()\n\n        return cum_min\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/cumulative/cum_sum.py",
    "content": "from woodwork.column_schema import ColumnSchema\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass CumSum(TransformPrimitive):\n    \"\"\"Calculates the cumulative sum.\n\n    Description:\n        Given a list of values, return the cumulative sum\n        (or running total). There is no set window, so the\n        sum at each point is calculated over all prior values.\n        `NaN` values will return `NaN`, but in the window of a\n        cumulative calculation, they're ignored.\n\n    Examples:\n        >>> cum_sum = CumSum()\n        >>> cum_sum([1, 2, 3, 4, None, 5]).tolist()\n        [1.0, 3.0, 6.0, 10.0, nan, 15.0]\n    \"\"\"\n\n    name = \"cum_sum\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n    uses_full_dataframe = True\n    description_template = \"the cumulative sum of {}\"\n\n    def get_function(self):\n        def cum_sum(values):\n            return values.cumsum()\n\n        return cum_sum\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/cumulative/cumulative_time_since_last_false.py",
    "content": "import numpy as np\nimport pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Boolean, Datetime, Double\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass CumulativeTimeSinceLastFalse(TransformPrimitive):\n    \"\"\"Determines the time since last `False` value.\n\n    Description:\n        Given a list of booleans and a list of corresponding\n        datetimes, determine the time at each point since the\n        last `False` value. Returns time difference in seconds.\n        `NaN` values are ignored.\n\n    Examples:\n        >>> from datetime import datetime\n        >>> cumulative_time_since_last_false = CumulativeTimeSinceLastFalse()\n        >>> booleans = [False, True, False, True]\n        >>> datetimes = [\n        ...     datetime(2011, 4, 9, 10, 30, 0),\n        ...     datetime(2011, 4, 9, 10, 30, 10),\n        ...     datetime(2011, 4, 9, 10, 30, 15),\n        ...     datetime(2011, 4, 9, 10, 30, 29)\n        ... 
]\n        >>> cumulative_time_since_last_false(datetimes, booleans).tolist()\n        [0.0, 10.0, 0.0, 14.0]\n    \"\"\"\n\n    name = \"cumulative_time_since_last_false\"\n    input_types = [\n        ColumnSchema(logical_type=Datetime, semantic_tags={\"time_index\"}),\n        ColumnSchema(logical_type=Boolean),\n    ]\n    return_type = ColumnSchema(logical_type=Double, semantic_tags={\"numeric\"})\n\n    def get_function(self):\n        def time_since_previous_false(datetime_col, bool_col):\n            if bool_col.dropna().empty:\n                return pd.Series([np.nan] * len(bool_col))\n            df = pd.DataFrame(\n                {\n                    \"datetime\": datetime_col,\n                    \"last_false_datetime\": datetime_col,\n                    \"bool\": bool_col,\n                },\n            )\n            not_false_indices = df[\"bool\"]\n            df.loc[not_false_indices, \"last_false_datetime\"] = np.nan\n            df[\"last_false_datetime\"] = df[\"last_false_datetime\"].fillna(method=\"ffill\")\n            total_seconds = (\n                pd.to_datetime(df[\"datetime\"]).subtract(df[\"last_false_datetime\"])\n            ).dt.total_seconds()\n            return pd.Series(total_seconds)\n\n        return time_since_previous_false\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/cumulative/cumulative_time_since_last_true.py",
    "content": "import numpy as np\nimport pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Boolean, Datetime, Double\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass CumulativeTimeSinceLastTrue(TransformPrimitive):\n    \"\"\"Determines the time (in seconds) since the last boolean was `True`\n    given a datetime index column and boolean column\n\n    Examples:\n        >>> from datetime import datetime\n        >>> cumulative_time_since_last_true = CumulativeTimeSinceLastTrue()\n        >>> booleans = [False, True, False, True]\n        >>> datetimes = [\n        ...     datetime(2011, 4, 9, 10, 30, 0),\n        ...     datetime(2011, 4, 9, 10, 30, 10),\n        ...     datetime(2011, 4, 9, 10, 30, 15),\n        ...     datetime(2011, 4, 9, 10, 30, 30)\n        ... ]\n        >>> cumulative_time_since_last_true(datetimes, booleans).tolist()\n        [nan, 0.0, 5.0, 0.0]\n    \"\"\"\n\n    name = \"cumulative_time_since_last_true\"\n    input_types = [\n        ColumnSchema(logical_type=Datetime, semantic_tags={\"time_index\"}),\n        ColumnSchema(logical_type=Boolean),\n    ]\n    return_type = ColumnSchema(logical_type=Double, semantic_tags={\"numeric\"})\n\n    def get_function(self):\n        def time_since_previous_true(datetime_col, bool_col):\n            if bool_col.dropna().empty:\n                return pd.Series([np.nan] * len(bool_col))\n            df = pd.DataFrame(\n                {\n                    \"datetime\": datetime_col,\n                    \"last_true_datetime\": datetime_col,\n                    \"bool\": bool_col,\n                },\n            )\n            not_false_indices = df[\"bool\"]\n            df.loc[~not_false_indices, \"last_true_datetime\"] = np.nan\n            df[\"last_true_datetime\"] = df[\"last_true_datetime\"].fillna(method=\"ffill\")\n            total_seconds = (\n                
pd.to_datetime(df[\"datetime\"]).subtract(df[\"last_true_datetime\"])\n            ).dt.total_seconds()\n            return pd.Series(total_seconds)\n\n        return time_since_previous_true\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/datetime/__init__.py",
    "content": "from featuretools.primitives.standard.transform.datetime.age import Age\nfrom featuretools.primitives.standard.transform.datetime.date_to_holiday import (\n    DateToHoliday,\n)\nfrom featuretools.primitives.standard.transform.datetime.date_to_timezone import (\n    DateToTimeZone,\n)\nfrom featuretools.primitives.standard.transform.datetime.day import Day\nfrom featuretools.primitives.standard.transform.datetime.day_of_year import DayOfYear\nfrom featuretools.primitives.standard.transform.datetime.days_in_month import (\n    DaysInMonth,\n)\nfrom featuretools.primitives.standard.transform.datetime.diff_datetime import (\n    DiffDatetime,\n)\nfrom featuretools.primitives.standard.transform.datetime.distance_to_holiday import (\n    DistanceToHoliday,\n)\nfrom featuretools.primitives.standard.transform.datetime.hour import Hour\nfrom featuretools.primitives.standard.transform.datetime.is_first_week_of_month import (\n    IsFirstWeekOfMonth,\n)\nfrom featuretools.primitives.standard.transform.datetime.is_federal_holiday import (\n    IsFederalHoliday,\n)\nfrom featuretools.primitives.standard.transform.datetime.is_leap_year import IsLeapYear\nfrom featuretools.primitives.standard.transform.datetime.is_lunch_time import (\n    IsLunchTime,\n)\nfrom featuretools.primitives.standard.transform.datetime.is_month_end import IsMonthEnd\nfrom featuretools.primitives.standard.transform.datetime.is_month_start import (\n    IsMonthStart,\n)\nfrom featuretools.primitives.standard.transform.datetime.is_quarter_end import (\n    IsQuarterEnd,\n)\nfrom featuretools.primitives.standard.transform.datetime.is_quarter_start import (\n    IsQuarterStart,\n)\nfrom featuretools.primitives.standard.transform.datetime.is_weekend import IsWeekend\nfrom featuretools.primitives.standard.transform.datetime.is_working_hours import (\n    IsWorkingHours,\n)\nfrom featuretools.primitives.standard.transform.datetime.is_year_end import IsYearEnd\nfrom 
featuretools.primitives.standard.transform.datetime.is_year_start import (\n    IsYearStart,\n)\nfrom featuretools.primitives.standard.transform.datetime.minute import Minute\nfrom featuretools.primitives.standard.transform.datetime.month import Month\nfrom featuretools.primitives.standard.transform.datetime.part_of_day import PartOfDay\nfrom featuretools.primitives.standard.transform.datetime.quarter import Quarter\nfrom featuretools.primitives.standard.transform.datetime.season import Season\nfrom featuretools.primitives.standard.transform.datetime.second import Second\nfrom featuretools.primitives.standard.transform.datetime.time_since import TimeSince\nfrom featuretools.primitives.standard.transform.datetime.time_since_previous import (\n    TimeSincePrevious,\n)\nfrom featuretools.primitives.standard.transform.datetime.week import Week\nfrom featuretools.primitives.standard.transform.datetime.weekday import Weekday\nfrom featuretools.primitives.standard.transform.datetime.year import Year\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/datetime/age.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import AgeFractional, Datetime\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass Age(TransformPrimitive):\n    \"\"\"Calculates the age in years as a floating point number given a\n       date of birth.\n\n    Description:\n        Age in years is computed by calculating the number of days between\n        the date of birth and the reference time and dividing the result\n        by 365.\n\n    Examples:\n        Determine the age of three people as of Jan 1, 2019\n        >>> import pandas as pd\n        >>> reference_date = pd.to_datetime(\"01-01-2019\")\n        >>> age = Age()\n        >>> input_ages = [pd.to_datetime(\"01-01-2000\"),\n        ...               pd.to_datetime(\"05-30-1983\"),\n        ...               pd.to_datetime(\"10-17-1997\")]\n        >>> age(input_ages, time=reference_date).tolist()\n        [19.013698630136986, 35.61643835616438, 21.221917808219178]\n    \"\"\"\n\n    name = \"age\"\n    input_types = [ColumnSchema(logical_type=Datetime, semantic_tags={\"date_of_birth\"})]\n    return_type = ColumnSchema(logical_type=AgeFractional, semantic_tags={\"numeric\"})\n    uses_calc_time = True\n    description_template = \"the age from {}\"\n\n    def get_function(self):\n        def age(x, time=None):\n            return (time - x).dt.days / 365\n\n        return age\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/datetime/date_to_holiday.py",
    "content": "import pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Categorical, Datetime\n\nfrom featuretools.primitives.base import TransformPrimitive\nfrom featuretools.primitives.standard.transform.datetime.utils import HolidayUtil\n\n\nclass DateToHoliday(TransformPrimitive):\n    \"\"\"Transforms time of an instance into the holiday name, if there is one.\n\n    Description:\n        If there is no holiday, it returns `NaN`. Currently only works for the\n        United States and Canada with dates between 1950 and 2100.\n\n    Args:\n        country (str): Country to use for determining Holidays.\n            Default is 'US'. Should be one of the available countries here:\n            https://github.com/dr-prodigy/python-holidays#available-countries\n\n    Examples:\n        >>> from datetime import datetime\n        >>> date_to_holiday = DateToHoliday()\n        >>> dates = pd.Series([datetime(2016, 1, 1),\n        ...          datetime(2016, 2, 27),\n        ...          datetime(2017, 5, 29, 10, 30, 5),\n        ...          datetime(2018, 7, 4)])\n        >>> date_to_holiday(dates).tolist()\n        [\"New Year's Day\", nan, 'Memorial Day', 'Independence Day']\n\n        We can also change the country.\n\n        >>> date_to_holiday_canada = DateToHoliday(country='Canada')\n        >>> dates = pd.Series([datetime(2016, 7, 1),\n        ...          datetime(2016, 11, 15),\n        ...          
datetime(2018, 12, 25)])\n        >>> date_to_holiday_canada(dates).tolist()\n        ['Canada Day', nan, 'Christmas Day']\n    \"\"\"\n\n    name = \"date_to_holiday\"\n    input_types = [ColumnSchema(logical_type=Datetime)]\n    return_type = ColumnSchema(logical_type=Categorical, semantic_tags={\"category\"})\n\n    def __init__(self, country=\"US\"):\n        self.country = country\n        self.holidayUtil = HolidayUtil(country)\n\n    def get_function(self):\n        def date_to_holiday(x):\n            holiday_df = self.holidayUtil.to_df()\n            df = pd.DataFrame({\"date\": x})\n            df[\"date\"] = df[\"date\"].dt.date.astype(\"datetime64[ns]\")\n\n            df = df.merge(\n                holiday_df,\n                how=\"left\",\n                left_on=\"date\",\n                right_on=\"holiday_date\",\n            )\n            return df.names.values\n\n        return date_to_holiday\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/datetime/date_to_timezone.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Categorical, Datetime\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass DateToTimeZone(TransformPrimitive):\n    \"\"\"Determines the timezone of a datetime.\n\n    Description:\n        Given a list of datetimes, extract the timezone from each\n        one. Looks for the `tzinfo` attribute on `datetime.datetime`\n        objects. If the datetime has no timezone or the date is\n        missing, return `NaN`.\n\n    Examples:\n        >>> from datetime import datetime\n        >>> from pytz import timezone\n        >>> date_to_time_zone = DateToTimeZone()\n        >>> dates = [datetime(2010, 1, 1, tzinfo=timezone(\"America/Los_Angeles\")),\n        ...          datetime(2010, 1, 1, tzinfo=timezone(\"America/New_York\")),\n        ...          datetime(2010, 1, 1, tzinfo=timezone(\"America/Chicago\")),\n        ...          datetime(2010, 1, 1)]\n        >>> date_to_time_zone(dates).tolist()\n        ['America/Los_Angeles', 'America/New_York', 'America/Chicago', nan]\n    \"\"\"\n\n    name = \"date_to_time_zone\"\n    input_types = [ColumnSchema(logical_type=Datetime)]\n    return_type = ColumnSchema(logical_type=Categorical, semantic_tags={\"category\"})\n\n    def get_function(self):\n        def date_to_time_zone(x):\n            return x.apply(lambda x: x.tzinfo.zone if x.tzinfo else np.nan)\n\n        return date_to_time_zone\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/datetime/day.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime, Ordinal\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass Day(TransformPrimitive):\n    \"\"\"Determines the day of the month from a datetime.\n\n    Examples:\n        >>> from datetime import datetime\n        >>> dates = [datetime(2019, 3, 1),\n        ...          datetime(2019, 3, 3),\n        ...          datetime(2019, 3, 31)]\n        >>> day = Day()\n        >>> day(dates).tolist()\n        [1, 3, 31]\n    \"\"\"\n\n    name = \"day\"\n    input_types = [ColumnSchema(logical_type=Datetime)]\n    return_type = ColumnSchema(\n        logical_type=Ordinal(order=list(range(1, 32))),\n        semantic_tags={\"category\"},\n    )\n\n    description_template = \"the day of the month of {}\"\n\n    def get_function(self):\n        def day(vals):\n            return vals.dt.day\n\n        return day\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/datetime/day_of_year.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime, Ordinal\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass DayOfYear(TransformPrimitive):\n    \"\"\"Determines the ordinal day of the year from the given datetime.\n\n    Description:\n        For a list of dates, return the ordinal day of the year\n        from the given datetime.\n\n    Examples:\n        >>> from datetime import datetime\n        >>> dates = [datetime(2019, 1, 1),\n        ...          datetime(2020, 12, 31),\n        ...          datetime(2020, 2, 28)]\n        >>> dayOfYear = DayOfYear()\n        >>> dayOfYear(dates).tolist()\n        [1, 366, 59]\n    \"\"\"\n\n    name = \"day_of_year\"\n    input_types = [ColumnSchema(logical_type=Datetime)]\n    return_type = ColumnSchema(\n        logical_type=Ordinal(order=list(range(1, 367))),\n        semantic_tags={\"category\"},\n    )\n\n    description_template = \"the day of year from {}\"\n\n    def get_function(self):\n        def dayOfYear(vals):\n            return vals.dt.dayofyear\n\n        return dayOfYear\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/datetime/days_in_month.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime, Ordinal\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass DaysInMonth(TransformPrimitive):\n    \"\"\"Determines the number of days in the month of given datetime.\n\n    Examples:\n        >>> from datetime import datetime\n        >>> dates = [datetime(2019, 12, 1),\n        ...          datetime(2019, 1, 3),\n        ...          datetime(2020, 2, 1)]\n        >>> days_in_month = DaysInMonth()\n        >>> days_in_month(dates).tolist()\n        [31, 31, 29]\n    \"\"\"\n\n    name = \"days_in_month\"\n    input_types = [ColumnSchema(logical_type=Datetime)]\n    return_type = ColumnSchema(\n        logical_type=Ordinal(order=list(range(1, 32))),\n        semantic_tags={\"category\"},\n    )\n\n    description_template = \"the days in the month of {}\"\n\n    def get_function(self):\n        def days_in_month(vals):\n            return vals.dt.daysinmonth\n\n        return days_in_month\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/datetime/diff_datetime.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime, Timedelta\n\nfrom featuretools.primitives.standard.transform.numeric.diff import Diff\n\n\nclass DiffDatetime(Diff):\n    \"\"\"Computes the timedelta between a datetime in a list and the\n    previous datetime in that list.\n\n    Args:\n        periods (int): The number of periods by which to shift the index row.\n            Default is 0. Periods correspond to rows.\n\n    Description:\n        Given a list of datetimes, compute the difference from the previous\n        item in the list. The result for the first element of the list will\n        always be `NaT`.\n\n    Examples:\n        >>> from datetime import datetime\n        >>> dt_values = [datetime(2019, 3, 1), datetime(2019, 6, 30), datetime(2019, 11, 17), datetime(2020, 1, 30), datetime(2020, 3, 11)]\n        >>> diff_dt = DiffDatetime()\n        >>> diff_dt(dt_values).tolist()\n        [NaT, Timedelta('121 days 00:00:00'), Timedelta('140 days 00:00:00'), Timedelta('74 days 00:00:00'), Timedelta('41 days 00:00:00')]\n\n        You can specify the number of periods to shift the values\n\n        >>> diff_dt_periods = DiffDatetime(periods = 1)\n        >>> diff_dt_periods(dt_values).tolist()\n        [NaT, NaT, Timedelta('121 days 00:00:00'), Timedelta('140 days 00:00:00'), Timedelta('74 days 00:00:00')]\n    \"\"\"\n\n    name = \"diff_datetime\"\n    input_types = [ColumnSchema(logical_type=Datetime)]\n    return_type = ColumnSchema(logical_type=Timedelta)\n    uses_full_dataframe = True\n    description_template = \"the difference from the previous value of {}\"\n\n    def __init__(self, periods=0):\n        super().__init__(periods)\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/datetime/distance_to_holiday.py",
    "content": "import pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime\n\nfrom featuretools.primitives.base import TransformPrimitive\nfrom featuretools.primitives.standard.transform.datetime.utils import HolidayUtil\n\n\nclass DistanceToHoliday(TransformPrimitive):\n    \"\"\"Computes the number of days before or after a given holiday.\n\n    Description:\n        For a list of dates, return the distance from the nearest\n        occurrence of a chosen holiday. The distance is returned in\n        days. If the closest occurrence is prior to the date given,\n        return a negative number.\n\n        If a date is missing, return `NaN`.\n\n        Currently only works with dates between 1950 and 2100.\n\n    Args:\n        holiday (str): Name of the holiday. Defaults to New Year's Day.\n\n        country (str): Specifies which country's calendar to use for the\n            given holiday. Default is `US`.\n\n    Examples:\n        >>> from datetime import datetime\n        >>> distance_to_holiday = DistanceToHoliday(\"New Year's Day\")\n        >>> dates = [datetime(2010, 1, 1),\n        ...          datetime(2012, 5, 31),\n        ...          datetime(2017, 7, 31),\n        ...          datetime(2020, 12, 31)]\n        >>> distance_to_holiday(dates).tolist()\n        [0, -151, 154, 1]\n\n        We can also control the country in which we're searching for\n            a holiday.\n\n        >>> distance_to_holiday = DistanceToHoliday(\"Canada Day\", country='Canada')\n        >>> dates = [datetime(2010, 1, 1),\n        ...          datetime(2012, 5, 31),\n        ...          datetime(2017, 7, 31),\n        ...          
datetime(2020, 12, 31)]\n        >>> distance_to_holiday(dates).tolist()\n        [181, 31, -30, 182]\n    \"\"\"\n\n    name = \"distance_to_holiday\"\n    input_types = [ColumnSchema(logical_type=Datetime)]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n    default_value = 0\n\n    def __init__(self, holiday=\"New Year's Day\", country=\"US\"):\n        self.country = country\n        self.holiday = holiday\n        self.holidayUtil = HolidayUtil(country)\n\n        available_holidays = list(set(self.holidayUtil.federal_holidays.values()))\n        if self.holiday not in available_holidays:\n            error = \"must be one of the available holidays:\\n%s\" % available_holidays\n            raise ValueError(error)\n\n    def get_function(self):\n        def distance_to_holiday(x):\n            holiday_df = self.holidayUtil.to_df()\n            holiday_df = holiday_df[holiday_df.names == self.holiday]\n\n            df = pd.DataFrame({\"date\": x})\n            df[\"x_index\"] = df.index  # store original index as a column\n            df = df.dropna()\n            df = df.sort_values(\"date\")\n            df[\"date\"] = df[\"date\"].dt.date.astype(\"datetime64[ns]\")\n\n            matches = pd.merge_asof(\n                df,\n                holiday_df,\n                left_on=\"date\",\n                right_on=\"holiday_date\",\n                direction=\"nearest\",\n                tolerance=pd.Timedelta(\"365d\"),\n            )\n            matches = matches.set_index(\"x_index\")\n            matches[\"days_diff\"] = (matches.holiday_date - matches.date).dt.days\n\n            return matches.days_diff.reindex_like(x)\n\n        return distance_to_holiday\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/datetime/hour.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime, Ordinal\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass Hour(TransformPrimitive):\n    \"\"\"Determines the hour value of a datetime.\n\n    Examples:\n        >>> from datetime import datetime\n        >>> dates = [datetime(2019, 3, 1),\n        ...          datetime(2019, 3, 3, 11, 10, 50),\n        ...          datetime(2019, 3, 31, 19, 45, 15)]\n        >>> hour = Hour()\n        >>> hour(dates).tolist()\n        [0, 11, 19]\n    \"\"\"\n\n    name = \"hour\"\n    input_types = [ColumnSchema(logical_type=Datetime)]\n    return_type = ColumnSchema(\n        logical_type=Ordinal(order=list(range(24))),\n        semantic_tags={\"category\"},\n    )\n\n    description_template = \"the hour value of {}\"\n\n    def get_function(self):\n        def hour(vals):\n            return vals.dt.hour\n\n        return hour\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/datetime/is_federal_holiday.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import BooleanNullable, Datetime\n\nfrom featuretools.primitives.base import TransformPrimitive\nfrom featuretools.primitives.standard.transform.datetime.utils import HolidayUtil\n\n\nclass IsFederalHoliday(TransformPrimitive):\n    \"\"\"Determines if a given datetime is a federal holiday.\n\n    Description:\n        This primtive currently only works for the United States\n        and Canada with dates between 1950 and 2100.\n\n    Args:\n        country (str): Country to use for determining Holidays.\n            Default is 'US'. Should be one of the available countries here:\n            https://github.com/dr-prodigy/python-holidays#available-countries\n\n    Examples:\n        >>> from datetime import datetime\n        >>> is_federal_holiday = IsFederalHoliday(country=\"US\")\n        >>> is_federal_holiday([\n        ...     datetime(2019, 7, 4, 10, 0, 30),\n        ...     datetime(2019, 2, 26)]).tolist()\n        [True, False]\n    \"\"\"\n\n    name = \"is_federal_holiday\"\n    input_types = [ColumnSchema(logical_type=Datetime)]\n    return_type = ColumnSchema(logical_type=BooleanNullable)\n\n    def __init__(self, country=\"US\"):\n        self.country = country\n        self.holidayUtil = HolidayUtil(country)\n\n    def get_function(self):\n        def is_federal_holiday(x):\n            holidays_df = self.holidayUtil.to_df()\n            is_holiday = x.dt.normalize().isin(holidays_df.holiday_date)\n            if x.isnull().values.any():\n                is_holiday = is_holiday.astype(\"object\")\n                is_holiday[x.isnull()] = np.nan\n            return is_holiday.values\n\n        return is_federal_holiday\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/datetime/is_first_week_of_month.py",
    "content": "import numpy as np\nimport pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import BooleanNullable, Datetime\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass IsFirstWeekOfMonth(TransformPrimitive):\n    \"\"\"Determines if a date falls in the first week of the month.\n\n    Description:\n        Converts a datetime to a boolean indicating if the date\n        falls in the first week of the month. The first week of\n        the month starts on day 1, and the week number is incremented\n        each Sunday.\n\n    Examples:\n        >>> from datetime import datetime\n        >>> is_first_week_of_month = IsFirstWeekOfMonth()\n        >>> times = [datetime(2019, 3, 1),\n        ...          datetime(2019, 3, 3),\n        ...          datetime(2019, 3, 31),\n        ...          datetime(2019, 3, 30)]\n        >>> is_first_week_of_month(times).tolist()\n        [True, False, False, False]\n    \"\"\"\n\n    name = \"is_first_week_of_month\"\n    input_types = [ColumnSchema(logical_type=Datetime)]\n    return_type = ColumnSchema(logical_type=BooleanNullable)\n\n    def get_function(self):\n        def is_first_week_of_month(x):\n            df = pd.DataFrame({\"date\": x})\n            df[\"first_day\"] = df.date - pd.to_timedelta(df[\"date\"].dt.day - 1, unit=\"d\")\n            df[\"dom\"] = df.date.dt.day\n            df[\"first_day_weekday\"] = df.first_day.dt.weekday\n            df[\"adjusted_dom\"] = df.dom + df.first_day_weekday + 1\n            df.loc[df[\"first_day_weekday\"].astype(float) == 6.0, \"adjusted_dom\"] = df[\n                \"dom\"\n            ]\n            df[\"is_first_week\"] = np.ceil(df.adjusted_dom / 7.0) == 1.0\n            if df[\"date\"].isnull().values.any():\n                df[\"is_first_week\"] = df[\"is_first_week\"].astype(\"object\")\n                df.loc[df[\"date\"].isnull(), \"is_first_week\"] = np.nan\n            return 
df.is_first_week.values\n\n        return is_first_week_of_month\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/datetime/is_leap_year.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import BooleanNullable, Datetime\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass IsLeapYear(TransformPrimitive):\n    \"\"\"Determines the is_leap_year attribute of a datetime column.\n\n    Examples:\n        >>> from datetime import datetime\n        >>> dates = [datetime(2019, 3, 1),\n        ...          datetime(2020, 3, 3, 11, 10, 50),\n        ...          datetime(2021, 3, 31, 19, 45, 15)]\n        >>> ily = IsLeapYear()\n        >>> ily(dates).tolist()\n        [False, True, False]\n    \"\"\"\n\n    name = \"is_leap_year\"\n    input_types = [ColumnSchema(logical_type=Datetime)]\n    return_type = ColumnSchema(logical_type=BooleanNullable)\n\n    description_template = \"whether the year of {} is a leap year\"\n\n    def get_function(self):\n        def is_leap_year(vals):\n            return vals.dt.is_leap_year\n\n        return is_leap_year\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/datetime/is_lunch_time.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import BooleanNullable, Datetime\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass IsLunchTime(TransformPrimitive):\n    \"\"\"Determines if a datetime falls during configurable lunch hour, on a 24-hour clock.\n\n    Args:\n        lunch_hour (int): Hour when lunch is taken. Must adhere to 24-hour clock. Defaults to 12.\n\n    Examples:\n        >>> import numpy as np\n        >>> from datetime import datetime\n        >>> dates = [datetime(2022, 6, 21, 12, 3, 3),\n        ...          datetime(2019, 1, 3, 4, 4, 4),\n        ...          datetime(2022, 1, 1, 11, 1, 2),\n        ...          np.nan]\n        >>> is_lunch_time = IsLunchTime()\n        >>> is_lunch_time(dates).tolist()\n        [True, False, False, False]\n        >>> is_lunch_time = IsLunchTime(11)\n        >>> is_lunch_time(dates).tolist()\n        [False, False, True, False]\n    \"\"\"\n\n    name = \"is_lunch_time\"\n    input_types = [ColumnSchema(logical_type=Datetime)]\n    return_type = ColumnSchema(logical_type=BooleanNullable)\n\n    description_template = \"whether {} falls during lunch time\"\n\n    def __init__(self, lunch_hour=12):\n        self.lunch_hour = lunch_hour\n\n    def get_function(self):\n        def is_lunch_time(vals):\n            return vals.dt.hour == self.lunch_hour\n\n        return is_lunch_time\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/datetime/is_month_end.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import BooleanNullable, Datetime\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass IsMonthEnd(TransformPrimitive):\n    \"\"\"Determines the is_month_end attribute of a datetime column.\n\n    Examples:\n        >>> from datetime import datetime\n        >>> dates = [datetime(2019, 3, 1),\n        ...          datetime(2021, 2, 28),\n        ...          datetime(2020, 2, 29)]\n        >>> ime = IsMonthEnd()\n        >>> ime(dates).tolist()\n        [False, True, True]\n    \"\"\"\n\n    name = \"is_month_end\"\n    input_types = [ColumnSchema(logical_type=Datetime)]\n    return_type = ColumnSchema(logical_type=BooleanNullable)\n\n    description_template = \"whether {} is at the end of a month\"\n\n    def get_function(self):\n        def is_month_end(vals):\n            return vals.dt.is_month_end\n\n        return is_month_end\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/datetime/is_month_start.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import BooleanNullable, Datetime\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass IsMonthStart(TransformPrimitive):\n    \"\"\"Determines the is_month_start attribute of a datetime column.\n\n    Examples:\n        >>> from datetime import datetime\n        >>> dates = [datetime(2019, 3, 1),\n        ...          datetime(2020, 2, 13),\n        ...          datetime(2020, 2, 29)]\n        >>> ims = IsMonthStart()\n        >>> ims(dates).tolist()\n        [True, False, False]\n    \"\"\"\n\n    name = \"is_month_start\"\n    input_types = [ColumnSchema(logical_type=Datetime)]\n    return_type = ColumnSchema(logical_type=BooleanNullable)\n\n    description_template = \"whether {} is at the start of a month\"\n\n    def get_function(self):\n        def is_month_start(vals):\n            return vals.dt.is_month_start\n\n        return is_month_start\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/datetime/is_quarter_end.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import BooleanNullable, Datetime\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass IsQuarterEnd(TransformPrimitive):\n    \"\"\"Determines the is_quarter_end attribute of a datetime column.\n\n    Examples:\n        >>> from datetime import datetime\n        >>> iqe = IsQuarterEnd()\n        >>> dates = [datetime(2020, 3, 31),\n        ...          datetime(2020, 1, 1)]\n        >>> iqe(dates).tolist()\n        [True, False]\n    \"\"\"\n\n    name = \"is_quarter_end\"\n    input_types = [ColumnSchema(logical_type=Datetime)]\n    return_type = ColumnSchema(logical_type=BooleanNullable)\n\n    description_template = \"whether {} is a quarter end\"\n\n    def get_function(self):\n        def is_quarter_end(vals):\n            return vals.dt.is_quarter_end\n\n        return is_quarter_end\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/datetime/is_quarter_start.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import BooleanNullable, Datetime\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass IsQuarterStart(TransformPrimitive):\n    \"\"\"Determines the is_quarter_start attribute of a datetime column.\n\n    Examples:\n        >>> from datetime import datetime\n        >>> iqs = IsQuarterStart()\n        >>> dates = [datetime(2020, 3, 31),\n        ...          datetime(2020, 1, 1)]\n        >>> iqs(dates).tolist()\n        [False, True]\n    \"\"\"\n\n    name = \"is_quarter_start\"\n    input_types = [ColumnSchema(logical_type=Datetime)]\n    return_type = ColumnSchema(logical_type=BooleanNullable)\n\n    description_template = \"whether {} is a quarter start\"\n\n    def get_function(self):\n        def is_quarter_start(vals):\n            return vals.dt.is_quarter_start\n\n        return is_quarter_start\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/datetime/is_weekend.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import BooleanNullable, Datetime\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass IsWeekend(TransformPrimitive):\n    \"\"\"Determines if a date falls on a weekend.\n\n    Examples:\n        >>> from datetime import datetime\n        >>> dates = [datetime(2019, 3, 1),\n        ...          datetime(2019, 6, 17, 11, 10, 50),\n        ...          datetime(2019, 11, 30, 19, 45, 15)]\n        >>> is_weekend = IsWeekend()\n        >>> is_weekend(dates).tolist()\n        [False, False, True]\n    \"\"\"\n\n    name = \"is_weekend\"\n    input_types = [ColumnSchema(logical_type=Datetime)]\n    return_type = ColumnSchema(logical_type=BooleanNullable)\n\n    description_template = \"whether {} occurred on a weekend\"\n\n    def get_function(self):\n        def is_weekend(vals):\n            return vals.dt.weekday > 4\n\n        return is_weekend\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/datetime/is_working_hours.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import BooleanNullable, Datetime\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass IsWorkingHours(TransformPrimitive):\n    \"\"\"Determines if a datetime falls during working hours on a 24-hour clock. Can configure start_hour and end_hour.\n\n    Args:\n        start_hour (int): Start hour of workday. Must adhere to 24-hour clock. Default is 8 (8am).\n        end_hour (int): End hour of workday. Must adhere to 24-hour clock. Default is 18 (6pm).\n\n    Examples:\n        >>> import numpy as np\n        >>> from datetime import datetime\n        >>> dates = [datetime(2022, 6, 21, 16, 3, 3),\n        ...          datetime(2019, 1, 3, 4, 4, 4),\n        ...          datetime(2022, 1, 1, 12, 1, 2),\n        ...          np.nan]\n        >>> is_working_hour = IsWorkingHours()\n        >>> is_working_hour(dates).tolist()\n        [True, False, True, False]\n        >>> is_working_hour = IsWorkingHours(15, 17)\n        >>> is_working_hour(dates).tolist()\n        [True, False, False, False]\n    \"\"\"\n\n    name = \"is_working_hours\"\n    input_types = [ColumnSchema(logical_type=Datetime)]\n    return_type = ColumnSchema(logical_type=BooleanNullable)\n\n    description_template = \"whether {} falls during working hours\"\n\n    def __init__(self, start_hour=8, end_hour=18):\n        self.start_hour = start_hour\n        self.end_hour = end_hour\n\n    def get_function(self):\n        def is_working_hours(vals):\n            return (vals.dt.hour >= self.start_hour) & (vals.dt.hour <= self.end_hour)\n\n        return is_working_hours\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/datetime/is_year_end.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import BooleanNullable, Datetime\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass IsYearEnd(TransformPrimitive):\n    \"\"\"Determines if a date falls on the end of a year.\n\n    Examples:\n        >>> import numpy as np\n        >>> from datetime import datetime\n        >>> dates = [datetime(2019, 12, 31),\n        ...          datetime(2019, 1, 1),\n        ...          datetime(2019, 11, 30),\n        ...          np.nan]\n        >>> is_year_end = IsYearEnd()\n        >>> is_year_end(dates).tolist()\n        [True, False, False, False]\n    \"\"\"\n\n    name = \"is_year_end\"\n    input_types = [ColumnSchema(logical_type=Datetime)]\n    return_type = ColumnSchema(logical_type=BooleanNullable)\n\n    description_template = \"whether {} occurred on the end of a year\"\n\n    def get_function(self):\n        def is_year_end(vals):\n            return vals.dt.is_year_end\n\n        return is_year_end\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/datetime/is_year_start.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import BooleanNullable, Datetime\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass IsYearStart(TransformPrimitive):\n    \"\"\"Determines if a date falls on the start of a year.\n\n    Examples:\n        >>> import numpy as np\n        >>> from datetime import datetime\n        >>> dates = [datetime(2019, 12, 31),\n        ...          datetime(2019, 1, 1),\n        ...          datetime(2019, 11, 30),\n        ...          np.nan]\n        >>> is_year_start = IsYearStart()\n        >>> is_year_start(dates).tolist()\n        [False, True, False, False]\n    \"\"\"\n\n    name = \"is_year_start\"\n    input_types = [ColumnSchema(logical_type=Datetime)]\n    return_type = ColumnSchema(logical_type=BooleanNullable)\n\n    description_template = \"whether {} occurred on the start of a year\"\n\n    def get_function(self):\n        def is_year_start(vals):\n            return vals.dt.is_year_start\n\n        return is_year_start\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/datetime/minute.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime, Ordinal\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass Minute(TransformPrimitive):\n    \"\"\"Determines the minutes value of a datetime.\n\n    Examples:\n        >>> from datetime import datetime\n        >>> dates = [datetime(2019, 3, 1),\n        ...          datetime(2019, 3, 3, 11, 10, 50),\n        ...          datetime(2019, 3, 31, 19, 45, 15)]\n        >>> minute = Minute()\n        >>> minute(dates).tolist()\n        [0, 10, 45]\n    \"\"\"\n\n    name = \"minute\"\n    input_types = [ColumnSchema(logical_type=Datetime)]\n    return_type = ColumnSchema(\n        logical_type=Ordinal(order=list(range(60))),\n        semantic_tags={\"category\"},\n    )\n\n    description_template = \"the minutes value of {}\"\n\n    def get_function(self):\n        def minute(vals):\n            return vals.dt.minute\n\n        return minute\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/datetime/month.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime, Ordinal\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass Month(TransformPrimitive):\n    \"\"\"Determines the month value of a datetime.\n\n    Examples:\n        >>> from datetime import datetime\n        >>> dates = [datetime(2019, 3, 1),\n        ...          datetime(2019, 6, 17, 11, 10, 50),\n        ...          datetime(2019, 11, 30, 19, 45, 15)]\n        >>> month = Month()\n        >>> month(dates).tolist()\n        [3, 6, 11]\n    \"\"\"\n\n    name = \"month\"\n    input_types = [ColumnSchema(logical_type=Datetime)]\n    return_type = ColumnSchema(\n        logical_type=Ordinal(order=list(range(1, 13))),\n        semantic_tags={\"category\"},\n    )\n\n    description_template = \"the month of {}\"\n\n    def get_function(self):\n        def month(vals):\n            return vals.dt.month\n\n        return month\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/datetime/part_of_day.py",
    "content": "import numpy as np\nimport pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Categorical, Datetime\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass PartOfDay(TransformPrimitive):\n    \"\"\"Determines the part of day of a datetime.\n\n    Description:\n        For a list of datetimes, determines the part of day the datetime\n        falls into, based on the hour.\n        If the hour falls from 4 to 5, the part of day is 'dawn'.\n        If the hour falls from 6 to 7, the part of day is 'early morning'.\n        If the hour falls from 8 to 10, the part of day is 'late morning'.\n        If the hour falls from 11 to 13, the part of day is 'noon'.\n        If the hour falls from 14 to 16, the part of day is 'afternoon'.\n        If the hour falls from 17 to 19, the part of day is 'evening'.\n        If the hour falls from 20 to 22, the part of day is 'night'.\n        If the hour falls into 23, 24, or 1 to 3, the part of day is 'midnight'.\n\n    Examples:\n        >>> from datetime import datetime\n        >>> dates = [datetime(2020, 1, 11, 6, 2, 1),\n        ...          datetime(2021, 3, 31, 4, 2, 1),\n        ...          
datetime(2020, 3, 4, 9, 2, 1)]\n        >>> part_of_day = PartOfDay()\n        >>> part_of_day(dates).tolist()\n        ['early morning', 'dawn', 'late morning']\n    \"\"\"\n\n    name = \"part_of_day\"\n    input_types = [ColumnSchema(logical_type=Datetime)]\n    return_type = ColumnSchema(logical_type=Categorical, semantic_tags={\"category\"})\n\n    description_template = \"the part of day {} falls in\"\n\n    @staticmethod\n    def construct_replacement_dict():\n        tdict = dict()\n        tdict[pd.NaT] = np.nan\n        for hour in [4, 5]:\n            tdict[hour] = \"dawn\"\n        for hour in [6, 7]:\n            tdict[hour] = \"early morning\"\n        for hour in [8, 9, 10]:\n            tdict[hour] = \"late morning\"\n        for hour in [11, 12, 13]:\n            tdict[hour] = \"noon\"\n        for hour in [14, 15, 16]:\n            tdict[hour] = \"afternoon\"\n        for hour in [17, 18, 19]:\n            tdict[hour] = \"evening\"\n        for hour in [20, 21, 22]:\n            tdict[hour] = \"night\"\n        for hour in [23, 0, 1, 2, 3]:\n            tdict[hour] = \"midnight\"\n        return tdict\n\n    def get_function(self):\n        replacement_dict = self.construct_replacement_dict()\n\n        def part_of_day(vals):\n            ans = vals.dt.hour.replace(replacement_dict)\n            return ans\n\n        return part_of_day\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/datetime/quarter.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime, Ordinal\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass Quarter(TransformPrimitive):\n    \"\"\"Determines the quarter a datetime column falls into (1, 2, 3, 4)\n\n    Examples:\n        >>> from datetime import datetime\n        >>> dates = [datetime(2019,12,1),\n        ...          datetime(2019,1,3),\n        ...          datetime(2020,2,1)]\n        >>> q = Quarter()\n        >>> q(dates).tolist()\n        [4, 1, 1]\n    \"\"\"\n\n    name = \"quarter\"\n    input_types = [ColumnSchema(logical_type=Datetime)]\n    return_type = ColumnSchema(\n        logical_type=Ordinal(order=list(range(1, 5))),\n        semantic_tags={\"category\"},\n    )\n\n    description_template = \"the quarter that describes {}\"\n\n    def get_function(self):\n        def quarter(vals):\n            return vals.dt.quarter\n\n        return quarter\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/datetime/season.py",
    "content": "from datetime import date\n\nimport pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Categorical, Datetime\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass Season(TransformPrimitive):\n    \"\"\"Determines the season of a given datetime.\n        Returns winter, spring, summer, or fall.\n        This only works for northern hemisphere.\n\n    Description:\n        Given a list of datetimes, return the season of each one\n        (`winter`, `spring`, `summer`, or `fall`).\n\n    Examples:\n        >>> from datetime import datetime\n        >>> times = [datetime(2019, 1, 1),\n        ...          datetime(2019, 4, 15),\n        ...          datetime(2019, 7, 20),\n        ...          datetime(2019, 12, 30)]\n        >>> season = Season()\n        >>> season(times).tolist()\n        ['winter', 'spring', 'summer', 'winter']\n    \"\"\"\n\n    name = \"season\"\n    input_types = [ColumnSchema(logical_type=Datetime)]\n    return_type = ColumnSchema(logical_type=Categorical, semantic_tags={\"category\"})\n\n    def get_function(self):\n        def season(x):\n            # https://stackoverflow.com/a/28688724/2512385\n            Y = 2000  # dummy leap year to allow input X-02-29 (leap day)\n            seasons = [\n                (\"winter\", (date(Y, 1, 1), date(Y, 3, 20))),\n                (\"spring\", (date(Y, 3, 21), date(Y, 6, 20))),\n                (\"summer\", (date(Y, 6, 21), date(Y, 9, 22))),\n                (\"fall\", (date(Y, 9, 23), date(Y, 12, 20))),\n                (\"winter\", (date(Y, 12, 21), date(Y, 12, 31))),\n            ]\n            x = x.apply(lambda x: x.replace(year=2000))\n\n            def get_season(dt):\n                for season, (start, end) in seasons:\n                    if not pd.isna(dt) and start <= dt.date() <= end:\n                        return season\n                return pd.NA\n\n            new = 
x.apply(get_season).astype(dtype=\"string\")\n            return new\n\n        return season\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/datetime/second.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime, Ordinal\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass Second(TransformPrimitive):\n    \"\"\"Determines the seconds value of a datetime.\n\n    Examples:\n        >>> from datetime import datetime\n        >>> dates = [datetime(2019, 3, 1),\n        ...          datetime(2019, 3, 3, 11, 10, 50),\n        ...          datetime(2019, 3, 31, 19, 45, 15)]\n        >>> second = Second()\n        >>> second(dates).tolist()\n        [0, 50, 15]\n    \"\"\"\n\n    name = \"second\"\n    input_types = [ColumnSchema(logical_type=Datetime)]\n    return_type = ColumnSchema(\n        logical_type=Ordinal(order=list(range(60))),\n        semantic_tags={\"category\"},\n    )\n\n    description_template = \"the seconds value of {}\"\n\n    def get_function(self):\n        def second(vals):\n            return vals.dt.second\n\n        return second\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/datetime/time_since.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime\n\nfrom featuretools.primitives.base import TransformPrimitive\nfrom featuretools.utils import convert_time_units\n\n\nclass TimeSince(TransformPrimitive):\n    \"\"\"Calculates time from a value to a specified cutoff datetime.\n\n    Args:\n        unit (str): Defines the unit of time to count from.\n            Defaults to Seconds. Acceptable values:\n            years, months, days, hours, minutes, seconds, milliseconds, nanoseconds\n\n    Examples:\n        >>> from datetime import datetime\n        >>> time_since = TimeSince()\n        >>> times = [datetime(2019, 3, 1, 0, 0, 0, 1),\n        ...          datetime(2019, 3, 1, 0, 0, 1, 0),\n        ...          datetime(2019, 3, 1, 0, 2, 0, 0)]\n        >>> cutoff_time = datetime(2019, 3, 1, 0, 0, 0, 0)\n        >>> values = time_since(times, time=cutoff_time)\n        >>> list(map(int, values))\n        [0, -1, -120]\n\n        Change output to nanoseconds\n\n        >>> from datetime import datetime\n        >>> time_since_nano = TimeSince(unit='nanoseconds')\n        >>> times = [datetime(2019, 3, 1, 0, 0, 0, 1),\n        ...          datetime(2019, 3, 1, 0, 0, 1, 0),\n        ...          
datetime(2019, 3, 1, 0, 2, 0, 0)]\n        >>> cutoff_time = datetime(2019, 3, 1, 0, 0, 0, 0)\n        >>> values = time_since_nano(times, time=cutoff_time)\n        >>> list(map(lambda x: int(round(x)), values))\n        [-1000, -1000000000, -120000000000]\n    \"\"\"\n\n    name = \"time_since\"\n    input_types = [ColumnSchema(logical_type=Datetime)]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n    uses_calc_time = True\n    description_template = \"the time from {} to the cutoff time\"\n\n    def __init__(self, unit=\"seconds\"):\n        self.unit = unit.lower()\n\n    def get_function(self):\n        def pd_time_since(array, time):\n            return convert_time_units((time - array).dt.total_seconds(), self.unit)\n\n        return pd_time_since\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/datetime/time_since_previous.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime\n\nfrom featuretools.primitives.base import TransformPrimitive\nfrom featuretools.utils import convert_time_units\n\n\nclass TimeSincePrevious(TransformPrimitive):\n    \"\"\"Computes the time since the previous entry in a list.\n\n    Args:\n        unit (str): Defines the unit of time to count from.\n            Defaults to Seconds. Acceptable values:\n            years, months, days, hours, minutes, seconds, milliseconds, nanoseconds\n\n    Description:\n        Given a list of datetimes, compute the time in seconds elapsed since\n        the previous item in the list. The result for the first item in the\n        list will always be `NaN`.\n\n    Examples:\n        >>> from datetime import datetime\n        >>> time_since_previous = TimeSincePrevious()\n        >>> dates = [datetime(2019, 3, 1, 0, 0, 0),\n        ...          datetime(2019, 3, 1, 0, 2, 0),\n        ...          datetime(2019, 3, 1, 0, 3, 0),\n        ...          datetime(2019, 3, 1, 0, 2, 30),\n        ...          datetime(2019, 3, 1, 0, 10, 0)]\n        >>> time_since_previous(dates).tolist()\n        [nan, 120.0, 60.0, -30.0, 450.0]\n    \"\"\"\n\n    name = \"time_since_previous\"\n    input_types = [ColumnSchema(logical_type=Datetime, semantic_tags={\"time_index\"})]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n    description_template = \"the time since the previous instance of {}\"\n\n    def __init__(self, unit=\"seconds\"):\n        self.unit = unit.lower()\n\n    def get_function(self):\n        def pd_diff(values):\n            return convert_time_units(\n                values.diff().apply(lambda x: x.total_seconds()),\n                self.unit,\n            )\n\n        return pd_diff\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/datetime/utils.py",
    "content": "from typing import Optional, Tuple\n\nimport holidays\nimport pandas as pd\n\n\nclass HolidayUtil:\n    def __init__(self, country=\"US\"):\n        try:\n            country, subdivision = self.convert_to_subdivision(country)\n            self.holidays = holidays.country_holidays(\n                country=country,\n                subdiv=subdivision,\n            )\n        except NotImplementedError:\n            available_countries = (\n                \"https://github.com/dr-prodigy/python-holidays#available-countries\"\n            )\n            error = \"must be one of the available countries:\\n%s\" % available_countries\n            raise ValueError(error)\n\n        self.federal_holidays = getattr(holidays, country)(years=range(1950, 2075))\n\n    def to_df(self):\n        holidays_df = pd.DataFrame(\n            sorted(self.federal_holidays.items()),\n            columns=[\"holiday_date\", \"names\"],\n        )\n        holidays_df.holiday_date = holidays_df.holiday_date.astype(\"datetime64[ns]\")\n        return holidays_df\n\n    def convert_to_subdivision(self, country: str) -> Tuple[str, Optional[str]]:\n        \"\"\"Convert country to country + subdivision\n\n           Created in response to library changes that changed countries to subdivisions\n\n        Args:\n            country (str): Original country name\n\n        Returns:\n            Tuple[str,Optional[str]]: country, subdivsion\n        \"\"\"\n        return {\n            \"ENGLAND\": (\"GB\", country),\n            \"NORTHERNIRELAND\": (\"GB\", country),\n            \"PORTUGALEXT\": (\"PT\", \"Ext\"),\n            \"PTE\": (\"PT\", \"Ext\"),\n            \"SCOTLAND\": (\"GB\", country),\n            \"UK\": (\"GB\", country),\n            \"WALES\": (\"GB\", country),\n        }.get(country.upper(), (country, None))\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/datetime/week.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime, Ordinal\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass Week(TransformPrimitive):\n    \"\"\"Determines the week of the year from a datetime.\n\n    Description:\n        Returns the week of the year from a datetime value. The first week\n        of the year starts on January 1, and week numbers increment each\n        Monday.\n\n    Examples:\n        >>> from datetime import datetime\n        >>> dates = [datetime(2019, 1, 3),\n        ...          datetime(2019, 6, 17, 11, 10, 50),\n        ...          datetime(2019, 11, 30, 19, 45, 15)]\n        >>> week = Week()\n        >>> week(dates).tolist()\n        [1, 25, 48]\n    \"\"\"\n\n    name = \"week\"\n    input_types = [ColumnSchema(logical_type=Datetime)]\n    return_type = ColumnSchema(\n        logical_type=Ordinal(order=list(range(1, 54))),\n        semantic_tags={\"category\"},\n    )\n\n    description_template = \"the week of the year of {}\"\n\n    def get_function(self):\n        def week(vals):\n            if hasattr(vals.dt, \"isocalendar\"):\n                return vals.dt.isocalendar().week\n            else:\n                return vals.dt.week\n\n        return week\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/datetime/weekday.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime, Ordinal\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass Weekday(TransformPrimitive):\n    \"\"\"Determines the day of the week from a datetime.\n\n    Description:\n        Returns the day of the week from a datetime value. Weeks\n        start on Monday (day 0) and run through Sunday (day 6).\n\n    Examples:\n        >>> from datetime import datetime\n        >>> dates = [datetime(2019, 3, 1),\n        ...          datetime(2019, 6, 17, 11, 10, 50),\n        ...          datetime(2019, 11, 30, 19, 45, 15)]\n        >>> weekday = Weekday()\n        >>> weekday(dates).tolist()\n        [4, 0, 5]\n    \"\"\"\n\n    name = \"weekday\"\n    input_types = [ColumnSchema(logical_type=Datetime)]\n    return_type = ColumnSchema(\n        logical_type=Ordinal(order=list(range(7))),\n        semantic_tags={\"category\"},\n    )\n\n    description_template = \"the day of the week of {}\"\n\n    def get_function(self):\n        def weekday(vals):\n            return vals.dt.weekday\n\n        return weekday\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/datetime/year.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime, Ordinal\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass Year(TransformPrimitive):\n    \"\"\"Determines the year value of a datetime.\n\n    Examples:\n        >>> from datetime import datetime\n        >>> dates = [datetime(2019, 3, 1),\n        ...          datetime(2048, 6, 17, 11, 10, 50),\n        ...          datetime(1950, 11, 30, 19, 45, 15)]\n        >>> year = Year()\n        >>> year(dates).tolist()\n        [2019, 2048, 1950]\n    \"\"\"\n\n    name = \"year\"\n    input_types = [ColumnSchema(logical_type=Datetime)]\n    return_type = ColumnSchema(\n        logical_type=Ordinal(order=list(range(1, 3000))),\n        semantic_tags={\"category\"},\n    )\n\n    description_template = \"the year of {}\"\n\n    def get_function(self):\n        def year(vals):\n            return vals.dt.year\n\n        return year\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/email/__init__.py",
    "content": "from featuretools.primitives.standard.transform.email.email_address_to_domain import (\n    EmailAddressToDomain,\n)\nfrom featuretools.primitives.standard.transform.email.is_free_email_domain import (\n    IsFreeEmailDomain,\n)\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/email/email_address_to_domain.py",
    "content": "import numpy as np\nimport pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Categorical, EmailAddress\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass EmailAddressToDomain(TransformPrimitive):\n    \"\"\"Determines the domain of an email\n\n    Description:\n        EmailAddress input should be a string. Will return Nan\n        if an invalid email address is provided, or if the input is\n        not a string.\n\n    Examples:\n        >>> email_address_to_domain = EmailAddressToDomain()\n        >>> email_address_to_domain(['name@gmail.com', 'name@featuretools.com']).tolist()\n        ['gmail.com', 'featuretools.com']\n    \"\"\"\n\n    name = \"email_address_to_domain\"\n    input_types = [ColumnSchema(logical_type=EmailAddress)]\n    return_type = ColumnSchema(logical_type=Categorical, semantic_tags={\"category\"})\n\n    def get_function(self):\n        def email_address_to_domain(emails):\n            # if the input is empty return an empty Series\n            if len(emails) == 0:\n                return pd.Series([], dtype=\"category\")\n\n            emails_df = pd.DataFrame({\"email\": emails})\n\n            # if all emails are NaN expand won't propogate NaNs and will fail on indexing\n            if emails_df[\"email\"].isnull().all():\n                emails_df[\"domain\"] = np.nan\n                emails_df[\"domain\"] = emails_df[\"domain\"].astype(object)\n            else:\n                # .str.strip() and .str.split() return NaN for NaN values and propogate NaNs into new columns\n                emails_df[\"domain\"] = (\n                    emails_df[\"email\"].str.strip().str.split(\"@\", expand=True)[1]\n                )\n            return emails_df.domain.values\n\n        return email_address_to_domain\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/email/is_free_email_domain.py",
    "content": "import numpy as np\nimport pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import BooleanNullable, EmailAddress\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass IsFreeEmailDomain(TransformPrimitive):\n    \"\"\"Determines if an email address is from a free email domain.\n\n    Description:\n        EmailAddress input should be a string. Will return Nan\n        if an invalid email address is provided, or if the input is\n        not a string. The list of free email domains used in this primitive\n        was obtained from https://github.com/willwhite/freemail/blob/master/data/free.txt.\n\n    Examples:\n        >>> is_free_email_domain = IsFreeEmailDomain()\n        >>> is_free_email_domain(['name@gmail.com', 'name@featuretools.com']).tolist()\n        [True, False]\n    \"\"\"\n\n    name = \"is_free_email_domain\"\n    input_types = [ColumnSchema(logical_type=EmailAddress)]\n    return_type = ColumnSchema(logical_type=BooleanNullable)\n    filename = \"free_email_provider_domains.txt\"\n\n    def get_function(self):\n        file_path = self.get_filepath(self.filename)\n\n        free_domains = pd.read_csv(file_path, header=None, names=[\"domain\"])\n        free_domains[\"domain\"] = free_domains.domain.str.strip()\n\n        def is_free_email_domain(emails):\n            # if the input is empty return an empty Series\n            if len(emails) == 0:\n                return pd.Series([], dtype=\"category\")\n\n            emails_df = pd.DataFrame({\"email\": emails})\n\n            # if all emails are NaN expand won't propogate NaNs and will fail on indexing\n            if emails_df[\"email\"].isnull().all():\n                emails_df[\"domain\"] = np.nan\n            else:\n                # .str.strip() and .str.split() return NaN for NaN values and propogate NaNs into new columns\n                emails_df[\"domain\"] = (\n                    
emails_df[\"email\"].str.strip().str.split(\"@\", expand=True)[1]\n                )\n\n            emails_df[\"is_free\"] = emails_df[\"domain\"].isin(free_domains[\"domain\"])\n\n            # if there are any NaN domain values, change the series type to allow for\n            # both bools and NaN values and set is_free to NaN for the NaN domains\n            if emails_df[\"domain\"].isnull().values.any():\n                emails_df[\"is_free\"] = emails_df[\"is_free\"].astype(\"object\")\n                emails_df.loc[emails_df[\"domain\"].isnull(), \"is_free\"] = np.nan\n            return emails_df.is_free.values\n\n        return is_free_email_domain\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/exponential/__init__.py",
    "content": "from featuretools.primitives.standard.transform.exponential.exponential_weighted_average import (\n    ExponentialWeightedAverage,\n)\nfrom featuretools.primitives.standard.transform.exponential.exponential_weighted_std import (\n    ExponentialWeightedSTD,\n)\nfrom featuretools.primitives.standard.transform.exponential.exponential_weighted_variance import (\n    ExponentialWeightedVariance,\n)\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/exponential/exponential_weighted_average.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Double\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass ExponentialWeightedAverage(TransformPrimitive):\n    \"\"\"Computes the exponentially weighted moving average for a series of numbers\n\n    Description:\n        Returns the exponentially weighted moving average for a series of\n        numbers. Exactly one of center of mass (com), span, half-life, and\n        alpha must be provided. Missing values can be ignored when calculating\n        weights by setting 'ignore_na' to True.\n\n    Args:\n        com (float): Specify decay in terms of center of mass for com >= 0.\n            Default is None.\n\n        span (float): Specify decay in terms of span for span >= 1.\n            Default is None.\n\n        halflife (float): Specify decay in terms of half-life for halflife > 0.\n            Default is None.\n\n        alpha (float): Specify smoothing factor alpha directly. Alpha should be\n            greater than 0 and less than or equal to 1. 
Default is None.\n\n        ignore_na (bool): Ignore missing values when calculating weights.\n            Default is False.\n\n    Examples:\n        >>> exponential_weighted_average = ExponentialWeightedAverage(com=0.5)\n        >>> exponential_weighted_average([1, 2, 3, 4]).tolist()\n        [1.0, 1.75, 2.615384615384615, 3.55]\n\n        Missing values can be ignored\n        >>> ewma_ignorena = ExponentialWeightedAverage(com=0.5, ignore_na=True)\n        >>> ewma_ignorena([1, 2, 3, None, 4]).tolist()\n        [1.0, 1.75, 2.615384615384615, 2.615384615384615, 3.55]\n    \"\"\"\n\n    name = \"exponential_weighted_average\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(logical_type=Double, semantic_tags={\"numeric\"})\n    uses_full_dataframe = True\n\n    def __init__(self, com=None, span=None, halflife=None, alpha=None, ignore_na=False):\n        if all(x is None for x in [com, span, halflife, alpha]):\n            com = 0.5\n        self.com = com\n        self.span = span\n        self.halflife = halflife\n        self.alpha = alpha\n        self.ignore_na = ignore_na\n\n    def get_function(self):\n        def exponential_weighted_average(x):\n            return x.ewm(\n                com=self.com,\n                span=self.span,\n                halflife=self.halflife,\n                alpha=self.alpha,\n                ignore_na=self.ignore_na,\n            ).mean()\n\n        return exponential_weighted_average\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/exponential/exponential_weighted_std.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Double\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass ExponentialWeightedSTD(TransformPrimitive):\n    \"\"\"Computes the exponentially weighted moving standard deviation for\n    a series of numbers\n\n    Description:\n        Returns the exponentially weighted moving standard deviation for a\n        series of numbers. Exactly one of center of mass (com), span,\n        half-life, and alpha must be provided. Missing values can be ignored\n        when calculating weights by setting 'ignore_na' to True.\n\n    Args:\n        com (float): Specify decay in terms of center of mass for com >= 0.\n            Default is None.\n\n        span (float): Specify decay in terms of span for span >= 1.\n            Default is None.\n\n        halflife (float): Specify decay in terms of half-life for halflife > 0.\n            Default is None.\n\n        alpha (float): Specify smoothing factor alpha directly. Alpha should be\n            greater than 0 and less than or equal to 1. 
Default is None.\n\n        ignore_na (bool): Ignore missing values when calculating weights.\n            Default is False.\n\n    Examples:\n        >>> exponential_weighted_std = ExponentialWeightedSTD(com=0.5)\n        >>> exponential_weighted_std([1, 2, 3, 7]).tolist()\n        [nan, 0.7071067811865475, 0.9198662110077998, 2.9852200022005855]\n\n        Missing values can be ignored\n\n        >>> ewmstd_ignorena = ExponentialWeightedSTD(com=0.5, ignore_na=True)\n        >>> ewmstd_ignorena([1, 2, 3, None, 7]).tolist()\n        [nan, 0.7071067811865475, 0.9198662110077998, 0.9198662110077998, 2.9852200022005855]\n    \"\"\"\n\n    name = \"exponential_weighted_std\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(logical_type=Double, semantic_tags={\"numeric\"})\n    uses_full_dataframe = True\n\n    def __init__(self, com=None, span=None, halflife=None, alpha=None, ignore_na=False):\n        if all(x is None for x in [com, span, halflife, alpha]):\n            com = 0.5\n        self.com = com\n        self.span = span\n        self.halflife = halflife\n        self.alpha = alpha\n        self.ignore_na = ignore_na\n\n    def get_function(self):\n        def exponential_weighted_std(x):\n            return x.ewm(\n                com=self.com,\n                span=self.span,\n                halflife=self.halflife,\n                alpha=self.alpha,\n                ignore_na=self.ignore_na,\n            ).std()\n\n        return exponential_weighted_std\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/exponential/exponential_weighted_variance.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Double\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass ExponentialWeightedVariance(TransformPrimitive):\n    \"\"\"Computes the exponentially weighted moving variance for a series of numbers\n\n    Description:\n        Returns the exponentially weighted moving variance for a series of\n        numbers. Exactly one of center of mass (com), span, half-life, and\n        alpha must be provided. Missing values can be ignored when calculating\n        weights by setting 'ignore_na' to True.\n\n    Args:\n        com (float): Specify decay in terms of center of mass for com >= 0.\n            Default is None.\n\n        span (float): Specify decay in terms of span for span >= 1.\n            Default is None.\n\n        halflife (float): Specify decay in terms of half-life for halflife > 0.\n            Default is None.\n\n        alpha (float): Specify smoothing factor alpha directly. Alpha should be\n            greater than 0 and less than or equal to 1. 
Default is None.\n\n        ignore_na (bool): Ignore missing values when calculating weights.\n            Default is False.\n\n    Examples:\n        >>> exponential_weighted_variance = ExponentialWeightedVariance(com=0.5)\n        >>> exponential_weighted_variance([1, 2, 3, 4]).tolist()\n        [nan, 0.49999999999999983, 0.8461538461538459, 1.1230769230769233]\n\n        Missing values can be ignored\n\n        >>> ewmv_ignorena = ExponentialWeightedVariance(com=0.5, ignore_na=True)\n        >>> ewmv_ignorena([1, 2, 3, None, 4]).tolist()\n        [nan, 0.49999999999999983, 0.8461538461538459, 0.8461538461538459, 1.1230769230769233]\n    \"\"\"\n\n    name = \"exponential_weighted_variance\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(logical_type=Double, semantic_tags={\"numeric\"})\n    uses_full_dataframe = True\n\n    def __init__(self, com=None, span=None, halflife=None, alpha=None, ignore_na=False):\n        if all(x is None for x in [com, span, halflife, alpha]):\n            com = 0.5\n        self.com = com\n        self.span = span\n        self.halflife = halflife\n        self.alpha = alpha\n        self.ignore_na = ignore_na\n\n    def get_function(self):\n        def exponential_weighted_variance(x):\n            return x.ewm(\n                com=self.com,\n                span=self.span,\n                halflife=self.halflife,\n                alpha=self.alpha,\n                ignore_na=self.ignore_na,\n            ).var()\n\n        return exponential_weighted_variance\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/file_extension.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Filepath\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass FileExtension(TransformPrimitive):\n    \"\"\"Determines the extension of a filepath.\n\n    Description:\n        Given a list of filepaths, return the extension\n        suffix of each one. If the filepath is missing\n        or invalid, return `NaN`.\n\n    Examples:\n        >>> file_extension = FileExtension()\n        >>> file_extension(['doc.txt', '~/documents/data.json', 'file']).tolist()\n        ['.txt', '.json', nan]\n    \"\"\"\n\n    name = \"file_extension\"\n    input_types = [ColumnSchema(logical_type=Filepath)]\n    return_type = ColumnSchema(semantic_tags={\"category\"})\n\n    def get_function(self):\n        def file_extension(x):\n            p = r\"(\\.[a-z|A-Z]+$)\"\n            return x.str.extract(p, expand=False).str.lower()\n\n        return file_extension\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/full_name_to_first_name.py",
    "content": "import pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Categorical, PersonFullName\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass FullNameToFirstName(TransformPrimitive):\n    \"\"\"Determines the first name from a person's name.\n\n    Description:\n        Given a list of names, determines the first name. If\n        only a single name is provided, assume this is a first name.\n        If only a title and a single name is provided return `nan`.\n        This assumes all titles will be followed by a period. Please note,\n        in the current implementation, last names containing spaces may\n        result in improper first name matches.\n\n\n    Examples:\n        >>> full_name_to_first_name = FullNameToFirstName()\n        >>> names = ['Woolf Spector', 'Oliva y Ocana, Dona. Fermina',\n        ...          'Ware, Mr. Frederick', 'Peter, Michael J', 'Mr. Brown']\n        >>> full_name_to_first_name(names).to_list()\n        ['Woolf', 'Oliva', 'Frederick', 'Michael', nan]\n    \"\"\"\n\n    name = \"full_name_to_first_name\"\n    input_types = [ColumnSchema(logical_type=PersonFullName)]\n    return_type = ColumnSchema(logical_type=Categorical, semantic_tags={\"category\"})\n\n    def get_function(self):\n        def full_name_to_first_name(x):\n            title_with_last_pattern = r\"(^[A-Z][a-z]+\\. [A-Z][a-z]+$)\"\n            titles_pattern = r\"([A-Z][a-z]+)\\. 
\"\n            df = pd.DataFrame({\"names\": x})\n            # remove any entries with just a title and a name\n            df[\"names\"] = df[\"names\"].str.replace(\n                title_with_last_pattern,\n                \"\",\n                regex=True,\n            )\n            # remove any known titles\n            df[\"names\"] = df[\"names\"].str.replace(titles_pattern, \"\", regex=True)\n            # extract first names\n            pattern = r\"([A-Z][a-z]+ |, [A-Z][a-z]+$|^[A-Z][a-z]+$)\"\n            df[\"first_name\"] = df[\"names\"].str.extract(pattern)\n            # clean up white space and leftover commas\n            df[\"first_name\"] = df[\"first_name\"].str.replace(\",\", \"\").str.strip()\n            return df[\"first_name\"]\n\n        return full_name_to_first_name\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/full_name_to_last_name.py",
    "content": "import pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Categorical, PersonFullName\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass FullNameToLastName(TransformPrimitive):\n    \"\"\"Determines the first name from a person's name.\n\n    Description:\n        Given a list of names, determines the last name. If\n        only a single name is provided, assume this is a first name, and\n        return `nan`. This assumes all titles will be followed by a period.\n\n\n    Examples:\n        >>> full_name_to_last_name = FullNameToLastName()\n        >>> names = ['Woolf Spector', 'Oliva y Ocana, Dona. Fermina',\n        ...          'Ware, Mr. Frederick', 'Peter, Michael J', 'Mr. Brown']\n        >>> full_name_to_last_name(names).to_list()\n        ['Spector', 'Oliva y Ocana', 'Ware', 'Peter', 'Brown']\n    \"\"\"\n\n    name = \"full_name_to_last_name\"\n    input_types = [ColumnSchema(logical_type=PersonFullName)]\n    return_type = ColumnSchema(logical_type=Categorical, semantic_tags={\"category\"})\n\n    def get_function(self):\n        def full_name_to_last_name(x):\n            titles_pattern = r\"([A-Z][a-z]+)\\. \"\n            df = pd.DataFrame({\"names\": x})\n            # extract initial names\n            pattern = r\"(^.+?,|^[A-Z][a-z]+\\. [A-Z][a-z]+$| [A-Z][a-z]+$| [A-Z][a-z]+[/-][A-Z][a-z]+$)\"\n            df[\"last_name\"] = df[\"names\"].str.extract(pattern)\n            # remove titles\n            df[\"last_name\"] = df[\"last_name\"].str.replace(\n                titles_pattern,\n                \"\",\n                regex=True,\n            )\n            # clean up white space and leftover commas\n            df[\"last_name\"] = df[\"last_name\"].str.replace(\",\", \"\").str.strip()\n            return df[\"last_name\"]\n\n        return full_name_to_last_name\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/full_name_to_title.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Categorical, PersonFullName\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass FullNameToTitle(TransformPrimitive):\n    \"\"\"Determines the title from a person's name.\n\n    Description:\n        Given a list of names, determines the title, or\n        prefix of each name (e.g. \"Mr\", \"Mrs\", etc). If\n        no title is found, returns `NaN`.\n\n    Examples:\n        >>> full_name_to_title = FullNameToTitle()\n        >>> names = ['Spector, Mr. Woolf', 'Oliva y Ocana, Dona. Fermina',\n        ...          'Ware, Mr. Frederick', 'Peter, Michael J', 'Mr. Brown']\n        >>> full_name_to_title(names).to_list()\n        ['Mr', 'Dona', 'Mr', nan, 'Mr']\n    \"\"\"\n\n    name = \"full_name_to_title\"\n    input_types = [ColumnSchema(logical_type=PersonFullName)]\n    return_type = ColumnSchema(logical_type=Categorical, semantic_tags={\"category\"})\n\n    def get_function(self):\n        def full_name_to_title(x):\n            pattern = r\"([A-Z][a-z]+)\\. \"\n            return x.str.extract(pattern, expand=True)[0]\n\n        return full_name_to_title\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/is_in.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Boolean\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass IsIn(TransformPrimitive):\n    \"\"\"Determines whether a value is present in a provided list.\n\n    Examples:\n        >>> items = ['string', 10.3, False]\n        >>> is_in = IsIn(list_of_outputs=items)\n        >>> is_in(['string', 10.5, False]).tolist()\n        [True, False, True]\n    \"\"\"\n\n    name = \"isin\"\n    input_types = [ColumnSchema()]\n    return_type = ColumnSchema(logical_type=Boolean)\n\n    def __init__(self, list_of_outputs=None):\n        self.list_of_outputs = list_of_outputs\n        if not list_of_outputs:\n            stringified_output_list = \"[]\"\n        else:\n            stringified_output_list = \", \".join([str(x) for x in list_of_outputs])\n        self.description_template = \"whether {{}} is in {}\".format(\n            stringified_output_list,\n        )\n\n    def get_function(self):\n        def pd_is_in(array):\n            return array.isin(self.list_of_outputs or [])\n\n        return pd_is_in\n\n    def generate_name(self, base_feature_names):\n        return \"%s.isin(%s)\" % (base_feature_names[0], str(self.list_of_outputs))\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/is_null.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Boolean\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass IsNull(TransformPrimitive):\n    \"\"\"Determines if a value is null.\n\n    Examples:\n        >>> is_null = IsNull()\n        >>> is_null([1, None, 3]).tolist()\n        [False, True, False]\n    \"\"\"\n\n    name = \"is_null\"\n    input_types = [ColumnSchema()]\n    return_type = ColumnSchema(logical_type=Boolean)\n    description_template = \"whether {} is null\"\n\n    def get_function(self):\n        def isnull(array):\n            return array.isnull()\n\n        return isnull\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/latlong/__init__.py",
    "content": "from featuretools.primitives.standard.transform.latlong.cityblock_distance import (\n    CityblockDistance,\n)\nfrom featuretools.primitives.standard.transform.latlong.geomidpoint import GeoMidpoint\nfrom featuretools.primitives.standard.transform.latlong.haversine import Haversine\nfrom featuretools.primitives.standard.transform.latlong.is_in_geobox import IsInGeoBox\nfrom featuretools.primitives.standard.transform.latlong.latitude import Latitude\nfrom featuretools.primitives.standard.transform.latlong.longitude import Longitude\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/latlong/cityblock_distance.py",
    "content": "import numpy as np\nimport pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Double, LatLong\n\nfrom featuretools.primitives.base import TransformPrimitive\nfrom featuretools.primitives.standard.transform.latlong.utils import (\n    _haversine_calculate,\n)\n\n\nclass CityblockDistance(TransformPrimitive):\n    \"\"\"Calculates the distance between points in a city road grid.\n\n    Description:\n        This distance is calculated using the haversine formula, which\n        takes into account the curvature of the Earth.\n        If either input data contains `NaN`s, the calculated\n        distance with be `NaN`.\n        This calculation is also known as the Mahnattan distance.\n\n    Args:\n        unit (str): Determines the unit value to output. Could\n            be miles or kilometers. Default is miles.\n\n    Examples:\n        >>> cityblock_distance = CityblockDistance()\n        >>> DC = (38, -77)\n        >>> Boston = (43, -71)\n        >>> NYC = (40, -74)\n        >>> distances_mi = cityblock_distance([DC, DC], [NYC, Boston])\n        >>> np.round(distances_mi, 3).tolist()\n        [301.519, 672.089]\n\n        We can also change the units in which the distance is calculated.\n\n        >>> cityblock_distance_kilometers = CityblockDistance(unit='kilometers')\n        >>> distances_km = cityblock_distance_kilometers([DC, DC], [NYC, Boston])\n        >>> np.round(distances_km, 3).tolist()\n        [485.248, 1081.622]\n    \"\"\"\n\n    name = \"cityblock_distance\"\n    input_types = [\n        ColumnSchema(logical_type=LatLong),\n        ColumnSchema(logical_type=LatLong),\n    ]\n    return_type = ColumnSchema(logical_type=Double, semantic_tags={\"numeric\"})\n    commutative = True\n\n    def __init__(self, unit=\"miles\"):\n        if unit not in [\"miles\", \"kilometers\"]:\n            raise ValueError(\"Invalid unit given\")\n        self.unit = unit\n\n    def get_function(self):\n     
   def cityblock(latlong_1, latlong_2):\n            latlong_1 = np.array(latlong_1.tolist())\n            latlong_2 = np.array(latlong_2.tolist())\n            lat_1s = latlong_1[:, 0]\n            lat_2s = latlong_2[:, 0]\n            lon_1s = latlong_1[:, 1]\n            lon_2s = latlong_2[:, 1]\n            lon_dis = _haversine_calculate(lat_1s, lon_1s, lat_1s, lon_2s, self.unit)\n            lat_dist = _haversine_calculate(lat_1s, lon_1s, lat_2s, lon_1s, self.unit)\n            return pd.Series(lon_dis + lat_dist)\n\n        return cityblock\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/latlong/geomidpoint.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import LatLong\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass GeoMidpoint(TransformPrimitive):\n    \"\"\"Determines the geographic center of two coordinates.\n\n    Examples:\n        >>> geomidpoint = GeoMidpoint()\n        >>> geomidpoint([(42.4, -71.1)], [(40.0, -122.4)])\n        [(41.2, -96.75)]\n    \"\"\"\n\n    name = \"geomidpoint\"\n    input_types = [\n        ColumnSchema(logical_type=LatLong),\n        ColumnSchema(logical_type=LatLong),\n    ]\n    return_type = ColumnSchema(logical_type=LatLong)\n    commutative = True\n\n    def get_function(self):\n        def geomidpoint_func(latlong_1, latlong_2):\n            latlong_1 = np.array(latlong_1.tolist())\n            latlong_2 = np.array(latlong_2.tolist())\n            lat_1s = latlong_1[:, 0]\n            lat_2s = latlong_2[:, 0]\n            lon_1s = latlong_1[:, 1]\n            lon_2s = latlong_2[:, 1]\n\n            lat_middle = np.array([lat_1s, lat_2s]).transpose().mean(axis=1)\n            lon_middle = np.array([lon_1s, lon_2s]).transpose().mean(axis=1)\n            return list(zip(lat_middle, lon_middle))\n\n        return geomidpoint_func\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/latlong/haversine.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import LatLong\n\nfrom featuretools.primitives.base import TransformPrimitive\nfrom featuretools.primitives.standard.transform.latlong.utils import (\n    _haversine_calculate,\n)\n\n\nclass Haversine(TransformPrimitive):\n    \"\"\"Calculates the approximate haversine distance between two LatLong columns.\n\n    Args:\n        unit (str): Determines the unit value to output. Could\n            be `miles` or `kilometers`. Default is `miles`.\n\n    Examples:\n        >>> haversine = Haversine()\n        >>> distances = haversine([(42.4, -71.1), (40.0, -122.4)],\n        ...                       [(40.0, -122.4), (41.2, -96.75)])\n        >>> np.round(distances, 3).tolist()\n        [2631.231, 1343.289]\n\n        Output units can be specified\n\n        >>> haversine_km = Haversine(unit='kilometers')\n        >>> distances_km = haversine_km([(42.4, -71.1), (40.0, -122.4)],\n        ...                             [(40.0, -122.4), (41.2, -96.75)])\n        >>> np.round(distances_km, 3).tolist()\n        [4234.555, 2161.814]\n    \"\"\"\n\n    name = \"haversine\"\n    input_types = [\n        ColumnSchema(logical_type=LatLong),\n        ColumnSchema(logical_type=LatLong),\n    ]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n    commutative = True\n\n    def __init__(self, unit=\"miles\"):\n        valid_units = [\"miles\", \"kilometers\"]\n        if unit not in valid_units:\n            error_message = \"Invalid unit %s provided. 
Must be one of %s\" % (\n                unit,\n                valid_units,\n            )\n            raise ValueError(error_message)\n        self.unit = unit\n        self.description_template = (\n            \"the haversine distance in {} between {{}} and {{}}\".format(self.unit)\n        )\n\n    def get_function(self):\n        def haversine(latlong_1, latlong_2):\n            latlong_1 = np.array(latlong_1.tolist())\n            latlong_2 = np.array(latlong_2.tolist())\n            lat_1s = latlong_1[:, 0]\n            lat_2s = latlong_2[:, 0]\n            lon_1s = latlong_1[:, 1]\n            lon_2s = latlong_2[:, 1]\n\n            distance = _haversine_calculate(lat_1s, lon_1s, lat_2s, lon_2s, self.unit)\n            return distance\n\n        return haversine\n\n    def generate_name(self, base_feature_names):\n        name = \"{}(\".format(self.name.upper())\n        name += \", \".join(base_feature_names)\n        if self.unit != \"miles\":\n            name += \", unit={}\".format(self.unit)\n        name += \")\"\n        return name\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/latlong/is_in_geobox.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import BooleanNullable, LatLong\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass IsInGeoBox(TransformPrimitive):\n    \"\"\"Determines if coordinates are inside a box defined by two\n    corner coordinate points.\n\n    Description:\n        Coordinate values should be specified as (latitude, longitude)\n        tuples. This primitive is unable to handle coordinates and boxes\n        at the poles, and near +/- 180 degrees latitude.\n\n    Args:\n        point1 (tuple(float, float)): The coordinates\n            of the first corner of the box. Defaults to (0, 0).\n        point2 (tuple(float, float)): The coordinates\n            of the diagonal corner of the box. Defaults to (0, 0).\n\n    Example:\n        >>> is_in_geobox = IsInGeoBox((40.7128, -74.0060), (42.2436, -71.1677))\n        >>> is_in_geobox([(41.034, -72.254), (39.125, -87.345)]).tolist()\n        [True, False]\n    \"\"\"\n\n    name = \"is_in_geobox\"\n    input_types = [ColumnSchema(logical_type=LatLong)]\n    return_type = ColumnSchema(logical_type=BooleanNullable)\n\n    def __init__(self, point1=(0, 0), point2=(0, 0)):\n        self.point1 = point1\n        self.point2 = point2\n        self.lats = np.sort(np.array([point1[0], point2[0]]))\n        self.lons = np.sort(np.array([point1[1], point2[1]]))\n\n    def get_function(self):\n        def geobox(latlongs):\n            transposed = np.transpose(np.array(latlongs.tolist()))\n            lats = (self.lats[0] <= transposed[0]) & (self.lats[1] >= transposed[0])\n            longs = (self.lons[0] <= transposed[1]) & (self.lons[1] >= transposed[1])\n            return lats & longs\n\n        return geobox\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/latlong/latitude.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import LatLong\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass Latitude(TransformPrimitive):\n    \"\"\"Returns the first tuple value in a list of LatLong tuples.\n       For use with the LatLong logical type.\n\n    Examples:\n        >>> latitude = Latitude()\n        >>> latitude([(42.4, -71.1),\n        ...            (40.0, -122.4),\n        ...            (41.2, -96.75)]).tolist()\n        [42.4, 40.0, 41.2]\n    \"\"\"\n\n    name = \"latitude\"\n    input_types = [ColumnSchema(logical_type=LatLong)]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n    description_template = \"the latitude of {}\"\n\n    def get_function(self):\n        def latitude(latlong):\n            latlong = np.array(latlong.tolist())\n            return latlong[:, 0]\n\n        return latitude\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/latlong/longitude.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import LatLong\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass Longitude(TransformPrimitive):\n    \"\"\"Returns the second tuple value in a list of LatLong tuples.\n       For use with the LatLong logical type.\n\n    Examples:\n        >>> longitude = Longitude()\n        >>> longitude([(42.4, -71.1),\n        ...            (40.0, -122.4),\n        ...            (41.2, -96.75)]).tolist()\n        [-71.1, -122.4, -96.75]\n    \"\"\"\n\n    name = \"longitude\"\n    input_types = [ColumnSchema(logical_type=LatLong)]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n    description_template = \"the longitude of {}\"\n\n    def get_function(self):\n        def longitude(latlong):\n            latlong = np.array(latlong.tolist())\n            return latlong[:, 1]\n\n        return longitude\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/latlong/utils.py",
    "content": "import numpy as np\n\n\ndef _haversine_calculate(lat_1s, lon_1s, lat_2s, lon_2s, unit):\n    # https://stackoverflow.com/a/29546836/2512385\n    lon1, lat1, lon2, lat2 = map(np.radians, [lon_1s, lat_1s, lon_2s, lat_2s])\n    dlon = lon2 - lon1\n    dlat = lat2 - lat1\n    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2\n    radius_earth = 3958.7613\n    if unit == \"kilometers\":\n        radius_earth = 6371.0088\n    distances = radius_earth * 2 * np.arcsin(np.sqrt(a))\n    return distances\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/natural_language/__init__.py",
    "content": "from featuretools.primitives.standard.transform.natural_language.count_string import (\n    CountString,\n)\nfrom featuretools.primitives.standard.transform.natural_language.mean_characters_per_word import (\n    MeanCharactersPerWord,\n)\nfrom featuretools.primitives.standard.transform.natural_language.median_word_length import (\n    MedianWordLength,\n)\nfrom featuretools.primitives.standard.transform.natural_language.num_characters import (\n    NumCharacters,\n)\nfrom featuretools.primitives.standard.transform.natural_language.num_unique_separators import (\n    NumUniqueSeparators,\n)\nfrom featuretools.primitives.standard.transform.natural_language.num_words import (\n    NumWords,\n)\nfrom featuretools.primitives.standard.transform.natural_language.number_of_common_words import (\n    NumberOfCommonWords,\n)\nfrom featuretools.primitives.standard.transform.natural_language.number_of_hashtags import (\n    NumberOfHashtags,\n)\nfrom featuretools.primitives.standard.transform.natural_language.number_of_mentions import (\n    NumberOfMentions,\n)\nfrom featuretools.primitives.standard.transform.natural_language.number_of_unique_words import (\n    NumberOfUniqueWords,\n)\nfrom featuretools.primitives.standard.transform.natural_language.number_of_words_in_quotes import (\n    NumberOfWordsInQuotes,\n)\nfrom featuretools.primitives.standard.transform.natural_language.punctuation_count import (\n    PunctuationCount,\n)\nfrom featuretools.primitives.standard.transform.natural_language.title_word_count import (\n    TitleWordCount,\n)\nfrom featuretools.primitives.standard.transform.natural_language.total_word_length import (\n    TotalWordLength,\n)\nfrom featuretools.primitives.standard.transform.natural_language.upper_case_count import (\n    UpperCaseCount,\n)\nfrom featuretools.primitives.standard.transform.natural_language.upper_case_word_count import (\n    UpperCaseWordCount,\n)\nfrom 
featuretools.primitives.standard.transform.natural_language.whitespace_count import (\n    WhitespaceCount,\n)\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/natural_language/constants.py",
    "content": "from string import punctuation\n\nDELIMITERS = \"[ \\n\\t]\"\nPUNCTUATION_AND_WHITESPACE = f\"[{punctuation}\\n\\t ]\"\n\ncommon_words_1000 = frozenset(\n    [\n        \"the\",\n        \"of\",\n        \"to\",\n        \"and\",\n        \"a\",\n        \"in\",\n        \"is\",\n        \"it\",\n        \"you\",\n        \"that\",\n        \"he\",\n        \"was\",\n        \"for\",\n        \"on\",\n        \"are\",\n        \"with\",\n        \"as\",\n        \"i\",\n        \"his\",\n        \"they\",\n        \"be\",\n        \"at\",\n        \"one\",\n        \"have\",\n        \"this\",\n        \"from\",\n        \"or\",\n        \"had\",\n        \"by\",\n        \"not\",\n        \"word\",\n        \"but\",\n        \"what\",\n        \"some\",\n        \"we\",\n        \"can\",\n        \"out\",\n        \"other\",\n        \"were\",\n        \"all\",\n        \"there\",\n        \"when\",\n        \"up\",\n        \"use\",\n        \"your\",\n        \"how\",\n        \"said\",\n        \"an\",\n        \"each\",\n        \"she\",\n        \"which\",\n        \"do\",\n        \"their\",\n        \"time\",\n        \"if\",\n        \"will\",\n        \"way\",\n        \"about\",\n        \"many\",\n        \"then\",\n        \"them\",\n        \"write\",\n        \"would\",\n        \"like\",\n        \"so\",\n        \"these\",\n        \"her\",\n        \"long\",\n        \"make\",\n        \"thing\",\n        \"see\",\n        \"him\",\n        \"two\",\n        \"has\",\n        \"look\",\n        \"more\",\n        \"day\",\n        \"could\",\n        \"go\",\n        \"come\",\n        \"did\",\n        \"number\",\n        \"sound\",\n        \"no\",\n        \"most\",\n        \"people\",\n        \"my\",\n        \"over\",\n        \"know\",\n        \"water\",\n        \"than\",\n        \"call\",\n        \"first\",\n        \"who\",\n        \"may\",\n        \"down\",\n        \"side\",\n        \"been\",\n        \"now\",\n 
       \"find\",\n        \"any\",\n        \"new\",\n        \"work\",\n        \"part\",\n        \"take\",\n        \"get\",\n        \"place\",\n        \"made\",\n        \"live\",\n        \"where\",\n        \"after\",\n        \"back\",\n        \"little\",\n        \"only\",\n        \"round\",\n        \"man\",\n        \"year\",\n        \"came\",\n        \"show\",\n        \"every\",\n        \"good\",\n        \"me\",\n        \"give\",\n        \"our\",\n        \"under\",\n        \"name\",\n        \"very\",\n        \"through\",\n        \"just\",\n        \"form\",\n        \"sentence\",\n        \"great\",\n        \"think\",\n        \"say\",\n        \"help\",\n        \"low\",\n        \"line\",\n        \"differ\",\n        \"turn\",\n        \"cause\",\n        \"much\",\n        \"mean\",\n        \"before\",\n        \"move\",\n        \"right\",\n        \"boy\",\n        \"old\",\n        \"too\",\n        \"same\",\n        \"tell\",\n        \"does\",\n        \"set\",\n        \"three\",\n        \"want\",\n        \"air\",\n        \"well\",\n        \"also\",\n        \"play\",\n        \"small\",\n        \"end\",\n        \"put\",\n        \"home\",\n        \"read\",\n        \"hand\",\n        \"port\",\n        \"large\",\n        \"spell\",\n        \"add\",\n        \"even\",\n        \"land\",\n        \"here\",\n        \"must\",\n        \"big\",\n        \"high\",\n        \"such\",\n        \"follow\",\n        \"act\",\n        \"why\",\n        \"ask\",\n        \"men\",\n        \"change\",\n        \"went\",\n        \"light\",\n        \"kind\",\n        \"off\",\n        \"need\",\n        \"house\",\n        \"picture\",\n        \"try\",\n        \"us\",\n        \"again\",\n        \"animal\",\n        \"point\",\n        \"mother\",\n        \"world\",\n        \"near\",\n        \"build\",\n        \"self\",\n        \"earth\",\n        \"father\",\n        \"head\",\n        \"stand\",\n        \"own\",\n    
    \"page\",\n        \"should\",\n        \"country\",\n        \"found\",\n        \"answer\",\n        \"school\",\n        \"grow\",\n        \"study\",\n        \"still\",\n        \"learn\",\n        \"plant\",\n        \"cover\",\n        \"food\",\n        \"sun\",\n        \"four\",\n        \"between\",\n        \"state\",\n        \"keep\",\n        \"eye\",\n        \"never\",\n        \"last\",\n        \"let\",\n        \"thought\",\n        \"city\",\n        \"tree\",\n        \"cross\",\n        \"farm\",\n        \"hard\",\n        \"start\",\n        \"might\",\n        \"story\",\n        \"saw\",\n        \"far\",\n        \"sea\",\n        \"draw\",\n        \"left\",\n        \"late\",\n        \"run\",\n        \"don't\",\n        \"while\",\n        \"press\",\n        \"close\",\n        \"night\",\n        \"real\",\n        \"life\",\n        \"few\",\n        \"north\",\n        \"open\",\n        \"seem\",\n        \"together\",\n        \"next\",\n        \"white\",\n        \"children\",\n        \"begin\",\n        \"got\",\n        \"walk\",\n        \"example\",\n        \"ease\",\n        \"paper\",\n        \"group\",\n        \"always\",\n        \"music\",\n        \"those\",\n        \"both\",\n        \"mark\",\n        \"often\",\n        \"letter\",\n        \"until\",\n        \"mile\",\n        \"river\",\n        \"car\",\n        \"feet\",\n        \"care\",\n        \"second\",\n        \"book\",\n        \"carry\",\n        \"took\",\n        \"science\",\n        \"eat\",\n        \"room\",\n        \"friend\",\n        \"began\",\n        \"idea\",\n        \"fish\",\n        \"mountain\",\n        \"stop\",\n        \"once\",\n        \"base\",\n        \"hear\",\n        \"horse\",\n        \"cut\",\n        \"sure\",\n        \"watch\",\n        \"color\",\n        \"face\",\n        \"wood\",\n        \"main\",\n        \"enough\",\n        \"plain\",\n        \"girl\",\n        \"usual\",\n        
\"young\",\n        \"ready\",\n        \"above\",\n        \"ever\",\n        \"red\",\n        \"list\",\n        \"though\",\n        \"feel\",\n        \"talk\",\n        \"bird\",\n        \"soon\",\n        \"body\",\n        \"dog\",\n        \"family\",\n        \"direct\",\n        \"pose\",\n        \"leave\",\n        \"song\",\n        \"measure\",\n        \"door\",\n        \"product\",\n        \"black\",\n        \"short\",\n        \"numeral\",\n        \"class\",\n        \"wind\",\n        \"question\",\n        \"happen\",\n        \"complete\",\n        \"ship\",\n        \"area\",\n        \"half\",\n        \"rock\",\n        \"order\",\n        \"fire\",\n        \"south\",\n        \"problem\",\n        \"piece\",\n        \"told\",\n        \"knew\",\n        \"pass\",\n        \"since\",\n        \"top\",\n        \"whole\",\n        \"king\",\n        \"space\",\n        \"heard\",\n        \"best\",\n        \"hour\",\n        \"better\",\n        \"true\",\n        \"during\",\n        \"hundred\",\n        \"five\",\n        \"remember\",\n        \"step\",\n        \"early\",\n        \"hold\",\n        \"west\",\n        \"ground\",\n        \"interest\",\n        \"reach\",\n        \"fast\",\n        \"verb\",\n        \"sing\",\n        \"listen\",\n        \"six\",\n        \"table\",\n        \"travel\",\n        \"less\",\n        \"morning\",\n        \"ten\",\n        \"simple\",\n        \"several\",\n        \"vowel\",\n        \"toward\",\n        \"war\",\n        \"lay\",\n        \"against\",\n        \"pattern\",\n        \"slow\",\n        \"center\",\n        \"love\",\n        \"person\",\n        \"money\",\n        \"serve\",\n        \"appear\",\n        \"road\",\n        \"map\",\n        \"rain\",\n        \"rule\",\n        \"govern\",\n        \"pull\",\n        \"cold\",\n        \"notice\",\n        \"voice\",\n        \"unit\",\n        \"power\",\n        \"town\",\n        \"fine\",\n        
\"certain\",\n        \"fly\",\n        \"fall\",\n        \"lead\",\n        \"cry\",\n        \"dark\",\n        \"machine\",\n        \"note\",\n        \"wait\",\n        \"plan\",\n        \"figure\",\n        \"star\",\n        \"box\",\n        \"noun\",\n        \"field\",\n        \"rest\",\n        \"correct\",\n        \"able\",\n        \"pound\",\n        \"done\",\n        \"beauty\",\n        \"drive\",\n        \"stood\",\n        \"contain\",\n        \"front\",\n        \"teach\",\n        \"week\",\n        \"final\",\n        \"gave\",\n        \"green\",\n        \"oh\",\n        \"quick\",\n        \"develop\",\n        \"ocean\",\n        \"warm\",\n        \"free\",\n        \"minute\",\n        \"strong\",\n        \"special\",\n        \"mind\",\n        \"behind\",\n        \"clear\",\n        \"tail\",\n        \"produce\",\n        \"fact\",\n        \"street\",\n        \"inch\",\n        \"multiply\",\n        \"nothing\",\n        \"course\",\n        \"stay\",\n        \"wheel\",\n        \"full\",\n        \"force\",\n        \"blue\",\n        \"object\",\n        \"decide\",\n        \"surface\",\n        \"deep\",\n        \"moon\",\n        \"island\",\n        \"foot\",\n        \"system\",\n        \"busy\",\n        \"test\",\n        \"record\",\n        \"boat\",\n        \"common\",\n        \"gold\",\n        \"possible\",\n        \"plane\",\n        \"stead\",\n        \"dry\",\n        \"wonder\",\n        \"laugh\",\n        \"thousand\",\n        \"ago\",\n        \"ran\",\n        \"check\",\n        \"game\",\n        \"shape\",\n        \"equate\",\n        \"hot\",\n        \"miss\",\n        \"brought\",\n        \"heat\",\n        \"snow\",\n        \"tire\",\n        \"bring\",\n        \"yes\",\n        \"distant\",\n        \"fill\",\n        \"east\",\n        \"paint\",\n        \"language\",\n        \"among\",\n        \"grand\",\n        \"ball\",\n        \"yet\",\n        \"wave\",\n        
\"drop\",\n        \"heart\",\n        \"am\",\n        \"present\",\n        \"heavy\",\n        \"dance\",\n        \"engine\",\n        \"position\",\n        \"arm\",\n        \"wide\",\n        \"sail\",\n        \"material\",\n        \"size\",\n        \"vary\",\n        \"settle\",\n        \"speak\",\n        \"weight\",\n        \"general\",\n        \"ice\",\n        \"matter\",\n        \"circle\",\n        \"pair\",\n        \"include\",\n        \"divide\",\n        \"syllable\",\n        \"felt\",\n        \"perhaps\",\n        \"pick\",\n        \"sudden\",\n        \"count\",\n        \"square\",\n        \"reason\",\n        \"length\",\n        \"represent\",\n        \"art\",\n        \"subject\",\n        \"region\",\n        \"energy\",\n        \"hunt\",\n        \"probable\",\n        \"bed\",\n        \"brother\",\n        \"egg\",\n        \"ride\",\n        \"cell\",\n        \"believe\",\n        \"fraction\",\n        \"forest\",\n        \"sit\",\n        \"race\",\n        \"window\",\n        \"store\",\n        \"summer\",\n        \"train\",\n        \"sleep\",\n        \"prove\",\n        \"lone\",\n        \"leg\",\n        \"exercise\",\n        \"wall\",\n        \"catch\",\n        \"mount\",\n        \"wish\",\n        \"sky\",\n        \"board\",\n        \"joy\",\n        \"winter\",\n        \"sat\",\n        \"written\",\n        \"wild\",\n        \"instrument\",\n        \"kept\",\n        \"glass\",\n        \"grass\",\n        \"cow\",\n        \"job\",\n        \"edge\",\n        \"sign\",\n        \"visit\",\n        \"past\",\n        \"soft\",\n        \"fun\",\n        \"bright\",\n        \"gas\",\n        \"weather\",\n        \"month\",\n        \"million\",\n        \"bear\",\n        \"finish\",\n        \"happy\",\n        \"hope\",\n        \"flower\",\n        \"clothe\",\n        \"strange\",\n        \"gone\",\n        \"jump\",\n        \"baby\",\n        \"eight\",\n        \"village\",\n        
\"meet\",\n        \"root\",\n        \"buy\",\n        \"raise\",\n        \"solve\",\n        \"metal\",\n        \"whether\",\n        \"push\",\n        \"seven\",\n        \"paragraph\",\n        \"third\",\n        \"shall\",\n        \"held\",\n        \"hair\",\n        \"describe\",\n        \"cook\",\n        \"floor\",\n        \"either\",\n        \"result\",\n        \"burn\",\n        \"hill\",\n        \"safe\",\n        \"cat\",\n        \"century\",\n        \"consider\",\n        \"type\",\n        \"law\",\n        \"bit\",\n        \"coast\",\n        \"copy\",\n        \"phrase\",\n        \"silent\",\n        \"tall\",\n        \"sand\",\n        \"soil\",\n        \"roll\",\n        \"temperature\",\n        \"finger\",\n        \"industry\",\n        \"value\",\n        \"fight\",\n        \"lie\",\n        \"beat\",\n        \"excite\",\n        \"natural\",\n        \"view\",\n        \"sense\",\n        \"ear\",\n        \"else\",\n        \"quite\",\n        \"broke\",\n        \"case\",\n        \"middle\",\n        \"kill\",\n        \"son\",\n        \"lake\",\n        \"moment\",\n        \"scale\",\n        \"loud\",\n        \"spring\",\n        \"observe\",\n        \"child\",\n        \"straight\",\n        \"consonant\",\n        \"nation\",\n        \"dictionary\",\n        \"milk\",\n        \"speed\",\n        \"method\",\n        \"organ\",\n        \"pay\",\n        \"age\",\n        \"section\",\n        \"dress\",\n        \"cloud\",\n        \"surprise\",\n        \"quiet\",\n        \"stone\",\n        \"tiny\",\n        \"climb\",\n        \"cool\",\n        \"design\",\n        \"poor\",\n        \"lot\",\n        \"experiment\",\n        \"bottom\",\n        \"key\",\n        \"iron\",\n        \"single\",\n        \"stick\",\n        \"flat\",\n        \"twenty\",\n        \"skin\",\n        \"smile\",\n        \"crease\",\n        \"hole\",\n        \"trade\",\n        \"melody\",\n        \"trip\",\n        
\"office\",\n        \"receive\",\n        \"row\",\n        \"mouth\",\n        \"exact\",\n        \"symbol\",\n        \"die\",\n        \"least\",\n        \"trouble\",\n        \"shout\",\n        \"except\",\n        \"wrote\",\n        \"seed\",\n        \"tone\",\n        \"join\",\n        \"suggest\",\n        \"clean\",\n        \"break\",\n        \"lady\",\n        \"yard\",\n        \"rise\",\n        \"bad\",\n        \"blow\",\n        \"oil\",\n        \"blood\",\n        \"touch\",\n        \"grew\",\n        \"cent\",\n        \"mix\",\n        \"team\",\n        \"wire\",\n        \"cost\",\n        \"lost\",\n        \"brown\",\n        \"wear\",\n        \"garden\",\n        \"equal\",\n        \"sent\",\n        \"choose\",\n        \"fell\",\n        \"fit\",\n        \"flow\",\n        \"fair\",\n        \"bank\",\n        \"collect\",\n        \"save\",\n        \"control\",\n        \"decimal\",\n        \"gentle\",\n        \"woman\",\n        \"captain\",\n        \"practice\",\n        \"separate\",\n        \"difficult\",\n        \"doctor\",\n        \"please\",\n        \"protect\",\n        \"noon\",\n        \"whose\",\n        \"locate\",\n        \"ring\",\n        \"character\",\n        \"insect\",\n        \"caught\",\n        \"period\",\n        \"indicate\",\n        \"radio\",\n        \"spoke\",\n        \"atom\",\n        \"human\",\n        \"history\",\n        \"effect\",\n        \"electric\",\n        \"expect\",\n        \"crop\",\n        \"modern\",\n        \"element\",\n        \"hit\",\n        \"student\",\n        \"corner\",\n        \"party\",\n        \"supply\",\n        \"bone\",\n        \"rail\",\n        \"imagine\",\n        \"provide\",\n        \"agree\",\n        \"thus\",\n        \"capital\",\n        \"won't\",\n        \"chair\",\n        \"danger\",\n        \"fruit\",\n        \"rich\",\n        \"thick\",\n        \"soldier\",\n        \"process\",\n        \"operate\",\n        
\"guess\",\n        \"necessary\",\n        \"sharp\",\n        \"wing\",\n        \"create\",\n        \"neighbor\",\n        \"wash\",\n        \"bat\",\n        \"rather\",\n        \"crowd\",\n        \"corn\",\n        \"compare\",\n        \"poem\",\n        \"string\",\n        \"bell\",\n        \"depend\",\n        \"meat\",\n        \"rub\",\n        \"tube\",\n        \"famous\",\n        \"dollar\",\n        \"stream\",\n        \"fear\",\n        \"sight\",\n        \"thin\",\n        \"triangle\",\n        \"planet\",\n        \"hurry\",\n        \"chief\",\n        \"colony\",\n        \"clock\",\n        \"mine\",\n        \"tie\",\n        \"enter\",\n        \"major\",\n        \"fresh\",\n        \"search\",\n        \"send\",\n        \"yellow\",\n        \"gun\",\n        \"allow\",\n        \"print\",\n        \"dead\",\n        \"spot\",\n        \"desert\",\n        \"suit\",\n        \"current\",\n        \"lift\",\n        \"rose\",\n        \"continue\",\n        \"block\",\n        \"chart\",\n        \"hat\",\n        \"sell\",\n        \"success\",\n        \"company\",\n        \"subtract\",\n        \"event\",\n        \"particular\",\n        \"deal\",\n        \"swim\",\n        \"term\",\n        \"opposite\",\n        \"wife\",\n        \"shoe\",\n        \"shoulder\",\n        \"spread\",\n        \"arrange\",\n        \"camp\",\n        \"invent\",\n        \"cotton\",\n        \"born\",\n        \"determine\",\n        \"quart\",\n        \"nine\",\n        \"truck\",\n        \"noise\",\n        \"level\",\n        \"chance\",\n        \"gather\",\n        \"shop\",\n        \"stretch\",\n        \"throw\",\n        \"shine\",\n        \"property\",\n        \"column\",\n        \"molecule\",\n        \"select\",\n        \"wrong\",\n        \"gray\",\n        \"repeat\",\n        \"require\",\n        \"broad\",\n        \"prepare\",\n        \"salt\",\n        \"nose\",\n        \"plural\",\n        \"anger\",\n        
\"claim\",\n        \"continent\",\n        \"oxygen\",\n        \"sugar\",\n        \"death\",\n        \"pretty\",\n        \"skill\",\n        \"women\",\n        \"season\",\n        \"solution\",\n        \"magnet\",\n        \"silver\",\n        \"thank\",\n        \"branch\",\n        \"match\",\n        \"suffix\",\n        \"especially\",\n        \"fig\",\n        \"afraid\",\n        \"huge\",\n        \"sister\",\n        \"steel\",\n        \"discuss\",\n        \"forward\",\n        \"similar\",\n        \"guide\",\n        \"experience\",\n        \"score\",\n        \"apple\",\n        \"bought\",\n        \"led\",\n        \"pitch\",\n        \"coat\",\n        \"mass\",\n        \"card\",\n        \"band\",\n        \"rope\",\n        \"slip\",\n        \"win\",\n        \"dream\",\n        \"evening\",\n        \"condition\",\n        \"feed\",\n        \"tool\",\n        \"total\",\n        \"basic\",\n        \"smell\",\n        \"valley\",\n        \"nor\",\n        \"double\",\n        \"seat\",\n        \"arrive\",\n        \"master\",\n        \"track\",\n        \"parent\",\n        \"shore\",\n        \"division\",\n        \"sheet\",\n        \"substance\",\n        \"favor\",\n        \"connect\",\n        \"post\",\n        \"spend\",\n        \"chord\",\n        \"fat\",\n        \"glad\",\n        \"original\",\n        \"share\",\n        \"station\",\n        \"dad\",\n        \"bread\",\n        \"charge\",\n        \"proper\",\n        \"bar\",\n        \"offer\",\n        \"segment\",\n        \"slave\",\n        \"duck\",\n        \"instant\",\n        \"market\",\n        \"degree\",\n        \"populate\",\n        \"chick\",\n        \"dear\",\n        \"enemy\",\n        \"reply\",\n        \"drink\",\n        \"occur\",\n        \"support\",\n        \"speech\",\n        \"nature\",\n        \"range\",\n        \"steam\",\n        \"motion\",\n        \"path\",\n        \"liquid\",\n        \"log\",\n        \"meant\",\n    
    \"quotient\",\n        \"teeth\",\n        \"shell\",\n        \"neck\",\n    ],\n)  # https://gist.github.com/deekayen/4148741\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/natural_language/count_string.py",
    "content": "import re\n\nimport numpy as np\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import IntegerNullable, NaturalLanguage\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass CountString(TransformPrimitive):\n    \"\"\"Determines how many times a given string shows up in a text field.\n\n    Args:\n        string (str): The string to determine the count of. Defaults to\n            the word \"the\".\n        ignore_case (bool): Determines if case of the string should be\n            considered or not. Defaults to true.\n        ignore_non_alphanumeric (bool): Determines if non-alphanumeric\n            characters should be used in the search. Defaults to False.\n        is_regex (bool): Defines if the string argument is a regex or not.\n            Defaults to False.\n        match_whole_words_only (bool): Determines if whole words should be\n            matched or not. For example searching for word `the` against\n            `then, the, there` should only return `the` if this argument\n            was True. Defaults to False.\n    Examples:\n        >>> count_string = CountString(string=\"the\")\n        >>> count_string([\"The problem was difficult.\",\n        ...               \"He was there.\",\n        ...               \"The girl went to the store.\"]).tolist()\n        [1.0, 1.0, 2.0]\n        >>> # Match case of string\n        >>> count_string_ignore_case = CountString(string=\"the\", ignore_case=False)\n        >>> count_string_ignore_case([\"The problem was difficult.\",\n        ...                           \"He was there.\",\n        ...                           \"The girl went to the store.\"]).tolist()\n        [0.0, 1.0, 1.0]\n        >>> # Ignore non-alphanumeric characters in the search\n        >>> count_string_ignore_non_alphanumeric = CountString(string=\"the\",\n        ...                                                    
ignore_non_alphanumeric=True)\n        >>> count_string_ignore_non_alphanumeric([\"Th*/e problem was difficult.\",\n        ...                                       \"He was there.\",\n        ...                                       \"The girl went to the store.\"]).tolist()\n        [1.0, 1.0, 2.0]\n        >>> # Specify the string as a regex\n        >>> count_string_is_regex = CountString(string=\"t.e\", is_regex=True)\n        >>> count_string_is_regex([\"The problem was difficult.\",\n        ...                        \"He was there.\",\n        ...                        \"The girl went to the store.\"]).tolist()\n        [1.0, 1.0, 2.0]\n        >>> # Match whole words only\n        >>> count_string_match_whole_words_only = CountString(string=\"the\",\n        ...                                                   match_whole_words_only=True)\n        >>> count_string_match_whole_words_only([\"The problem was difficult.\",\n        ...                                      \"He was there.\",\n        ...                                      \"The girl went to the store.\"]).tolist()\n        [1.0, 0.0, 2.0]\n    \"\"\"\n\n    name = \"count_string\"\n    input_types = [ColumnSchema(logical_type=NaturalLanguage)]\n    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={\"numeric\"})\n\n    def __init__(\n        self,\n        string=\"the\",\n        ignore_case=True,\n        ignore_non_alphanumeric=False,\n        is_regex=False,\n        match_whole_words_only=False,\n    ):\n        self.string = string\n        self.ignore_case = ignore_case\n        self.ignore_non_alphanumeric = ignore_non_alphanumeric\n        self.match_whole_words_only = match_whole_words_only\n        self.is_regex = is_regex\n\n        # we don't want to strip non alphanumeric characters from the pattern\n        # ie h.ll. 
should match \"hello\" so we can't strip the dots to make hll\n        if not is_regex:\n            self.pattern = re.escape(self.process_text(string))\n        else:\n            self.pattern = string\n            if ignore_case:\n                self.pattern = self.pattern.lower()\n\n        # \\b\\b.*\\b\\b is the same as \\b.*\\b so we don't have to check if\n        # the pattern is given to us as regex and if it already has leading\n        # and trailing \\b's\n        if match_whole_words_only:\n            self.pattern = \"\\\\b\" + self.pattern + \"\\\\b\"\n\n    def process_text(self, text):\n        if self.ignore_non_alphanumeric:\n            text = re.sub(\"[^0-9a-zA-Z ]+\", \"\", text)\n        if self.ignore_case:\n            text = text.lower()\n        return text\n\n    def get_function(self):\n        def count_string(words):\n            if not isinstance(words, str):\n                return np.nan\n            words = self.process_text(words)\n            return len(re.findall(self.pattern, words))\n\n        return np.vectorize(count_string, otypes=[float])\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/natural_language/mean_characters_per_word.py",
    "content": "# -*- coding: utf-8 -*-\n\nimport re\n\nimport numpy as np\nimport pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Double, NaturalLanguage\n\nfrom featuretools.primitives.base import TransformPrimitive\n\nPUNCTUATION = re.escape(\"!,.:;?\")\nEND_OF_SENTENCE_PUNCT_RE = re.compile(\n    rf\"[{PUNCTUATION}]+$|[{PUNCTUATION}]+ |[{PUNCTUATION}]+\\n\",\n)\n\n\ndef _mean_characters_per_word(value):\n    if pd.isna(value):\n        return np.nan\n\n    # replace end-of-sentence punctuation with space\n    value = END_OF_SENTENCE_PUNCT_RE.sub(\" \", value)\n    words = value.split()\n    character_count = [len(x) for x in words]\n\n    return np.mean(character_count) if len(character_count) else 0\n\n\nclass MeanCharactersPerWord(TransformPrimitive):\n    \"\"\"Determines the mean number of characters per word.\n\n    Description:\n        Given list of strings, determine the mean number of\n        characters per word in each string. A word is defined as\n        a series of any characters not separated by white space.\n        Punctuation is removed before counting. If a string\n        is empty or `NaN`, return `NaN`.\n\n    Examples:\n        >>> x = ['This is a test file', 'This is second line', 'third line $1,000']\n        >>> mean_characters_per_word = MeanCharactersPerWord()\n        >>> mean_characters_per_word(x).tolist()\n        [3.0, 4.0, 5.0]\n    \"\"\"\n\n    name = \"mean_characters_per_word\"\n    input_types = [ColumnSchema(logical_type=NaturalLanguage)]\n    return_type = ColumnSchema(logical_type=Double, semantic_tags={\"numeric\"})\n    default_value = 0\n\n    def get_function(self):\n        def mean_characters_per_word(series):\n            return series.apply(_mean_characters_per_word)\n\n        return mean_characters_per_word\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/natural_language/median_word_length.py",
    "content": "from numpy import median\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Double, NaturalLanguage\n\nfrom featuretools.primitives.base import TransformPrimitive\nfrom featuretools.primitives.standard.transform.natural_language.constants import (\n    DELIMITERS,\n)\n\n\nclass MedianWordLength(TransformPrimitive):\n    \"\"\"Determines the median word length.\n\n    Description:\n        Given list of strings, determine the median\n        word length in each string. A word is defined as\n        a series of any characters not separated by a delimiter.\n        If a string is empty or `NaN`, return `NaN`.\n\n    Args:\n        delimiters_regex (str): Delimiters as a regex string for splitting text into words.\n            Defaults to whitespace characters.\n\n    Examples:\n        >>> x = ['This is a test file', 'This is second line', 'third line $1,000', None]\n        >>> median_word_length = MedianWordLength()\n        >>> median_word_length(x).tolist()\n        [4.0, 4.0, 5.0, nan]\n    \"\"\"\n\n    name = \"median_word_length\"\n    input_types = [ColumnSchema(logical_type=NaturalLanguage)]\n    return_type = ColumnSchema(logical_type=Double, semantic_tags={\"numeric\"})\n\n    default_value = 0\n\n    def __init__(self, delimiters_regex=DELIMITERS):\n        self.delimiters_regex = delimiters_regex\n\n    def get_function(self):\n        def get_median(words):\n            if isinstance(words, list):\n                return median([len(word) for word in words if len(word) != 0])\n\n        def median_word_length(x):\n            words = x.str.split(self.delimiters_regex)\n            return words.apply(get_median)\n\n        return median_word_length\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/natural_language/num_characters.py",
    "content": "import pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import IntegerNullable, NaturalLanguage\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass NumCharacters(TransformPrimitive):\n    \"\"\"Calculates the number of characters in a given string, including whitespace and punctuation.\n\n    Description:\n        Returns the number of characters in a string. This is equivalent to the length of a string.\n\n    Examples:\n        >>> num_characters = NumCharacters()\n        >>> num_characters(['This is a string',\n        ...                 'second item',\n        ...                 'final1']).tolist()\n        [16, 11, 6]\n    \"\"\"\n\n    name = \"num_characters\"\n    input_types = [ColumnSchema(logical_type=NaturalLanguage)]\n    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={\"numeric\"})\n\n    description_template = \"the number of characters in {}\"\n\n    def get_function(self):\n        def character_counter(array):\n            def _get_num_characters(elem):\n                \"\"\"Returns the length of elem, or pd.NA given null input\"\"\"\n                if pd.isna(elem):\n                    return pd.NA\n                return len(elem)\n\n            return array.apply(_get_num_characters)\n\n        return character_counter\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/natural_language/num_unique_separators.py",
    "content": "import pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import IntegerNullable, NaturalLanguage\n\nfrom featuretools.primitives.base import TransformPrimitive\n\nNATURAL_LANGUAGE_SEPARATORS = [\" \", \".\", \",\", \"!\", \"?\", \";\", \"\\n\"]\n\n\nclass NumUniqueSeparators(TransformPrimitive):\n    r\"\"\"Calculates the number of unique separators.\n\n    Description:\n        Given a string and a list of separators, determine\n        the number of unique separators in each string. If a string\n        is null determined by pd.isnull return pd.NA.\n\n    Args:\n        separators (list, optional): a list of separator characters to count.\n            ``[\" \", \".\", \",\", \"!\", \"?\", \";\", \"\\n\"]`` is used by default.\n\n    Examples:\n        >>> x = [\"First. Line.\", \"This. is the second, line!\", \"notinlist@#$%^%&\"]\n        >>> num_unique_separators = NumUniqueSeparators([\".\", \",\", \"!\"])\n        >>> num_unique_separators(x).tolist()\n        [1, 3, 0]\n    \"\"\"\n\n    name = \"num_unique_separators\"\n    input_types = [ColumnSchema(logical_type=NaturalLanguage)]\n    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={\"numeric\"})\n\n    def __init__(self, separators=NATURAL_LANGUAGE_SEPARATORS):\n        assert separators is not None, \"separators needs to be defined\"\n        self.separators = separators\n\n    def get_function(self):\n        def count_unique_separator(s):\n            if pd.isnull(s):\n                return pd.NA\n            return len(set(self.separators).intersection(set(s)))\n\n        def get_separator_count(column):\n            return column.apply(count_unique_separator)\n\n        return get_separator_count\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/natural_language/num_words.py",
    "content": "import re\nfrom string import punctuation\nfrom typing import Optional\n\nimport pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import IntegerNullable, NaturalLanguage\n\nfrom featuretools.primitives.base import TransformPrimitive\nfrom featuretools.primitives.standard.transform.natural_language.constants import (\n    DELIMITERS,\n)\n\n\nclass NumWords(TransformPrimitive):\n    \"\"\"Determines the number of words in a string. Words are sequences of characters\n    delimited by whitespace.\n\n    Examples:\n        >>> num_words = NumWords()\n        >>> num_words(['This is a string',\n        ...            'Two words',\n        ...            'no-spaces',\n        ...            'Also works with sentences. Second sentence!']).tolist()\n        [4, 2, 1, 6]\n    \"\"\"\n\n    name = \"num_words\"\n    input_types = [ColumnSchema(logical_type=NaturalLanguage)]\n    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={\"numeric\"})\n\n    description_template = \"the number of words in {}\"\n\n    def get_function(self):\n        def word_counter(array):\n            def _get_number_of_words(elem: Optional[str]):\n                \"\"\"Returns the number of words in given element,\n                or pd.NA given null input\"\"\"\n                if pd.isna(elem):\n                    return pd.NA\n                return sum(\n                    1 for word in re.split(DELIMITERS, elem) if word.strip(punctuation)\n                )\n\n            return array.apply(_get_number_of_words)\n\n        return word_counter\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/natural_language/number_of_common_words.py",
    "content": "from string import punctuation\nfrom typing import Iterable\n\nimport pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import IntegerNullable, NaturalLanguage\n\nfrom featuretools.primitives.base import TransformPrimitive\nfrom featuretools.primitives.standard.transform.natural_language.constants import (\n    DELIMITERS,\n    common_words_1000,\n)\n\n\nclass NumberOfCommonWords(TransformPrimitive):\n    \"\"\"Determines the number of common words in a string.\n\n    Description:\n        Given string, determine the number of words that appear in a supplied word set.\n        The word set defaults to nlp_primitives.constants.common_words_1000. The string\n        is case insensitive. The word bank should consist of only lower case strings. If a string is\n        missing, return `NaN`.\n\n    Args:\n        word_set (set, optional): The set of words to look for in the string. These\n            words should all be lower case strings.\n        delimiters_regex (str, optional): The regular expression used to determine\n            what separates words. Defaults to whitespace characters.\n\n    Examples:\n        >>> x = ['Hey! This is some natural language', 'bacon, cheesburger, AND, fries', 'I! Am. A; duck?']\n        >>> number_of_common_words = NumberOfCommonWords(word_set={'and', 'some', 'am', 'a', 'the', 'is', 'i'})\n        >>> number_of_common_words(x).tolist()\n        [2, 1, 3]\n\n        >>> x = ['Hey! This is. some. 
natural language']\n        >>> number_of_common_words = NumberOfCommonWords(word_set={'hey', 'is', 'some'}, delimiters_regex=\"[ .]\")\n        >>> number_of_common_words(x).tolist()\n        [3]\n    \"\"\"\n\n    name = \"number_of_common_words\"\n    input_types = [ColumnSchema(logical_type=NaturalLanguage)]\n    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={\"numeric\"})\n\n    default_value = 0\n\n    def __init__(\n        self,\n        word_set=common_words_1000,\n        delimiters_regex=DELIMITERS,\n    ):\n        self.delimiters_regex = delimiters_regex\n        self.word_set = word_set\n\n    def get_function(self):\n        def get_num_in_word_bank(words):\n            if not isinstance(words, Iterable):\n                return pd.NA\n            num_common_words = 0\n            for w in words:\n                if (\n                    w.lower().strip(punctuation) in self.word_set\n                ):  # assumes word_set is all lowercase\n                    num_common_words += 1\n            return num_common_words\n\n        def num_common_words(x):\n            words = x.str.split(self.delimiters_regex)\n            return words.apply(get_num_in_word_bank)\n\n        return num_common_words\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/natural_language/number_of_hashtags.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import IntegerNullable, NaturalLanguage\n\nfrom featuretools.primitives.standard.transform.natural_language.count_string import (\n    CountString,\n)\n\n\nclass NumberOfHashtags(CountString):\n    \"\"\"Determines the number of hashtags in a string.\n\n    Description:\n        Given a list of strings, determine the number of hashtags\n        in each string.\n\n        A hashtag is defined as a string that meets the following criteria:\n            - Starts with a '#' character, followed by a sequence of alphanumeric characters containing at least one alphabetic character\n            - Present at the start of a string or after whitespace\n            - Terminated by the end of the string, a whitespace, or a punctuation character other than '#'\n                - e.g. The string '#yes-no' contains a valid hashtag ('#yes')\n                - e.g. The string '#yes#' does not contain a valid hashtag\n\n        This implementation handles Unicode characters.\n\n        This implementation does not impose any character limit on hashtags.\n\n        If a string is missing, return `NaN`.\n\n    Examples:\n        >>> x = ['#regular #expression', 'this is a string', '###__regular#1and_0#expression']\n        >>> number_of_hashtags = NumberOfHashtags()\n        >>> number_of_hashtags(x).tolist()\n        [2.0, 0.0, 0.0]\n    \"\"\"\n\n    name = \"number_of_hashtags\"\n    input_types = [ColumnSchema(logical_type=NaturalLanguage)]\n    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={\"numeric\"})\n    default_value = 0\n\n    def __init__(self):\n        pattern = r\"((^#)|\\s#)(\\w*([^\\W\\d])+\\w*)(?![#\\w])\"\n        super().__init__(string=pattern, is_regex=True, ignore_case=False)\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/natural_language/number_of_mentions.py",
    "content": "import re\nimport string\n\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import IntegerNullable, NaturalLanguage\n\nfrom featuretools.primitives.standard.transform.natural_language.count_string import (\n    CountString,\n)\n\n\nclass NumberOfMentions(CountString):\n    \"\"\"Determines the number of mentions in a string.\n\n    Description:\n        Given a list of strings, determine the number of mentions\n        in each string.\n\n        A mention is defined as a string that meets the following criteria:\n            - Starts with a '@' character, followed by a sequence of alphanumeric characters\n            - Present at the start of a string or after whitespace\n            - Terminated by the end of the string, a whitespace, or a punctuation character other than '@'\n                - e.g. The string '@yes-no' contains a valid mention ('@yes')\n                - e.g. The string '@yes@' does not contain a valid mention\n\n        This implementation handles Unicode characters.\n\n        This implementation does not impose any character limit on mentions.\n\n        If a string is missing, return `NaN`.\n\n    Examples:\n         >>> x = ['@user1 @user2', 'this is a string', '@@@__user1@1and_0@expression']\n        >>> number_of_mentions = NumberOfMentions()\n        >>> number_of_mentions(x).tolist()\n        [2.0, 0.0, 0.0]\n    \"\"\"\n\n    name = \"number_of_mentions\"\n    input_types = [ColumnSchema(logical_type=NaturalLanguage)]\n    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={\"numeric\"})\n    default_value = 0\n\n    def __init__(self):\n        SPECIALS_MINUS_AT = \"\".join(list(set(string.punctuation) - {\"@\"}))\n        SPECIALS_MINUS_AT = re.escape(SPECIALS_MINUS_AT)\n        pattern = rf\"((^@)|(\\s+@))(\\w+)(?=\\s|$|[{SPECIALS_MINUS_AT}])\"\n        super().__init__(string=pattern, is_regex=True, ignore_case=False)\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/natural_language/number_of_unique_words.py",
    "content": "from string import punctuation\nfrom typing import Iterable\n\nimport pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import IntegerNullable, NaturalLanguage\n\nfrom featuretools.primitives.base import TransformPrimitive\nfrom featuretools.primitives.standard.transform.natural_language.constants import (\n    DELIMITERS,\n)\n\n\nclass NumberOfUniqueWords(TransformPrimitive):\n    \"\"\"Determines the number of unique words in a string.\n\n    Description:\n        Determines the number of unique words in a given string. Includes options for\n        case-insensitive behavior.\n\n    Args:\n        case_insensitive (bool, optional): Specify case_insensitivity when searching for unique words.\n        For example, setting this to True would mean \"WORD word\" would be treated as having\n        one unique word. Defaults to False.\n\n    Examples:\n        >>> x = ['Word word Word', 'This is a SENTENCE.', 'green red green']\n        >>> number_of_unique_words = NumberOfUniqueWords()\n        >>> number_of_unique_words(x).tolist()\n        [2, 4, 2]\n\n        >>> x = ['word WoRD WORD worD', 'dog dog dog', 'catt CAT caT']\n        >>> number_of_unique_words = NumberOfUniqueWords(case_insensitive=True)\n        >>> number_of_unique_words(x).tolist()\n        [1, 1, 2]\n    \"\"\"\n\n    name = \"number_of_unique_words\"\n    input_types = [ColumnSchema(logical_type=NaturalLanguage)]\n    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={\"numeric\"})\n\n    default_value = 0\n\n    def __init__(self, case_insensitive=False):\n        self.case_insensitive = case_insensitive\n\n    def get_function(self):\n        def _unique_word_helper(text):\n            if not isinstance(text, Iterable):\n                return pd.NA\n            unique = set()\n            for t in text:\n                punct_less = t.strip(punctuation)\n                if len(punct_less) > 0:\n                    
unique.add(punct_less)\n            return len(unique)\n\n        def num_unique_words(array):\n            if self.case_insensitive:\n                array = array.str.lower()\n            array = array.str.split(f\"{DELIMITERS}\")\n            return array.apply(_unique_word_helper)\n\n        return num_unique_words\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/natural_language/number_of_words_in_quotes.py",
    "content": "import re\nfrom string import punctuation\n\nimport pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import IntegerNullable, NaturalLanguage\n\nfrom featuretools.primitives.base import TransformPrimitive\nfrom featuretools.primitives.standard.transform.natural_language.constants import (\n    DELIMITERS,\n)\n\n\nclass NumberOfWordsInQuotes(TransformPrimitive):\n    \"\"\"Determines the number of words in quotes in a string.\n\n    Description:\n        Given a list of strings, determine the number of words in quotes\n        in each string.\n\n        This implementation handles Unicode characters.\n\n        If a string is missing, return `NaN`.\n\n    Args:\n        quote_type (str, optional): Specifies what type of quotation marks to match.\n        Argument \"single\" matches on only single quotes (' ').\n        Argument \"double\" matches words between double quotes (\" \").\n        Argument \"both\" matches words between either type of quotes.\n        Defaults to \"both\".\n\n    Examples:\n         >>> x = ['\"python\" java prolog \"Diffie-Hellman\" \"4.99\"', \"Reach me at 'user@email.com'\", \"'Here's an interesting example!'\"]\n        >>> number_of_words_in_quotes = NumberOfWordsInQuotes()\n        >>> number_of_words_in_quotes(x).tolist()\n        [3, 1, 4]\n    \"\"\"\n\n    name = \"number_of_words_in_quotes\"\n    input_types = [ColumnSchema(logical_type=NaturalLanguage)]\n    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={\"numeric\"})\n    default_value = 0\n\n    def __init__(self, quote_type=\"both\"):\n        if quote_type not in [\"both\", \"single\", \"double\"]:\n            raise ValueError(\n                f\"{quote_type} is not a valid quote_type. 
Specify 'both', 'single', or 'double'\",\n            )\n        self.quote_type = quote_type\n        IN_DOUBLE_QUOTES = r'((^|\\W)\"(.)*?\"(?!\\w))'\n        IN_SINGLE_QUOTES = r\"((^|\\W)'(.)*?'(?!\\w))\"\n        if quote_type == \"double\":\n            self.regex = IN_DOUBLE_QUOTES\n        elif quote_type == \"single\":\n            self.regex = IN_SINGLE_QUOTES\n        else:\n            self.regex = f\"({IN_SINGLE_QUOTES}|{IN_DOUBLE_QUOTES})\"\n\n    def get_function(self):\n        def count_words_in_quotes(text):\n            if pd.isnull(text):\n                return pd.NA\n            matches = re.findall(self.regex, text, re.DOTALL)\n            count = 0\n            for match in matches:\n                matched_phrase = match[0]\n                words = re.split(f\"{DELIMITERS}\", matched_phrase)\n                for word in words:\n                    if len(word.strip(punctuation + \" \")):\n                        count += 1\n            return count\n\n        def num_words_in_quotes(array):\n            return array.apply(count_words_in_quotes).astype(\"Int64\")\n\n        return num_words_in_quotes\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/natural_language/punctuation_count.py",
    "content": "# -*- coding: utf-8 -*-\n\nimport re\nimport string\n\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import IntegerNullable, NaturalLanguage\n\nfrom featuretools.primitives.standard.transform.natural_language.count_string import (\n    CountString,\n)\n\n\nclass PunctuationCount(CountString):\n    \"\"\"Determines number of punctuation characters in a string.\n\n    Description:\n        Given list of strings, determine the number of punctuation\n        characters in each string. Looks for any of the following:\n\n        !\"#$%&\\'()*+,-./:;<=>?@[\\\\]^_`{|}~\n\n        If a string is missing, return `NaN`.\n\n    Examples:\n        >>> x = ['This is a test file.', 'This is second line', 'third line: $1,000']\n        >>> punctuation_count = PunctuationCount()\n        >>> punctuation_count(x).tolist()\n        [1.0, 0.0, 3.0]\n    \"\"\"\n\n    name = \"punctuation_count\"\n    input_types = [ColumnSchema(logical_type=NaturalLanguage)]\n    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={\"numeric\"})\n    default_value = 0\n\n    def __init__(self):\n        pattern = \"(%s)\" % \"|\".join([re.escape(x) for x in string.punctuation])\n        super().__init__(string=pattern, is_regex=True, ignore_case=False)\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/natural_language/title_word_count.py",
    "content": "# -*- coding: utf-8 -*-\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import IntegerNullable, NaturalLanguage\n\nfrom featuretools.primitives.standard.transform.natural_language.count_string import (\n    CountString,\n)\n\n\nclass TitleWordCount(CountString):\n    \"\"\"Determines the number of title words in a string.\n\n    Description:\n        Given list of strings, determine the number of title words\n        in each string. A title word is defined as any word starting\n        with a capital letter. Words at the start of a sentence will\n        be counted.\n\n        If a string is missing, return `NaN`.\n\n    Examples:\n        >>> x = ['My favorite movie is Jaws.', 'this is a string', 'AAA']\n        >>> title_word_count = TitleWordCount()\n        >>> title_word_count(x).tolist()\n        [2.0, 0.0, 1.0]\n    \"\"\"\n\n    name = \"title_word_count\"\n    input_types = [ColumnSchema(logical_type=NaturalLanguage)]\n    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={\"numeric\"})\n    default_value = 0\n\n    def __init__(self):\n        pattern = r\"([A-Z][^\\s]*)\"\n        super().__init__(string=pattern, is_regex=True, ignore_case=False)\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/natural_language/total_word_length.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import IntegerNullable, NaturalLanguage\n\nfrom featuretools.primitives.base import TransformPrimitive\nfrom featuretools.primitives.standard.transform.natural_language.constants import (\n    PUNCTUATION_AND_WHITESPACE,\n)\n\n\nclass TotalWordLength(TransformPrimitive):\n    \"\"\"Determines the total word length.\n\n    Description:\n        Given list of strings, determine the total\n        word length in each string. A word is defined as\n        a series of any characters not separated by a delimiter.\n        If a string is empty or `NaN`, return `NaN`.\n\n    Args:\n        delimiters_regex (str): Delimiters as a regex string for splitting text into words.\n            Defaults to whitespace characters.\n\n    Examples:\n        >>> x = ['This is a test file', 'This is second line', 'third line $1,000', None]\n        >>> total_word_length = TotalWordLength()\n        >>> total_word_length(x).tolist()\n        [15.0, 16.0, 13.0, nan]\n    \"\"\"\n\n    name = \"total_word_length\"\n    input_types = [ColumnSchema(logical_type=NaturalLanguage)]\n    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={\"numeric\"})\n\n    default_value = 0\n\n    def __init__(self, do_not_count=PUNCTUATION_AND_WHITESPACE):\n        self.do_not_count = do_not_count\n\n    def get_function(self):\n        def total_word_length(x):\n            return x.str.len() - x.str.count(self.do_not_count)\n\n        return total_word_length\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/natural_language/upper_case_count.py",
    "content": "# -*- coding: utf-8 -*-\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import IntegerNullable, NaturalLanguage\n\nfrom featuretools.primitives.standard.transform.natural_language.count_string import (\n    CountString,\n)\n\n\nclass UpperCaseCount(CountString):\n    \"\"\"Calculates the number of upper case letters in text.\n\n    Description:\n        Given a list of strings, determine the number of characters in each string\n        that are capitalized. Counts every letter individually, not just every\n        word that contains capitalized letters.\n\n        If a string is missing, return `NaN`\n\n    Examples:\n        >>> x = ['This IS a string.', 'This is a string', 'aaa']\n        >>> upper_case_count = UpperCaseCount()\n        >>> upper_case_count(x).tolist()\n        [3.0, 1.0, 0.0]\n    \"\"\"\n\n    name = \"upper_case_count\"\n    input_types = [ColumnSchema(logical_type=NaturalLanguage)]\n    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={\"numeric\"})\n    default_value = 0\n\n    def __init__(self):\n        pattern = r\"([A-Z])\"\n        super().__init__(string=pattern, is_regex=True, ignore_case=False)\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/natural_language/upper_case_word_count.py",
    "content": "import re\nfrom string import punctuation\n\nimport pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import IntegerNullable, NaturalLanguage\n\nfrom featuretools.primitives.base import TransformPrimitive\nfrom featuretools.primitives.standard.transform.natural_language.constants import (\n    DELIMITERS,\n)\n\n\nclass UpperCaseWordCount(TransformPrimitive):\n    \"\"\"Determines the number of words in a string that are entirely capitalized.\n\n    Description:\n        Given list of strings, determine the number of words in each string\n        that are entirely capitalized.\n\n        If a string is missing, return `NaN`.\n\n    Examples:\n        >>> x = ['This IS a string.', 'This is a string', 'AAA']\n        >>> upper_case_word_count = UpperCaseWordCount()\n        >>> upper_case_word_count(x).tolist()\n        [1, 0, 1]\n    \"\"\"\n\n    name = \"upper_case_word_count\"\n    input_types = [ColumnSchema(logical_type=NaturalLanguage)]\n    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={\"numeric\"})\n    default_value = 0\n\n    def get_function(self):\n        def upper_case_word_count(x):\n            def _count_upper_case_words(elem):\n                if pd.isna(elem):\n                    return pd.NA\n                return sum(\n                    1\n                    for word in re.split(DELIMITERS, elem)\n                    if word.strip(punctuation) and word.upper() == word\n                )\n\n            return x.apply(_count_upper_case_words)\n\n        return upper_case_word_count\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/natural_language/whitespace_count.py",
    "content": "from featuretools.primitives.standard.transform.natural_language.count_string import (\n    CountString,\n)\n\n\nclass WhitespaceCount(CountString):\n    \"\"\"Calculates number of whitespaces in a string.\n\n    Description:\n        Given a list of strings, determine the whitespaces in each string\n        If a string is missing, return `NaN`\n\n    Examples:\n        >>> x = ['', 'hi im ethan', 'multiple    spaces']\n        >>> upper_case_count = WhitespaceCount()\n        >>> upper_case_count(x).tolist()\n        [0.0, 2.0, 4.0]\n    \"\"\"\n\n    name = \"whitespace_count\"\n    default_value = 0\n\n    def __init__(self):\n        super().__init__(string=\" \")\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/not_primitive.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Boolean, BooleanNullable\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass Not(TransformPrimitive):\n    \"\"\"Negates a boolean value.\n\n    Examples:\n        >>> not_func = Not()\n        >>> not_func([True, True, False]).tolist()\n        [False, False, True]\n    \"\"\"\n\n    name = \"not\"\n    input_types = [\n        [ColumnSchema(logical_type=Boolean)],\n        [ColumnSchema(logical_type=BooleanNullable)],\n    ]\n    return_type = ColumnSchema(logical_type=BooleanNullable)\n    description_template = \"the negation of {}\"\n\n    def generate_name(self, base_feature_names):\n        return \"NOT({})\".format(base_feature_names[0])\n\n    def get_function(self):\n        return np.logical_not\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/nth_week_of_month.py",
    "content": "import numpy as np\nimport pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime, Double\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass NthWeekOfMonth(TransformPrimitive):\n    \"\"\"Determines the nth week of the month from a given date.\n\n    Description:\n        Converts a datetime to an float representing the week\n        of the month in which the date falls. The first day of\n        the month starts week 1, and the week number is incremented\n        each Sunday.\n\n    Examples:\n        >>> from datetime import datetime\n        >>> nth_week_of_month = NthWeekOfMonth()\n        >>> times = [datetime(2019, 3, 1),\n        ...          datetime(2019, 3, 3),\n        ...          datetime(2019, 3, 31),\n        ...          datetime(2019, 3, 30)]\n        >>> nth_week_of_month(times).tolist()\n        [1.0, 2.0, 6.0, 5.0]\n    \"\"\"\n\n    name = \"nth_week_of_month\"\n    input_types = [ColumnSchema(logical_type=Datetime)]\n    return_type = ColumnSchema(logical_type=Double, semantic_tags={\"numeric\"})\n\n    def get_function(self):\n        def nth_week_of_month(x):\n            df = pd.DataFrame({\"date\": x})\n            df[\"first_day\"] = df.date - pd.to_timedelta(df[\"date\"].dt.day - 1, unit=\"d\")\n            df[\"dom\"] = df.date.dt.day\n            df[\"first_day_weekday\"] = df.first_day.dt.weekday\n            df[\"adjusted_dom\"] = df.dom + df.first_day_weekday + 1\n            df.loc[df[\"first_day_weekday\"].astype(float) == 6.0, \"adjusted_dom\"] = df[\n                \"dom\"\n            ]\n            df[\"week_of_month\"] = np.ceil(df.adjusted_dom / 7.0)\n            return df.week_of_month.values\n\n        return nth_week_of_month\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/numeric/__init__.py",
    "content": "from featuretools.primitives.standard.transform.numeric.absolute import Absolute\nfrom featuretools.primitives.standard.transform.numeric.cosine import Cosine\nfrom featuretools.primitives.standard.transform.numeric.diff import Diff\nfrom featuretools.primitives.standard.transform.numeric.natural_logarithm import (\n    NaturalLogarithm,\n)\nfrom featuretools.primitives.standard.transform.numeric.negate import Negate\nfrom featuretools.primitives.standard.transform.numeric.percentile import Percentile\nfrom featuretools.primitives.standard.transform.numeric.rate_of_change import (\n    RateOfChange,\n)\nfrom featuretools.primitives.standard.transform.numeric.same_as_previous import (\n    SameAsPrevious,\n)\nfrom featuretools.primitives.standard.transform.numeric.sine import Sine\nfrom featuretools.primitives.standard.transform.numeric.square_root import SquareRoot\nfrom featuretools.primitives.standard.transform.numeric.tangent import Tangent\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/numeric/absolute.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass Absolute(TransformPrimitive):\n    \"\"\"Computes the absolute value of a number.\n\n    Examples:\n        >>> absolute = Absolute()\n        >>> absolute([3.0, -5.0, -2.4]).tolist()\n        [3.0, 5.0, 2.4]\n    \"\"\"\n\n    name = \"absolute\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n\n    description_template = \"the absolute value of {}\"\n\n    def get_function(self):\n        return np.absolute\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/numeric/cosine.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Double\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass Cosine(TransformPrimitive):\n    \"\"\"Computes the cosine of a number.\n\n    Examples:\n        >>> cos = Cosine()\n        >>> cos([0.0, np.pi/2.0, np.pi]).tolist()\n        [1.0, 6.123233995736766e-17, -1.0]\n    \"\"\"\n\n    name = \"cosine\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(logical_type=Double, semantic_tags={\"numeric\"})\n\n    description_template = \"the cosine of {}\"\n\n    def get_function(self):\n        return np.cos\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/numeric/diff.py",
    "content": "from woodwork.column_schema import ColumnSchema\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass Diff(TransformPrimitive):\n    \"\"\"Computes the difference between the value in a list and the\n    previous value in that list.\n\n    Args:\n        periods (int): The number of periods by which to shift the index row.\n            Default is 0. Periods correspond to rows.\n\n    Description:\n        Given a list of values, compute the difference from the previous\n        item in the list. The result for the first element of the list will\n        always be `NaN`.\n\n    Examples:\n        >>> diff = Diff()\n        >>> values = [1, 10, 3, 4, 15]\n        >>> diff(values).tolist()\n        [nan, 9.0, -7.0, 1.0, 11.0]\n\n        You can specify the number of periods to shift the values\n\n        >>> values = [1, 2, 4, 7, 11, 16]\n        >>> diff_periods = Diff(periods = 1)\n        >>> diff_periods(values).tolist()\n        [nan, nan, 1.0, 2.0, 3.0, 4.0]\n    \"\"\"\n\n    name = \"diff\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n    uses_full_dataframe = True\n    description_template = \"the difference from the previous value of {}\"\n\n    def __init__(self, periods=0):\n        self.periods = periods\n\n    def get_function(self):\n        def pd_diff(values):\n            return values.shift(self.periods).diff()\n\n        return pd_diff\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/numeric/natural_logarithm.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Double\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass NaturalLogarithm(TransformPrimitive):\n    \"\"\"Computes the natural logarithm of a number.\n\n    Examples:\n        >>> log = NaturalLogarithm()\n        >>> results = log([1.0, np.e]).tolist()\n        >>> results = [round(x, 2) for x in results]\n        >>> results\n        [0.0, 1.0]\n    \"\"\"\n\n    name = \"natural_logarithm\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(logical_type=Double, semantic_tags={\"numeric\"})\n\n    description_template = \"the natural logarithm of {}\"\n\n    def get_function(self):\n        return np.log\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/numeric/negate.py",
    "content": "from woodwork.column_schema import ColumnSchema\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass Negate(TransformPrimitive):\n    \"\"\"Negates a numeric value.\n\n    Examples:\n        >>> negate = Negate()\n        >>> negate([1.0, 23.2, -7.0]).tolist()\n        [-1.0, -23.2, 7.0]\n    \"\"\"\n\n    name = \"negate\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n    description_template = \"the negation of {}\"\n\n    def get_function(self):\n        def negate(vals):\n            return vals * -1\n\n        return negate\n\n    def generate_name(self, base_feature_names):\n        return \"-(%s)\" % (base_feature_names[0])\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/numeric/percentile.py",
    "content": "from woodwork.column_schema import ColumnSchema\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass Percentile(TransformPrimitive):\n    \"\"\"Determines the percentile rank for each value in a list.\n\n    Examples:\n        >>> percentile = Percentile()\n        >>> percentile([10, 15, 1, 20]).tolist()\n        [0.5, 0.75, 0.25, 1.0]\n\n        NaN values are ignored when determining rank\n\n        >>> percentile([10, 15, 1, None, 20]).tolist()\n        [0.5, 0.75, 0.25, nan, 1.0]\n    \"\"\"\n\n    name = \"percentile\"\n    uses_full_dataframe = True\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n    description_template = \"the percentile rank of {}\"\n\n    def get_function(self):\n        return lambda array: array.rank(pct=True)\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/numeric/rate_of_change.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime, Double\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass RateOfChange(TransformPrimitive):\n    \"\"\"Computes the rate of change of a value per second.\n\n    Examples:\n        >>> import pandas as pd\n        >>> rate_of_change = RateOfChange()\n        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)\n        >>> results = rate_of_change([0, 30, 180, -90, 0], times).tolist()\n        >>> results = [round(x, 2) for x in results]\n        >>> results\n        [nan, 0.5, 2.5, -4.5, 1.5]\n    \"\"\"\n\n    name = \"rate_of_change\"\n    input_types = [\n        ColumnSchema(semantic_tags={\"numeric\"}),\n        ColumnSchema(logical_type=Datetime, semantic_tags={\"time_index\"}),\n    ]\n    return_type = ColumnSchema(logical_type=Double, semantic_tags={\"numeric\"})\n    uses_full_dataframe = True\n    description_template = \"the rate of change of {} per second\"\n\n    def get_function(self):\n        def rate_of_change(values, time):\n            time_delta = time.diff().dt.total_seconds()\n            value_delta = values.diff()\n            return value_delta / time_delta\n\n        return rate_of_change\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/numeric/same_as_previous.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import BooleanNullable\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass SameAsPrevious(TransformPrimitive):\n    \"\"\"Determines if a value is equal to the previous value in a list.\n\n    Description:\n        Compares a value in a list to the previous value and returns True if\n        the value is equal to the previous value or False otherwise. The\n        first item in the output will always be False, since there is no previous\n        element for the first element comparison.\n\n        Any nan values in the input will be filled using either a forward-fill\n        or backward-fill method, specified by the fill_method argument. The number\n        of consecutive nan values that get filled can be limited with the limit\n        argument. Any nan values left after filling will result in False being\n        returned for any comparison involving the nan value.\n\n    Args:\n        fill_method (str): Method for filling gaps in series. Valid\n        options are `backfill`, `bfill`, `pad`, `ffill`.\n        `pad / ffill`: fill gap with last valid observation.\n        `backfill / bfill`: fill gap with next valid observation.\n        Default is `pad`.\n\n        limit (int): The max number of consecutive NaN values in a gap that\n            can be filled. 
Default is None.\n\n    Examples:\n        >>> same_as_previous = SameAsPrevious()\n        >>> same_as_previous([1, 2, 2, 4]).tolist()\n        [False, False, True, False]\n\n        The fill method for nan values can be specified\n\n        >>> same_as_previous_fillna = SameAsPrevious(fill_method=\"bfill\")\n        >>> same_as_previous_fillna([1, None, 2, 4]).tolist()\n        [False, False, True, False]\n\n        The number of nan values that are filled can be limited\n\n        >>> same_as_previous_limitfill = SameAsPrevious(limit=2)\n        >>> same_as_previous_limitfill([1, None, None, None, 2, 3]).tolist()\n        [False, True, True, False, False, False]\n    \"\"\"\n\n    name = \"same_as_previous\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(BooleanNullable)\n\n    def __init__(self, fill_method=\"pad\", limit=None):\n        if fill_method not in [\"backfill\", \"bfill\", \"pad\", \"ffill\"]:\n            raise ValueError(\"Invalid fill_method\")\n        self.fill_method = fill_method\n        self.limit = limit\n\n    def get_function(self):\n        def same_as_previous(x):\n            x = x.fillna(method=self.fill_method, limit=self.limit)\n            x = x.eq(x.shift())\n            # first value will always be false, since there is no previous value\n            x.iloc[0] = False\n            return x\n\n        return same_as_previous\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/numeric/sine.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Double\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass Sine(TransformPrimitive):\n    \"\"\"Computes the sine of a number.\n\n    Examples:\n        >>> sin = Sine()\n        >>> sin([-np.pi/2.0, 0.0, np.pi/2.0]).tolist()\n        [-1.0, 0.0, 1.0]\n    \"\"\"\n\n    name = \"sine\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(logical_type=Double, semantic_tags={\"numeric\"})\n\n    description_template = \"the sine of {}\"\n\n    def get_function(self):\n        return np.sin\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/numeric/square_root.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Double\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass SquareRoot(TransformPrimitive):\n    \"\"\"Computes the square root of a number.\n\n    Examples:\n        >>> sqrt = SquareRoot()\n        >>> sqrt([9.0, 16.0, 4.0]).tolist()\n        [3.0, 4.0, 2.0]\n    \"\"\"\n\n    name = \"square_root\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(logical_type=Double, semantic_tags={\"numeric\"})\n\n    description_template = \"the square root of {}\"\n\n    def get_function(self):\n        return np.sqrt\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/numeric/tangent.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Double\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass Tangent(TransformPrimitive):\n    \"\"\"Computes the tangent of a number.\n\n    Examples:\n        >>> tan = Tangent()\n        >>> tan([-np.pi, 0.0, np.pi/2.0]).tolist()\n        [1.2246467991473532e-16, 0.0, 1.633123935319537e+16]\n    \"\"\"\n\n    name = \"tangent\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(logical_type=Double, semantic_tags={\"numeric\"})\n\n    description_template = \"the tangent of {}\"\n\n    def get_function(self):\n        return np.tan\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/percent_change.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Double\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass PercentChange(TransformPrimitive):\n    \"\"\"Determines the percent difference between values in a list.\n\n    Description:\n        Given a list of numbers, return the percent difference\n        between each subsequent number. Percentages are shown in\n        decimal form (not multiplied by 100). Uses pandas' pct_change\n        function.\n\n    Args:\n        periods (int): Periods to shift for calculating percent change.\n            Default is 1.\n\n        fill_method (str): Method for filling gaps in reindexed\n            Series. Valid options are `backfill`, `bfill`, `pad`, `ffill`.\n            `pad / ffill`: fill gap with last valid observation.\n            `backfill / bfill`: fill gap with next valid observation.\n            Default is `pad`.\n\n        limit (int): The max number of consecutive NaN values in a gap that\n            can be filled. Default is None.\n\n        freq (DateOffset, timedelta, or offset alias string):\n            If `freq` is specified, instead of calculating change between subsequent\n            points, PercentChange will calculate change between points with a\n            certain interval between their date indices. `freq` defines the\n            desired interval. When freq is used, the resulting index will also be\n            filled to include any missing dates from the specified interval.\n\n            If the index is not date/datetime and freq is used, it will raise a\n            NotImplementedError.\n\n            If freq is None, no changes will be applied. 
Default is None.\n\n    Examples:\n        >>> percent_change = PercentChange()\n        >>> percent_change([2, 5, 15, 3, 3, 9, 4.5]).to_list()\n        [nan, 1.5, 2.0, -0.8, 0.0, 2.0, -0.5]\n\n        We can control the number of periods to return the percent\n            difference between points further from one another.\n\n        >>> percent_change_2 = PercentChange(periods=2)\n        >>> percent_change_2([2, 5, 15, 3, 3, 9, 4.5]).to_list()\n        [nan, nan, 6.5, -0.4, -0.8, 2.0, 0.5]\n\n        We can control the method used to handle gaps in data.\n\n        >>> percent_change = PercentChange()\n        >>> percent_change([2, 4, 8, None, 16, None, 32, None]).to_list()\n        [nan, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]\n        >>> percent_change_backfill = PercentChange(fill_method='backfill')\n        >>> percent_change_backfill([2, 4, 8, None, 16, None, 32, None]).to_list()\n        [nan, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, nan]\n\n        We can also control the maximum number of NaN values to fill in a gap.\n\n        >>> percent_change = PercentChange()\n        >>> percent_change([2, None, None, None, 4]).to_list()\n        [nan, 0.0, 0.0, 0.0, 1.0]\n        >>> percent_change_limited = PercentChange(limit=2)\n        >>> percent_change_limited([2, None, None, None, 4]).to_list()\n        [nan, 0.0, 0.0, nan, nan]\n\n        Finally, we can specify a date frequency on which to calculate percent\n            change.\n\n        >>> import pandas as pd\n        >>> dates = pd.DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-05'])\n        >>> x_indexed = pd.Series([1, 2, 3, 4], index=dates)\n        >>> percent_change = PercentChange()\n        >>> percent_change(x_indexed).to_list()\n        [nan, 1.0, 0.5, 0.33333333333333326]\n        >>> date_offset = pd.tseries.offsets.DateOffset(days=1)\n        >>> percent_change_freq = PercentChange(freq=date_offset)\n        >>> percent_change_freq(x_indexed).to_list()\n        [nan, 1.0, 0.5, nan]\n 
   \"\"\"\n\n    name = \"percent_change\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(logical_type=Double, semantic_tags={\"numeric\"})\n\n    def __init__(self, periods=1, fill_method=\"pad\", limit=None, freq=None):\n        if fill_method not in [\"backfill\", \"bfill\", \"pad\", \"ffill\"]:\n            raise ValueError(\"Invalid fill_method\")\n        self.periods = periods\n        self.fill_method = fill_method\n        self.limit = limit\n        self.freq = freq\n\n    def get_function(self):\n        def percent_change(data):\n            return data.pct_change(\n                self.periods,\n                self.fill_method,\n                self.limit,\n                self.freq,\n            )\n\n        return percent_change\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/postal/__init__.py",
    "content": "from featuretools.primitives.standard.transform.postal.one_digit_postal_code import (\n    OneDigitPostalCode,\n)\nfrom featuretools.primitives.standard.transform.postal.two_digit_postal_code import (\n    TwoDigitPostalCode,\n)\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/postal/one_digit_postal_code.py",
    "content": "import pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Categorical, PostalCode\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass OneDigitPostalCode(TransformPrimitive):\n    \"\"\"Returns the one digit prefix of a given postal code.\n\n    Description:\n        Given a list of postal codes, returns the one digit prefix for each postal code.\n\n    Examples:\n        >>> one_digit_postal_code = OneDigitPostalCode()\n        >>> one_digit_postal_code(['92432', '34514']).tolist()\n        ['9', '3']\n    \"\"\"\n\n    name = \"one_digit_postal_code\"\n    input_types = [ColumnSchema(logical_type=PostalCode)]\n    return_type = ColumnSchema(logical_type=Categorical, semantic_tags={\"category\"})\n    description_template = \"The one digit postal code prefix of {}\"\n\n    def get_function(self):\n        def one_digit_postal_code(postal_codes):\n            def transform_postal_code(pc):\n                return str(pc)[0] if pd.notna(pc) else pd.NA\n\n            return postal_codes.apply(transform_postal_code)\n\n        return one_digit_postal_code\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/postal/two_digit_postal_code.py",
    "content": "import pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Categorical, PostalCode\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass TwoDigitPostalCode(TransformPrimitive):\n    \"\"\"Returns the two digit prefix of a given postal code.\n\n    Description:\n        Given a list of postal codes, returns the two digit prefix for each postal code.\n\n    Examples:\n        >>> two_digit_postal_code = TwoDigitPostalCode()\n        >>> two_digit_postal_code(['92432', '34514']).tolist()\n        ['92', '34']\n    \"\"\"\n\n    name = \"two_digit_postal_code\"\n    input_types = [ColumnSchema(logical_type=PostalCode)]\n\n    return_type = ColumnSchema(logical_type=Categorical, semantic_tags={\"category\"})\n    description_template = \"The two digit postal code prefix of {}\"\n\n    def get_function(self):\n        def two_digit_postal_code(postal_codes):\n            def transform_postal_code(pc):\n                return str(pc)[:2] if pd.notna(pc) else pd.NA\n\n            return postal_codes.apply(transform_postal_code)\n\n        return two_digit_postal_code\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/savgol_filter.py",
    "content": "from math import floor\n\nimport numpy as np\nfrom scipy.signal import savgol_coeffs, savgol_filter\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Double\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass SavgolFilter(TransformPrimitive):\n    \"\"\"Applies a Savitzky-Golay filter to a list of values.\n\n    Description:\n        Given a list of values, return a smoothed list which increases\n        the signal to noise ratio without greatly distorting the\n        signal. Uses the `Savitzky–Golay filter` method.\n\n        If the input list has fewer than 20 values, it will be returned\n        as is.\n\n        See the following page for more info:\n        https://docs.scipy.org/doc/scipy-0.16.0/reference/generated/scipy.signal.savgol_filter.html\n\n    Args:\n        window_length (int):  The length of the filter window (i.e. the number\n            of coefficients). `window_length` must be a positive odd integer.\n\n        polyorder (int): The order of the polynomial used to fit the samples.\n            `polyorder` must be less than `window_length`.\n\n        deriv (int): Optional. The order of the derivative to compute.  This\n            must be a nonnegative integer.  The default is 0, which means to\n            filter the data without differentiating.\n\n        delta (float): Optional. The spacing of the samples to which the filter\n            will be applied. This is only used if deriv > 0.  Default is 1.0.\n\n        mode (str): Optional. Must be 'mirror', 'constant', 'nearest', 'wrap'\n            or 'interp'.  This determines the type of extension to use for the\n            padded signal to which the filter is applied.  When `mode` is\n            'constant', the padding value is given by `cval`.  
See the Notes\n            for more details on 'mirror', 'constant', 'wrap', and 'nearest'.\n\n            When the 'interp' mode is selected (the default), no extension\n            is used.  Instead, a degree `polyorder` polynomial is fit to the\n            last `window_length` values of the edges, and this polynomial is\n            used to evaluate the last `window_length // 2` output values.\n\n        cval (scalar): Optional. Value to fill past the edges of the input\n            if `mode` is 'constant'. Default is 0.0.\n\n    Examples:\n        >>> savgol_filter = SavgolFilter()\n        >>> data = [0, 1, 1, 2, 3, 4, 5, 7, 8, 7, 9, 9, 12, 11, 12, 14, 15, 17, 17, 17, 20]\n        >>> [round(x, 4) for x in savgol_filter(data).tolist()[:3]]\n        [0.0429, 0.8286, 1.2571]\n\n        We can control `window_length` and `polyorder` of the filter.\n\n        >>> savgol_filter = SavgolFilter(window_length=13, polyorder=3)\n        >>> [round(x, 4) for x in savgol_filter(data).tolist()[:3]]\n        [-0.0962, 0.6484, 1.4451]\n\n        We can also control the `deriv` and `delta` parameters.\n\n        >>> savgol_filter = SavgolFilter(deriv=1, delta=1.5)\n        >>> [round(x, 4) for x in savgol_filter(data).tolist()[:3]]\n        [0.754, 0.3492, 0.2778]\n\n        Finally, we can use `mode` to control how edge values are handled.\n\n        >>> savgol_filter = SavgolFilter(mode='constant', cval=5)\n        >>> [round(x, 4) for x in savgol_filter(data).tolist()[:3]]\n        [1.5429, 0.2286, 1.2571]\n    \"\"\"\n\n    name = \"savgol_filter\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(logical_type=Double, semantic_tags={\"numeric\"})\n\n    def __init__(\n        self,\n        window_length=None,\n        polyorder=None,\n        deriv=0,\n        delta=1.0,\n        mode=\"interp\",\n        cval=0.0,\n    ):\n        if window_length is not None and polyorder is not None:\n            try:\n                if 
mode not in [\"mirror\", \"constant\", \"nearest\", \"interp\", \"wrap\"]:\n                    raise ValueError(\n                        \"mode must be 'mirror', 'constant', \"\n                        \"'nearest', 'wrap' or 'interp'.\",\n                    )\n                savgol_coeffs(window_length, polyorder, deriv=deriv, delta=delta)\n            except Exception:\n                raise\n        elif (window_length is None and polyorder is not None) or (\n            window_length is not None and polyorder is None\n        ):\n            error_text = (\n                \"Both window_length and polyorder must be defined if you define one.\"\n            )\n            raise ValueError(error_text)\n\n        self.window_length = window_length\n        self.polyorder = polyorder\n        self.deriv = deriv\n        self.delta = delta\n        self.mode = mode\n        self.cval = cval\n\n    def get_function(self):\n        def smooth(x):\n            if x.shape[0] < 20:\n                return x\n            if np.isnan(np.min(x)):\n                # interpolate the nan values, works for edges & middle nans\n                mask = np.isnan(x)\n                x[mask] = np.interp(\n                    np.flatnonzero(mask),\n                    np.flatnonzero(~mask),\n                    x[~mask],\n                )\n            window_length = self.window_length\n            polyorder = self.polyorder\n            if window_length is None and polyorder is None:\n                window_length = floor(len(x) / 10) * 2 + 1\n                polyorder = 3\n            return savgol_filter(\n                x,\n                window_length=window_length,\n                polyorder=polyorder,\n                deriv=self.deriv,\n                delta=self.delta,\n                mode=self.mode,\n                cval=self.cval,\n            )\n\n        return smooth\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/time_series/__init__.py",
    "content": "from featuretools.primitives.standard.transform.time_series.lag import Lag\nfrom featuretools.primitives.standard.transform.time_series.numeric_lag import (\n    NumericLag,\n)\nfrom featuretools.primitives.standard.transform.time_series.rolling_count import (\n    RollingCount,\n)\nfrom featuretools.primitives.standard.transform.time_series.rolling_max import (\n    RollingMax,\n)\nfrom featuretools.primitives.standard.transform.time_series.rolling_mean import (\n    RollingMean,\n)\nfrom featuretools.primitives.standard.transform.time_series.rolling_min import (\n    RollingMin,\n)\nfrom featuretools.primitives.standard.transform.time_series.rolling_outlier_count import (\n    RollingOutlierCount,\n)\nfrom featuretools.primitives.standard.transform.time_series.rolling_std import (\n    RollingSTD,\n)\nfrom featuretools.primitives.standard.transform.time_series.rolling_trend import (\n    RollingTrend,\n)\nfrom featuretools.primitives.standard.transform.time_series.expanding import (\n    ExpandingCount,\n    ExpandingMax,\n    ExpandingMean,\n    ExpandingMin,\n    ExpandingSTD,\n    ExpandingTrend,\n)\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/time_series/expanding/__init__.py",
    "content": "from featuretools.primitives.standard.transform.time_series.expanding.expanding_count import (\n    ExpandingCount,\n)\nfrom featuretools.primitives.standard.transform.time_series.expanding.expanding_max import (\n    ExpandingMax,\n)\nfrom featuretools.primitives.standard.transform.time_series.expanding.expanding_mean import (\n    ExpandingMean,\n)\nfrom featuretools.primitives.standard.transform.time_series.expanding.expanding_min import (\n    ExpandingMin,\n)\nfrom featuretools.primitives.standard.transform.time_series.expanding.expanding_std import (\n    ExpandingSTD,\n)\nfrom featuretools.primitives.standard.transform.time_series.expanding.expanding_trend import (\n    ExpandingTrend,\n)\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/time_series/expanding/expanding_count.py",
    "content": "import numpy as np\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime, IntegerNullable\n\nfrom featuretools.primitives.base.transform_primitive_base import TransformPrimitive\nfrom featuretools.primitives.standard.transform.time_series.utils import (\n    _apply_gap_for_expanding_primitives,\n)\n\n\nclass ExpandingCount(TransformPrimitive):\n    \"\"\"Computes the expanding count of events over a given window.\n\n    Description:\n        Given a list of datetimes, returns an expanding count starting\n        at the row `gap` rows away from the current row. An expanding\n        primitive calculates the value of a primitive for a given time\n        with all the data available up to the corresponding point in time.\n\n        Input datetimes should be monotonic.\n\n    Args:\n        gap (int, optional): Specifies a gap backwards from each instance before the\n            usable data begins. Corresponds to number of rows. Defaults to 1.\n        min_periods (int, optional): Minimum number of observations required for performing calculations\n            over the window. 
Defaults to 1.\n\n\n    Examples:\n        >>> import pandas as pd\n        >>> expanding_count = ExpandingCount()\n        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)\n        >>> expanding_count(times).tolist()\n        [nan, 1.0, 2.0, 3.0, 4.0]\n\n        We can also control the gap before the expanding calculation.\n\n        >>> import pandas as pd\n        >>> expanding_count = ExpandingCount(gap=0)\n        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)\n        >>> expanding_count(times).tolist()\n        [1.0, 2.0, 3.0, 4.0, 5.0]\n\n        We can also control the minimum number of periods required for the expanding calculation.\n\n        >>> import pandas as pd\n        >>> expanding_count = ExpandingCount(min_periods=3)\n        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)\n        >>> expanding_count(times).tolist()\n        [nan, nan, nan, 3.0, 4.0]\n    \"\"\"\n\n    name = \"expanding_count\"\n    input_types = [ColumnSchema(logical_type=Datetime, semantic_tags={\"time_index\"})]\n    return_type = ColumnSchema(logical_type=IntegerNullable, semantic_tags={\"numeric\"})\n    uses_full_dataframe = True\n\n    def __init__(self, gap=1, min_periods=1):\n        self.gap = gap\n        self.min_periods = min_periods\n\n    def get_function(self):\n        def expanding_count(datetime_series):\n            datetime_series = _apply_gap_for_expanding_primitives(\n                datetime_series,\n                self.gap,\n            )\n            count_series = datetime_series.expanding(\n                min_periods=self.min_periods,\n            ).count()\n            num_nans = self.gap + self.min_periods - 1\n            count_series[range(num_nans)] = np.nan\n            return count_series\n\n        return expanding_count\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/time_series/expanding/expanding_max.py",
    "content": "import pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime\n\nfrom featuretools.primitives.base.transform_primitive_base import TransformPrimitive\nfrom featuretools.primitives.standard.transform.time_series.utils import (\n    _apply_gap_for_expanding_primitives,\n)\n\n\nclass ExpandingMax(TransformPrimitive):\n    \"\"\"Computes the expanding maximum of events over a given window.\n\n    Description:\n        Given a list of datetimes, returns an expanding maximum starting\n        at the row `gap` rows away from the current row. An expanding\n        primitive calculates the value of a primitive for a given time\n        with all the data available up to the corresponding point in time.\n\n        Input datetimes should be monotonic.\n\n    Args:\n        gap (int, optional): Specifies a gap backwards from each instance before the\n            usable data begins. Corresponds to number of rows. Defaults to 1.\n        min_periods (int, optional): Minimum number of observations required for performing calculations\n            over the window. 
Defaults to 1.\n\n\n    Examples:\n        >>> import pandas as pd\n        >>> expanding_max = ExpandingMax()\n        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)\n        >>> expanding_max(times, [2, 4, 6, 7, 2]).tolist()\n        [nan, 2.0, 4.0, 6.0, 7.0]\n\n        We can also control the gap before the expanding calculation.\n\n        >>> import pandas as pd\n        >>> expanding_max = ExpandingMax(gap=0)\n        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)\n        >>> expanding_max(times, [2, 4, 6, 7, 2]).tolist()\n        [2.0, 4.0, 6.0, 7.0, 7.0]\n\n        We can also control the minimum number of periods required for the expanding calculation.\n\n        >>> import pandas as pd\n        >>> expanding_max = ExpandingMax(min_periods=3)\n        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)\n        >>> expanding_max(times, [2, 4, 6, 7, 2]).tolist()\n        [nan, nan, nan, 6.0, 7.0]\n    \"\"\"\n\n    name = \"expanding_max\"\n    input_types = [\n        ColumnSchema(logical_type=Datetime, semantic_tags={\"time_index\"}),\n        ColumnSchema(semantic_tags={\"numeric\"}),\n    ]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n    uses_full_dataframe = True\n\n    def __init__(self, gap=1, min_periods=1):\n        self.gap = gap\n        self.min_periods = min_periods\n\n    def get_function(self):\n        def expanding_max(datetime, numeric):\n            x = pd.Series(numeric.values, index=datetime)\n            x = _apply_gap_for_expanding_primitives(x, self.gap)\n            return x.expanding(min_periods=self.min_periods).max().values\n\n        return expanding_max\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/time_series/expanding/expanding_mean.py",
    "content": "import pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime, Double\n\nfrom featuretools.primitives.base.transform_primitive_base import TransformPrimitive\nfrom featuretools.primitives.standard.transform.time_series.utils import (\n    _apply_gap_for_expanding_primitives,\n)\n\n\nclass ExpandingMean(TransformPrimitive):\n    \"\"\"Computes the expanding mean of events over a given window.\n\n    Description:\n        Given a list of datetimes, returns an expanding mean starting\n        at the row `gap` rows away from the current row. An expanding\n        primitive calculates the value of a primitive for a given time\n        with all the data available up to the corresponding point in time.\n\n        Input datetimes should be monotonic.\n\n    Args:\n        gap (int, optional): Specifies a gap backwards from each instance before the\n            usable data begins. Corresponds to number of rows. Defaults to 1.\n        min_periods (int, optional): Minimum number of observations required for performing calculations\n            over the window. 
Defaults to 1.\n\n\n    Examples:\n        >>> import pandas as pd\n        >>> expanding_mean = ExpandingMean()\n        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)\n        >>> expanding_mean(times, [5, 4, 3, 2, 1]).tolist()\n        [nan, 5.0, 4.5, 4.0, 3.5]\n\n        We can also control the gap before the expanding calculation.\n\n        >>> import pandas as pd\n        >>> expanding_mean = ExpandingMean(gap=0)\n        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)\n        >>> expanding_mean(times, [5, 4, 3, 2, 1]).tolist()\n        [5.0, 4.5, 4.0, 3.5, 3.0]\n\n        We can also control the minimum number of periods required for the expanding calculation.\n\n        >>> import pandas as pd\n        >>> expanding_mean = ExpandingMean(min_periods=3)\n        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)\n        >>> expanding_mean(times, [5, 4, 3, 2, 1]).tolist()\n        [nan, nan, nan, 4.0, 3.5]\n    \"\"\"\n\n    name = \"expanding_mean\"\n    input_types = [\n        ColumnSchema(logical_type=Datetime, semantic_tags={\"time_index\"}),\n        ColumnSchema(semantic_tags={\"numeric\"}),\n    ]\n    return_type = ColumnSchema(logical_type=Double, semantic_tags={\"numeric\"})\n    uses_full_dataframe = True\n\n    def __init__(self, gap=1, min_periods=1):\n        self.gap = gap\n        self.min_periods = min_periods\n\n    def get_function(self):\n        def expanding_mean(datetime, numeric):\n            x = pd.Series(numeric.values, index=datetime)\n            x = _apply_gap_for_expanding_primitives(x, self.gap)\n            return x.expanding(min_periods=self.min_periods).mean().values\n\n        return expanding_mean\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/time_series/expanding/expanding_min.py",
    "content": "import pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime\n\nfrom featuretools.primitives.base.transform_primitive_base import TransformPrimitive\nfrom featuretools.primitives.standard.transform.time_series.utils import (\n    _apply_gap_for_expanding_primitives,\n)\n\n\nclass ExpandingMin(TransformPrimitive):\n    \"\"\"Computes the expanding minimum of events over a given window.\n\n    Description:\n        Given a list of datetimes, returns an expanding minimum starting\n        at the row `gap` rows away from the current row. An expanding\n        primitive calculates the value of a primitive for a given time\n        with all the data available up to the corresponding point in time.\n\n        Input datetimes should be monotonic.\n\n    Args:\n        gap (int, optional): Specifies a gap backwards from each instance before the\n            usable data begins. Corresponds to number of rows. Defaults to 1.\n        min_periods (int, optional): Minimum number of observations required for performing calculations\n            over the window. 
Defaults to 1.\n\n    Examples:\n        >>> import pandas as pd\n        >>> expanding_min = ExpandingMin()\n        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)\n        >>> expanding_min(times, [5, 4, 3, 2, 1]).tolist()\n        [nan, 5.0, 4.0, 3.0, 2.0]\n\n        We can also control the gap before the expanding calculation.\n\n        >>> import pandas as pd\n        >>> expanding_min = ExpandingMin(gap=0)\n        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)\n        >>> expanding_min(times, [5, 4, 3, 2, 1]).tolist()\n        [5.0, 4.0, 3.0, 2.0, 1.0]\n\n        We can also control the minimum number of periods required for the expanding calculation.\n\n        >>> import pandas as pd\n        >>> expanding_min = ExpandingMin(min_periods=3)\n        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)\n        >>> expanding_min(times, [5, 4, 3, 2, 1]).tolist()\n        [nan, nan, nan, 3.0, 2.0]\n    \"\"\"\n\n    name = \"expanding_min\"\n    input_types = [\n        ColumnSchema(logical_type=Datetime, semantic_tags={\"time_index\"}),\n        ColumnSchema(semantic_tags={\"numeric\"}),\n    ]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n    uses_full_dataframe = True\n\n    def __init__(self, gap=1, min_periods=1):\n        self.gap = gap\n        self.min_periods = min_periods\n\n    def get_function(self):\n        def expanding_min(datetime, numeric):\n            x = pd.Series(numeric.values, index=datetime)\n            x = _apply_gap_for_expanding_primitives(x, self.gap)\n            return x.expanding(min_periods=self.min_periods).min().values\n\n        return expanding_min\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/time_series/expanding/expanding_std.py",
    "content": "import pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime, Double\n\nfrom featuretools.primitives.base.transform_primitive_base import TransformPrimitive\nfrom featuretools.primitives.standard.transform.time_series.utils import (\n    _apply_gap_for_expanding_primitives,\n)\n\n\nclass ExpandingSTD(TransformPrimitive):\n    \"\"\"Computes the expanding standard deviation for events over a given window.\n\n    Description:\n        Given a list of datetimes, returns the expanding standard deviation\n        starting at the row `gap` rows away from the current row. An expanding\n        primitive calculates the value of a primitive for a given time\n        with all the data available up to the corresponding point in time.\n\n        Input datetimes should be monotonic.\n\n    Args:\n        gap (int, optional): Specifies a gap backwards from each instance before the\n            usable data begins. Corresponds to number of rows. Defaults to 1.\n        min_periods (int, optional): Minimum number of observations required for performing calculations\n            over the window. 
Defaults to 1.\n\n\n    Examples:\n        >>> import pandas as pd\n        >>> expanding_std = ExpandingSTD()\n        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)\n        >>> ans = expanding_std(times, [5, 4, 3, 2, 1]).tolist()\n        >>> [round(x, 2) for x in ans]\n        [nan, nan, 0.71, 1.0, 1.29]\n\n        We can also control the gap before the expanding calculation.\n\n        >>> import pandas as pd\n        >>> expanding_std = ExpandingSTD(gap=0)\n        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)\n        >>> ans = expanding_std(times, [5, 4, 3, 2, 1]).tolist()\n        >>> [round(x, 2) for x in ans]\n        [nan, 0.71, 1.0, 1.29, 1.58]\n\n        We can also control the minimum number of periods required for the expanding calculation.\n\n        >>> import pandas as pd\n        >>> expanding_std = ExpandingSTD(min_periods=3)\n        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)\n        >>> ans = expanding_std(times, [5, 4, 3, 2, 1]).tolist()\n        >>> [round(x, 2) for x in ans]\n        [nan, nan, nan, 1.0, 1.29]\n    \"\"\"\n\n    name = \"expanding_std\"\n    input_types = [\n        ColumnSchema(logical_type=Datetime, semantic_tags={\"time_index\"}),\n        ColumnSchema(semantic_tags={\"numeric\"}),\n    ]\n    return_type = ColumnSchema(logical_type=Double, semantic_tags={\"numeric\"})\n    uses_full_dataframe = True\n\n    def __init__(self, gap=1, min_periods=1):\n        self.gap = gap\n        self.min_periods = min_periods\n\n    def get_function(self):\n        def expanding_std(datetime, numeric):\n            x = pd.Series(numeric.values, index=datetime)\n            x = _apply_gap_for_expanding_primitives(x, self.gap)\n            return x.expanding(min_periods=self.min_periods).std().values\n\n        return expanding_std\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/time_series/expanding/expanding_trend.py",
    "content": "import pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime, Double\n\nfrom featuretools.primitives.base.transform_primitive_base import TransformPrimitive\nfrom featuretools.primitives.standard.transform.time_series.utils import (\n    _apply_gap_for_expanding_primitives,\n)\nfrom featuretools.utils import calculate_trend\n\n\nclass ExpandingTrend(TransformPrimitive):\n    \"\"\"Computes the expanding trend for events over a given window.\n\n    Description:\n        Given a list of datetimes, returns the expanding trend starting\n        at the row `gap` rows away from the current row. An expanding\n        primitive calculates the value of a primitive for a given time\n        with all the data available up to the corresponding point in time.\n\n        Input datetimes should be monotonic.\n\n    Args:\n        gap (int, optional): Specifies a gap backwards from each instance before the\n            usable data begins. Corresponds to number of rows. Defaults to 1.\n        min_periods (int, optional): Minimum number of observations required for performing calculations\n            over the window. 
Defaults to 1.\n\n\n    Examples:\n        >>> import pandas as pd\n        >>> expanding_trend = ExpandingTrend()\n        >>> times = pd.date_range(start='2019-01-01', freq='1D', periods=5)\n        >>> ans = expanding_trend(times, [5, 4, 3, 2, 1]).tolist()\n        >>> [round(x, 2) for x in ans]\n        [nan, nan, nan, -1.0, -1.0]\n\n        We can also control the gap before the expanding calculation.\n\n        >>> import pandas as pd\n        >>> expanding_trend = ExpandingTrend(gap=0)\n        >>> times = pd.date_range(start='2019-01-01', freq='1D', periods=5)\n        >>> ans = expanding_trend(times, [5, 4, 3, 2, 1]).tolist()\n        >>> [round(x, 2) for x in ans]\n        [nan, nan, -1.0, -1.0, -1.0]\n\n        We can also control the minimum number of periods required for the expanding calculation.\n\n        >>> import pandas as pd\n        >>> expanding_trend = ExpandingTrend(min_periods=3)\n        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)\n        >>> ans = expanding_trend(times, [50, 4, 13, 22, 10]).tolist()\n        >>> [round(x, 2) for x in ans]\n        [nan, nan, nan, -18.5, -7.5]\n    \"\"\"\n\n    name = \"expanding_trend\"\n    input_types = [\n        ColumnSchema(logical_type=Datetime, semantic_tags={\"time_index\"}),\n        ColumnSchema(semantic_tags={\"numeric\"}),\n    ]\n    return_type = ColumnSchema(logical_type=Double, semantic_tags={\"numeric\"})\n    uses_full_dataframe = True\n\n    def __init__(self, gap=1, min_periods=1):\n        self.gap = gap\n        self.min_periods = min_periods\n\n    def get_function(self):\n        def expanding_trend(datetime, numeric):\n            x = pd.Series(numeric.values, index=datetime)\n            x = _apply_gap_for_expanding_primitives(x, self.gap)\n            return (\n                x.expanding(min_periods=self.min_periods)\n                .aggregate(calculate_trend)\n                .values\n            )\n\n        return expanding_trend\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/time_series/lag.py",
    "content": "import pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Boolean, BooleanNullable\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass Lag(TransformPrimitive):\n    \"\"\"Shifts an array of values by a specified number of periods.\n\n    Args:\n        periods (int): The number of periods by which to shift the input.\n            Default is 1. Periods correspond to rows.\n\n    Examples:\n        >>> lag = Lag()\n        >>> lag([1, 2, 3, 4, 5], pd.Series(pd.date_range(start=\"2020-01-01\", periods=5, freq='D'))).tolist()\n        [nan, 1.0, 2.0, 3.0, 4.0]\n\n        You can specify the number of periods to shift the values\n\n        >>> lag_periods = Lag(periods=3)\n        >>> lag_periods([True, False, False, True, True], pd.Series(pd.date_range(start=\"2020-01-01\", periods=5, freq='D'))).tolist()\n        [nan, nan, nan, True, False]\n    \"\"\"\n\n    # Note: with pandas 1.5.0, using Lag with a string input will result in `None` values\n    # being introduced instead of `nan` values that were present in previous versions.\n    # All missing values will be replaced by `np.nan` (for Double) or `pd.NA` (all other types)\n    # once Woodwork is initialized on the feature matrix.\n    name = \"lag\"\n    input_types = [\n        [\n            ColumnSchema(semantic_tags={\"category\"}),\n            ColumnSchema(semantic_tags={\"time_index\"}),\n        ],\n        [\n            ColumnSchema(semantic_tags={\"numeric\"}),\n            ColumnSchema(semantic_tags={\"time_index\"}),\n        ],\n        [\n            ColumnSchema(logical_type=Boolean),\n            ColumnSchema(semantic_tags={\"time_index\"}),\n        ],\n        [\n            ColumnSchema(logical_type=BooleanNullable),\n            ColumnSchema(semantic_tags={\"time_index\"}),\n        ],\n    ]\n    return_type = None\n    uses_full_dataframe = True\n\n    def __init__(self, periods=1):\n        self.periods = 
periods\n\n    def get_function(self):\n        def lag(input_col, time_index):\n            x = pd.Series(input_col.values, index=time_index.values)\n            return x.shift(periods=self.periods, fill_value=None).values\n\n        return lag\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/time_series/numeric_lag.py",
    "content": "import warnings\n\nimport pandas as pd\nfrom woodwork.column_schema import ColumnSchema\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass NumericLag(TransformPrimitive):\n    \"\"\"Shifts an array of values by a specified number of periods.\n\n    Args:\n        periods (int): The number of periods by which to shift the input.\n            Default is 1. Periods correspond to rows.\n\n        fill_value (int, float, optional): The value to use to fill in\n            the gaps left after shifting the input. Default is None.\n\n    Examples:\n        >>> lag = NumericLag()\n        >>> lag(pd.Series(pd.date_range(start=\"2020-01-01\", periods=5, freq='D')), [1, 2, 3, 4, 5]).tolist()\n        [nan, 1.0, 2.0, 3.0, 4.0]\n\n        You can specify the number of periods to shift the values\n\n        >>> lag_periods = NumericLag(periods=3)\n        >>> lag_periods(pd.Series(pd.date_range(start=\"2020-01-01\", periods=5, freq='D')), [1, 2, 3, 4, 5]).tolist()\n        [nan, nan, nan, 1.0, 2.0]\n\n        You can specify the fill value to use\n\n        >>> lag_fill_value = NumericLag(fill_value=100)\n        >>> lag_fill_value(pd.Series(pd.date_range(start=\"2020-01-01\", periods=4, freq='D')), [1, 2, 3, 4]).tolist()\n        [100, 1, 2, 3]\n    \"\"\"\n\n    name = \"numeric_lag\"\n    input_types = [\n        ColumnSchema(semantic_tags={\"time_index\"}),\n        ColumnSchema(semantic_tags={\"numeric\"}),\n    ]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n    uses_full_dataframe = True\n\n    def __init__(self, periods=1, fill_value=None):\n        self.periods = periods\n        self.fill_value = fill_value\n        warnings.warn(\n            \"NumericLag is deprecated and will be removed in a future version. 
Please use the 'Lag' primitive instead.\",\n            FutureWarning,\n        )\n\n    def get_function(self):\n        def lag(time_index, numeric):\n            x = pd.Series(numeric.values, index=time_index.values)\n            return x.shift(periods=self.periods, fill_value=self.fill_value).values\n\n        return lag\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/time_series/rolling_count.py",
    "content": "import pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime, Double\n\nfrom featuretools.primitives.base.transform_primitive_base import TransformPrimitive\nfrom featuretools.primitives.standard.transform.time_series.utils import (\n    apply_rolling_agg_to_series,\n)\n\n\nclass RollingCount(TransformPrimitive):\n    \"\"\"Determines a rolling count of events over a given window.\n\n    Description:\n        Given a list of datetimes, return a rolling count starting\n        at the row `gap` rows away from the current row and looking backward over the specified\n        time window (by `window_length` and `gap`).\n\n        Input datetimes should be monotonic.\n\n    Args:\n        window_length (int, string, optional): Specifies the amount of data included in each window.\n            If an integer is provided, it will correspond to a number of rows. For data with a uniform sampling frequency,\n            for example of one day, the window_length will correspond to a period of time, in this case,\n            7 days for a window_length of 7.\n            If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc),\n            and it will indicate a length of time that each window should span.\n            The list of available offset aliases can be found at\n            https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases.\n            Defaults to 3.\n        gap (int, string, optional): Specifies a gap backwards from each instance before the\n            window of usable data begins. 
If an integer is provided, it will correspond to a number of rows.\n            If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc),\n            and it will indicate a length of time between a target instance and the beginning of its window.\n            Defaults to 1.\n        min_periods (int, optional): Minimum number of observations required for performing calculations\n            over the window. Can only be as large as window_length when window_length is an integer.\n            When window_length is an offset alias string, this limitation does not exist, but care should be taken\n            to not choose a min_periods that will always be larger than the number of observations in a window.\n            Defaults to 0.\n\n    Note:\n        Only offset aliases with fixed frequencies can be used when defining gap and window_length.\n        This means that aliases such as `M` or `W` cannot be used, as they can indicate different\n        numbers of days. ('M', because different months have different numbers of days;\n        'W' because week will indicate a certain day of the week, like W-Wed, so that will\n        indicate a different number of days depending on the anchoring date.)\n\n    Note:\n        When using an offset alias to define `gap`, an offset alias must also be used to define `window_length`.\n        This limitation does not exist when using an offset alias to define `window_length`. 
In fact,\n        if the data has a uniform sampling frequency, it is preferable to use a numeric `gap` as it is more\n        efficient.\n\n    Examples:\n        >>> import pandas as pd\n        >>> rolling_count = RollingCount(window_length=3)\n        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)\n        >>> rolling_count(times).tolist()\n        [nan, 1.0, 2.0, 3.0, 3.0]\n\n        We can also control the gap before the rolling calculation.\n\n        >>> import pandas as pd\n        >>> rolling_count = RollingCount(window_length=3, gap=0)\n        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)\n        >>> rolling_count(times).tolist()\n        [1.0, 2.0, 3.0, 3.0, 3.0]\n\n        We can also control the minimum number of periods required for the rolling calculation.\n\n        >>> import pandas as pd\n        >>> rolling_count = RollingCount(window_length=3, min_periods=3, gap=0)\n        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)\n        >>> rolling_count(times).tolist()\n        [nan, nan, 3.0, 3.0, 3.0]\n\n        We can also set the window_length and gap using offset alias strings.\n        >>> import pandas as pd\n        >>> rolling_count = RollingCount(window_length='3min', gap='1min')\n        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)\n        >>> rolling_count(times).tolist()\n        [nan, 1.0, 2.0, 3.0, 3.0]\n\n    \"\"\"\n\n    name = \"rolling_count\"\n    input_types = [ColumnSchema(logical_type=Datetime, semantic_tags={\"time_index\"})]\n    return_type = ColumnSchema(logical_type=Double, semantic_tags={\"numeric\"})\n    uses_full_dataframe = True\n\n    def __init__(self, window_length=3, gap=1, min_periods=0):\n        self.window_length = window_length\n        self.gap = gap\n        self.min_periods = min_periods\n\n    def get_function(self):\n        def rolling_count(datetime):\n            x = pd.Series(1, index=datetime)\n         
   return apply_rolling_agg_to_series(\n                x,\n                lambda series: series.count(),\n                self.window_length,\n                self.gap,\n                self.min_periods,\n                ignore_window_nans=True,\n            )\n\n        return rolling_count\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/time_series/rolling_max.py",
    "content": "import pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime, Double\n\nfrom featuretools.primitives.base.transform_primitive_base import TransformPrimitive\nfrom featuretools.primitives.standard.transform.time_series.utils import (\n    apply_rolling_agg_to_series,\n)\n\n\nclass RollingMax(TransformPrimitive):\n    \"\"\"Determines the maximum of entries over a given window.\n\n    Description:\n        Given a list of numbers and a corresponding list of\n        datetimes, return a rolling maximum of the numeric values,\n        starting at the row `gap` rows away from the current row and looking backward\n        over the specified window (by `window_length` and `gap`).\n\n        Input datetimes should be monotonic.\n\n    Args:\n        window_length (int, string, optional): Specifies the amount of data included in each window.\n            If an integer is provided, it will correspond to a number of rows. For data with a uniform sampling frequency,\n            for example of one day, the window_length will correspond to a period of time, in this case,\n            7 days for a window_length of 7.\n            If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc),\n            and it will indicate a length of time that each window should span.\n            The list of available offset aliases can be found at\n            https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases.\n            Defaults to 3.\n        gap (int, string, optional): Specifies a gap backwards from each instance before the\n            window of usable data begins. 
If an integer is provided, it will correspond to a number of rows.\n            If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc),\n            and it will indicate a length of time between a target instance and the beginning of its window.\n            Defaults to 1.\n        min_periods (int, optional): Minimum number of observations required for performing calculations\n            over the window. Can only be as large as window_length when window_length is an integer.\n            When window_length is an offset alias string, this limitation does not exist, but care should be taken\n            to not choose a min_periods that will always be larger than the number of observations in a window.\n            Defaults to 1.\n\n    Note:\n        Only offset aliases with fixed frequencies can be used when defining gap and window_length.\n        This means that aliases such as `M` or `W` cannot be used, as they can indicate different\n        numbers of days. ('M', because different months have different numbers of days;\n        'W' because week will indicate a certain day of the week, like W-Wed, so that will\n        indicate a different number of days depending on the anchoring date.)\n\n    Note:\n        When using an offset alias to define `gap`, an offset alias must also be used to define `window_length`.\n        This limitation does not exist when using an offset alias to define `window_length`. 
In fact,\n        if the data has a uniform sampling frequency, it is preferable to use a numeric `gap` as it is more\n        efficient.\n\n    Examples:\n        >>> import pandas as pd\n        >>> rolling_max = RollingMax(window_length=3)\n        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)\n        >>> rolling_max(times, [4, 3, 2, 1, 0]).tolist()\n        [nan, 4.0, 4.0, 4.0, 3.0]\n\n        We can also control the gap before the rolling calculation.\n\n        >>> import pandas as pd\n        >>> rolling_max = RollingMax(window_length=3, gap=0)\n        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)\n        >>> rolling_max(times, [4, 3, 2, 1, 0]).tolist()\n        [4.0, 4.0, 4.0, 3.0, 2.0]\n\n        We can also control the minimum number of periods required for the rolling calculation.\n\n        >>> import pandas as pd\n        >>> rolling_max = RollingMax(window_length=3, min_periods=3, gap=0)\n        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)\n        >>> rolling_max(times, [4, 3, 2, 1, 0]).tolist()\n        [nan, nan, 4.0, 3.0, 2.0]\n\n        We can also set the window_length and gap using offset alias strings.\n\n        >>> import pandas as pd\n        >>> rolling_max = RollingMax(window_length='3min', gap='1min')\n        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)\n        >>> rolling_max(times, [4, 3, 2, 1, 0]).tolist()\n        [nan, 4.0, 4.0, 4.0, 3.0]\n    \"\"\"\n\n    name = \"rolling_max\"\n    input_types = [\n        ColumnSchema(logical_type=Datetime, semantic_tags={\"time_index\"}),\n        ColumnSchema(semantic_tags={\"numeric\"}),\n    ]\n    return_type = ColumnSchema(logical_type=Double, semantic_tags={\"numeric\"})\n    uses_full_dataframe = True\n\n    def __init__(self, window_length=3, gap=1, min_periods=1):\n        self.window_length = window_length\n        self.gap = gap\n        self.min_periods = min_periods\n\n    def 
get_function(self):\n        def rolling_max(datetime, numeric):\n            x = pd.Series(numeric.values, index=datetime.values)\n            return apply_rolling_agg_to_series(\n                x,\n                lambda series: series.max(),\n                self.window_length,\n                self.gap,\n                self.min_periods,\n            )\n\n        return rolling_max\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/time_series/rolling_mean.py",
    "content": "import numpy as np\nimport pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime, Double\n\nfrom featuretools.primitives.base.transform_primitive_base import TransformPrimitive\nfrom featuretools.primitives.standard.transform.time_series.utils import (\n    apply_rolling_agg_to_series,\n)\n\n\nclass RollingMean(TransformPrimitive):\n    \"\"\"Calculates the mean of entries over a given window.\n\n    Description:\n        Given a list of numbers and a corresponding list of\n        datetimes, return a rolling mean of the numeric values,\n        starting at the row `gap` rows away from the current row and looking backward\n        over the specified time window (by `window_length` and `gap`).\n\n        Input datetimes should be monotonic.\n\n    Args:\n        window_length (int, string, optional): Specifies the amount of data included in each window.\n            If an integer is provided, it will correspond to a number of rows. For data with a uniform sampling frequency,\n            for example of one day, the window_length will correspond to a period of time, in this case,\n            7 days for a window_length of 7.\n            If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc),\n            and it will indicate a length of time that each window should span.\n            The list of available offset aliases can be found at\n            https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases.\n            Defaults to 3.\n        gap (int, string, optional): Specifies a gap backwards from each instance before the\n            window of usable data begins. 
If an integer is provided, it will correspond to a number of rows.\n            If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc),\n            and it will indicate a length of time between a target instance and the beginning of its window.\n            Defaults to 1.\n        min_periods (int, optional): Minimum number of observations required for performing calculations\n            over the window. Can only be as large as window_length when window_length is an integer.\n            When window_length is an offset alias string, this limitation does not exist, but care should be taken\n            to not choose a min_periods that will always be larger than the number of observations in a window.\n            Defaults to 0.\n\n    Note:\n        Only offset aliases with fixed frequencies can be used when defining gap and window_length.\n        This means that aliases such as `M` or `W` cannot be used, as they can indicate different\n        numbers of days. ('M', because different months have different numbers of days;\n        'W' because week will indicate a certain day of the week, like W-Wed, so that will\n        indicate a different number of days depending on the anchoring date.)\n\n    Note:\n        When using an offset alias to define `gap`, an offset alias must also be used to define `window_length`.\n        This limitation does not exist when using an offset alias to define `window_length`. 
In fact,\n        if the data has a uniform sampling frequency, it is preferable to use a numeric `gap` as it is more\n        efficient.\n\n    Examples:\n        >>> import pandas as pd\n        >>> rolling_mean = RollingMean(window_length=3)\n        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)\n        >>> rolling_mean(times, [4, 3, 2, 1, 0]).tolist()\n        [nan, 4.0, 3.5, 3.0, 2.0]\n\n        We can also control the gap before the rolling calculation.\n\n        >>> import pandas as pd\n        >>> rolling_mean = RollingMean(window_length=3, gap=0)\n        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)\n        >>> rolling_mean(times, [4, 3, 2, 1, 0]).tolist()\n        [4.0, 3.5, 3.0, 2.0, 1.0]\n\n        We can also control the minimum number of periods required for the rolling calculation.\n\n        >>> import pandas as pd\n        >>> rolling_mean = RollingMean(window_length=3, min_periods=3, gap=0)\n        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)\n        >>> rolling_mean(times, [4, 3, 2, 1, 0]).tolist()\n        [nan, nan, 3.0, 2.0, 1.0]\n    \"\"\"\n\n    name = \"rolling_mean\"\n    input_types = [\n        ColumnSchema(logical_type=Datetime, semantic_tags={\"time_index\"}),\n        ColumnSchema(semantic_tags={\"numeric\"}),\n    ]\n    return_type = ColumnSchema(logical_type=Double, semantic_tags={\"numeric\"})\n    uses_full_dataframe = True\n\n    def __init__(self, window_length=3, gap=1, min_periods=0):\n        self.window_length = window_length\n        self.gap = gap\n        self.min_periods = min_periods\n\n    def get_function(self):\n        def rolling_mean(datetime, numeric):\n            x = pd.Series(numeric.values, index=datetime.values)\n            return apply_rolling_agg_to_series(\n                x,\n                np.mean,\n                self.window_length,\n                self.gap,\n                self.min_periods,\n            )\n\n        
return rolling_mean\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/time_series/rolling_min.py",
    "content": "import pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime, Double\n\nfrom featuretools.primitives.base.transform_primitive_base import TransformPrimitive\nfrom featuretools.primitives.standard.transform.time_series.utils import (\n    apply_rolling_agg_to_series,\n)\n\n\nclass RollingMin(TransformPrimitive):\n    \"\"\"Determines the minimum of entries over a given window.\n\n    Description:\n        Given a list of numbers and a corresponding list of\n        datetimes, return a rolling minimum of the numeric values,\n        starting at the row `gap` rows away from the current row and looking backward\n        over the specified window (by `window_length` and `gap`).\n        Input datetimes should be monotonic.\n\n    Args:\n        window_length (int, string, optional): Specifies the amount of data included in each window.\n            If an integer is provided, it will correspond to a number of rows. For data with a uniform sampling frequency,\n            for example of one day, the window_length will correspond to a period of time, in this case,\n            7 days for a window_length of 7.\n            If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc),\n            and it will indicate a length of time that each window should span.\n            The list of available offset aliases can be found at\n            https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases.\n            Defaults to 3.\n        gap (int, string, optional): Specifies a gap backwards from each instance before the\n            window of usable data begins. 
If an integer is provided, it will correspond to a number of rows.\n            If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc),\n            and it will indicate a length of time between a target instance and the beginning of its window.\n            Defaults to 1.\n        min_periods (int, optional): Minimum number of observations required for performing calculations\n            over the window. Can only be as large as window_length when window_length is an integer.\n            When window_length is an offset alias string, this limitation does not exist, but care should be taken\n            to not choose a min_periods that will always be larger than the number of observations in a window.\n            Defaults to 1.\n\n    Note:\n        Only offset aliases with fixed frequencies can be used when defining gap and window_length.\n        This means that aliases such as `M` or `W` cannot be used, as they can indicate different\n        numbers of days. ('M', because different months have different numbers of days;\n        'W' because week will indicate a certain day of the week, like W-Wed, so that will\n        indicate a different number of days depending on the anchoring date.)\n\n    Note:\n        When using an offset alias to define `gap`, an offset alias must also be used to define `window_length`.\n        This limitation does not exist when using an offset alias to define `window_length`. 
In fact,\n        if the data has a uniform sampling frequency, it is preferable to use a numeric `gap` as it is more\n        efficient.\n\n    Examples:\n        >>> import pandas as pd\n        >>> rolling_min = RollingMin(window_length=3)\n        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)\n        >>> rolling_min(times, [4, 3, 2, 1, 0]).tolist()\n        [nan, 4.0, 3.0, 2.0, 1.0]\n\n        We can also control the gap before the rolling calculation.\n\n        >>> import pandas as pd\n        >>> rolling_min = RollingMin(window_length=3, gap=0)\n        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)\n        >>> rolling_min(times, [4, 3, 2, 1, 0]).tolist()\n        [4.0, 3.0, 2.0, 1.0, 0.0]\n\n        We can also control the minimum number of periods required for the rolling calculation.\n\n        >>> import pandas as pd\n        >>> rolling_min = RollingMin(window_length=3, min_periods=3, gap=0)\n        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)\n        >>> rolling_min(times, [4, 3, 2, 1, 0]).tolist()\n        [nan, nan, 2.0, 1.0, 0.0]\n\n        We can also set the window_length and gap using offset alias strings.\n\n        >>> import pandas as pd\n        >>> rolling_min = RollingMin(window_length='3min', gap='1min')\n        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)\n        >>> rolling_min(times, [4, 3, 2, 1, 0]).tolist()\n        [nan, 4.0, 3.0, 2.0, 1.0]\n    \"\"\"\n\n    name = \"rolling_min\"\n    input_types = [\n        ColumnSchema(logical_type=Datetime, semantic_tags={\"time_index\"}),\n        ColumnSchema(semantic_tags={\"numeric\"}),\n    ]\n    return_type = ColumnSchema(logical_type=Double, semantic_tags={\"numeric\"})\n    uses_full_dataframe = True\n\n    def __init__(self, window_length=3, gap=1, min_periods=1):\n        self.window_length = window_length\n        self.gap = gap\n        self.min_periods = min_periods\n\n    def 
get_function(self):\n        def rolling_min(datetime, numeric):\n            x = pd.Series(numeric.values, index=datetime.values)\n            return apply_rolling_agg_to_series(\n                x,\n                lambda series: series.min(),\n                self.window_length,\n                self.gap,\n                self.min_periods,\n            )\n\n        return rolling_min\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/time_series/rolling_outlier_count.py",
    "content": "import numpy as np\nimport pandas as pd\nfrom woodwork import init_series\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime, Double\n\nfrom featuretools.primitives.base.transform_primitive_base import TransformPrimitive\nfrom featuretools.primitives.standard.transform.time_series.utils import (\n    apply_rolling_agg_to_series,\n)\n\n\nclass RollingOutlierCount(TransformPrimitive):\n    \"\"\"Determines how many values are outliers over a given window.\n\n    Description:\n        Given a list of numbers and a corresponding list of\n        datetimes, return a rolling count of outliers within the numeric values,\n        starting at the row `gap` rows away from the current row and looking backward\n        over the specified window (by `window_length` and `gap`). Values are deemed\n        outliers using the IQR method, computed over the whole series.\n        Input datetimes should be monotonic.\n\n    Args:\n        window_length (int, string, optional): Specifies the amount of data included in each window.\n            If an integer is provided, it will correspond to a number of rows. For data with a uniform sampling\n            frequency, for example of one day, the window_length will correspond to a period of time, in this case,\n            7 days for a window_length of 7.\n            If a string is provided, it must be one of Pandas' offset alias strings ('1D', '1H', etc),\n            and it will indicate a length of time that each window should span.\n            The list of available offset aliases can be found at\n            https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases.\n            Defaults to 3.\n        gap (int, string, optional): Specifies a gap backwards from each instance before the\n            window of usable data begins. 
If an integer is provided, it will correspond to a number of rows.\n            If a string is provided, it must be one of Pandas' offset alias strings ('1D', '1H', etc),\n            and it will indicate a length of time between a target instance and the beginning of its window.\n            Defaults to 1, which excludes the target instance from the window.\n        min_periods (int, optional): Minimum number of observations required for performing calculations\n            over the window. Can only be as large as window_length when window_length is an integer.\n            When window_length is an offset alias string, this limitation does not exist, but care should be taken\n            to not choose a min_periods that will always be larger than the number of observations in a window.\n            Defaults to 0.\n\n    Note:\n        Only offset aliases with fixed frequencies can be used when defining gap and window_length.\n        This means that aliases such as `M` or `W` cannot be used, as they can indicate different\n        numbers of days. ('M', because different months have different numbers of days;\n        'W' because week will indicate a certain day of the week, like W-Wed, so that will\n        indicate a different number of days depending on the anchoring date.)\n\n    Note:\n        When using an offset alias to define `gap`, an offset alias must also be used to define `window_length`.\n        This limitation does not exist when using an offset alias to define `window_length`. 
In fact,\n        if the data has a uniform sampling frequency, it is preferable to use a numeric `gap` as it is more\n        efficient.\n\n    Examples:\n        >>> import pandas as pd\n        >>> rolling_outlier_count = RollingOutlierCount(window_length=4)\n        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=6)\n        >>> rolling_outlier_count(times, [0, 0, 0, 0, 10, 0]).tolist()\n        [nan, 0.0, 0.0, 0.0, 0.0, 1.0]\n\n        We can also control the gap before the rolling calculation.\n        >>> import pandas as pd\n        >>> rolling_outlier_count = RollingOutlierCount(window_length=4, gap=0)\n        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=6)\n        >>> rolling_outlier_count(times, [0, 0, 0, 0, 10, 0]).tolist()\n        [0.0, 0.0, 0.0, 0.0, 1.0, 1.0]\n\n        We can also control the minimum number of periods required for the rolling calculation.\n        >>> import pandas as pd\n        >>> rolling_outlier_count = RollingOutlierCount(window_length=4, min_periods=3)\n        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=6)\n        >>> rolling_outlier_count(times,  [0, 0, 0, 0, 10, 0]).tolist()\n        [nan, nan, nan, 0.0, 0.0, 1.0]\n\n        We can also set the window_length and gap using offset alias strings.\n        >>> import pandas as pd\n        >>> rolling_outlier_count = RollingOutlierCount(window_length='4min', gap='1min')\n        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=6)\n        >>> rolling_outlier_count(times, [0, 0, 0, 0, 10, 0]).tolist()\n        [nan, 0.0, 0.0, 0.0, 0.0, 1.0]\n    \"\"\"\n\n    name = \"rolling_outlier_count\"\n    input_types = [\n        ColumnSchema(logical_type=Datetime, semantic_tags={\"time_index\"}),\n        ColumnSchema(semantic_tags={\"numeric\"}),\n    ]\n    return_type = ColumnSchema(logical_type=Double, semantic_tags={\"numeric\"})\n    uses_full_dataframe = True\n\n    def __init__(self, 
window_length=3, gap=1, min_periods=0):\n        self.window_length = window_length\n        self.gap = gap\n        self.min_periods = min_periods\n\n    def get_outliers_count(self, numeric_series):\n        # We know the column is numeric, so use the Double logical type in case Woodwork's\n        # type inference could not infer a numeric type\n        if not len(numeric_series.dropna()):\n            return np.nan\n        if numeric_series.ww.schema is None:\n            numeric_series = init_series(numeric_series, logical_type=\"Double\")\n        box_plot_info = numeric_series.ww.box_plot_dict()\n        return len(box_plot_info[\"high_values\"]) + len(box_plot_info[\"low_values\"])\n\n    def get_function(self):\n        def rolling_outlier_count(datetime, numeric):\n            x = pd.Series(numeric.values, index=datetime.values)\n            return apply_rolling_agg_to_series(\n                series=x,\n                agg_func=self.get_outliers_count,\n                window_length=self.window_length,\n                gap=self.gap,\n                min_periods=self.min_periods,\n                ignore_window_nans=False,\n            )\n\n        return rolling_outlier_count\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/time_series/rolling_std.py",
    "content": "import pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime, Double\n\nfrom featuretools.primitives.base.transform_primitive_base import TransformPrimitive\nfrom featuretools.primitives.standard.transform.time_series.utils import (\n    apply_rolling_agg_to_series,\n)\n\n\nclass RollingSTD(TransformPrimitive):\n    \"\"\"Calculates the standard deviation of entries over a given window.\n\n    Description:\n        Given a list of numbers and a corresponding list of\n        datetimes, return a rolling standard deviation of\n        the numeric values, starting at the row `gap` rows away from the current row and\n        looking backward over the specified time window\n        (by `window_length` and `gap`). Input datetimes should be monotonic.\n\n    Args:\n        window_length (int, string, optional): Specifies the amount of data included in each window.\n            If an integer is provided, it will correspond to a number of rows. For data with a uniform sampling frequency,\n            for example of one day, the window_length will correspond to a period of time, in this case,\n            7 days for a window_length of 7.\n            If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc),\n            and it will indicate a length of time that each window should span.\n            The list of available offset aliases can be found at\n            https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases.\n            Defaults to 3.\n        gap (int, string, optional): Specifies a gap backwards from each instance before the\n            window of usable data begins. 
If an integer is provided, it will correspond to a number of rows.\n            If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc),\n            and it will indicate a length of time between a target instance and the beginning of its window.\n            Defaults to 1.\n        min_periods (int, optional): Minimum number of observations required for performing calculations\n            over the window. Can only be as large as window_length when window_length is an integer.\n            When window_length is an offset alias string, this limitation does not exist, but care should be taken\n            to not choose a min_periods that will always be larger than the number of observations in a window.\n            Defaults to 1.\n\n    Note:\n        Only offset aliases with fixed frequencies can be used when defining gap and window_length.\n        This means that aliases such as `M` or `W` cannot be used, as they can indicate different\n        numbers of days. ('M', because different months have different numbers of days;\n        'W' because week will indicate a certain day of the week, like W-Wed, so that will\n        indicate a different number of days depending on the anchoring date.)\n\n    Note:\n        When using an offset alias to define `gap`, an offset alias must also be used to define `window_length`.\n        This limitation does not exist when using an offset alias to define `window_length`. 
In fact,\n        if the data has a uniform sampling frequency, it is preferable to use a numeric `gap` as it is more\n        efficient.\n\n    Examples:\n        >>> import pandas as pd\n        >>> rolling_std = RollingSTD(window_length=4)\n        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)\n        >>> rolling_std(times, [4, 3, 2, 1, 0]).tolist()\n        [nan, nan, 0.7071067811865476, 1.0, 1.2909944487358056]\n\n        We can also control the gap before the rolling calculation.\n\n        >>> import pandas as pd\n        >>> rolling_std = RollingSTD(window_length=4, gap=0)\n        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)\n        >>> rolling_std(times, [4, 3, 2, 1, 0]).tolist()\n        [nan, 0.7071067811865476, 1.0, 1.2909944487358056, 1.2909944487358056]\n\n        We can also control the minimum number of periods required for the rolling calculation.\n\n        >>> import pandas as pd\n        >>> rolling_std = RollingSTD(window_length=4, min_periods=4, gap=0)\n        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)\n        >>> rolling_std(times, [4, 3, 2, 1, 0]).tolist()\n        [nan, nan, nan, 1.2909944487358056, 1.2909944487358056]\n\n        We can also set the window_length and gap using offset alias strings.\n        >>> import pandas as pd\n        >>> rolling_std = RollingSTD(window_length='4min', gap='1min')\n        >>> times = pd.date_range(start='2019-01-01', freq='1min', periods=5)\n        >>> rolling_std(times, [4, 3, 2, 1, 0]).tolist()\n        [nan, nan, 0.7071067811865476, 1.0, 1.2909944487358056]\n    \"\"\"\n\n    name = \"rolling_std\"\n    input_types = [\n        ColumnSchema(logical_type=Datetime, semantic_tags={\"time_index\"}),\n        ColumnSchema(semantic_tags={\"numeric\"}),\n    ]\n    return_type = ColumnSchema(logical_type=Double, semantic_tags={\"numeric\"})\n    uses_full_dataframe = True\n\n    def __init__(self, window_length=3, gap=1, 
min_periods=1):\n        self.window_length = window_length\n        self.gap = gap\n        self.min_periods = min_periods\n\n    def get_function(self):\n        def rolling_std(datetime, numeric):\n            x = pd.Series(numeric.values, index=datetime.values)\n            return apply_rolling_agg_to_series(\n                x,\n                lambda series: series.std(),\n                self.window_length,\n                self.gap,\n                self.min_periods,\n            )\n\n        return rolling_std\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/time_series/rolling_trend.py",
    "content": "import pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime, Double\n\nfrom featuretools.primitives.base.transform_primitive_base import TransformPrimitive\nfrom featuretools.primitives.standard.transform.time_series.utils import (\n    apply_rolling_agg_to_series,\n)\nfrom featuretools.utils import calculate_trend\n\n\nclass RollingTrend(TransformPrimitive):\n    \"\"\"Calculates the trend of a given window of entries of a column over time.\n\n    Description:\n        Given a list of numbers and a corresponding list of\n        datetimes, return a rolling slope of the linear trend\n        of values, starting at the row `gap` rows away from the current row and looking backward\n        over the specified time window (by `window_length` and `gap`).\n\n        Input datetimes should be monotonic.\n\n     Args:\n        window_length (int, string, optional): Specifies the amount of data included in each window.\n            If an integer is provided, it will correspond to a number of rows. For data with a uniform sampling frequency,\n            for example of one day, the window_length will correspond to a period of time, in this case,\n            7 days for a window_length of 7.\n            If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc),\n            and it will indicate a length of time that each window should span.\n            The list of available offset aliases can be found at\n            https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases.\n            Defaults to 3.\n        gap (int, string, optional): Specifies a gap backwards from each instance before the\n            window of usable data begins. 
If an integer is provided, it will correspond to a number of rows.\n            If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc),\n            and it will indicate a length of time between a target instance and the beginning of its window.\n            Defaults to 1.\n        min_periods (int, optional): Minimum number of observations required for performing calculations\n            over the window. Can only be as large as window_length when window_length is an integer.\n            When window_length is an offset alias string, this limitation does not exist, but care should be taken\n            to not choose a min_periods that will always be larger than the number of observations in a window.\n            Defaults to 0.\n\n    Examples:\n        >>> import pandas as pd\n        >>> rolling_trend = RollingTrend()\n        >>> times = pd.date_range(start=\"2019-01-01\", freq=\"1D\", periods=10)\n        >>> rolling_trend(times, [1, 2, 4, 8, 16, 24, 48, 96, 192, 384]).tolist()\n        [nan, nan, nan, 1.4999999999999998, 2.9999999999999996, 5.999999999999999, 7.999999999999999, 16.0, 36.0, 72.0]\n\n        We can also control the gap before the rolling calculation.\n\n        >>> rolling_trend = RollingTrend(gap=0)\n        >>> rolling_trend(times, [1, 2, 4, 8, 16, 24, 48, 96, 192, 384]).tolist()\n        [nan, nan, 1.4999999999999998, 2.9999999999999996, 5.999999999999999, 7.999999999999999, 16.0, 36.0, 72.0, 144.0]\n\n        We can also control the minimum number of periods required for the rolling calculation.\n\n        >>> rolling_trend = RollingTrend(window_length=4, min_periods=4, gap=0)\n        >>> rolling_trend(times, [1, 2, 4, 8, 16, 24, 48, 96, 192, 384]).tolist()\n        [nan, nan, nan, 2.299999999999999, 4.599999999999998, 6.799999999999996, 12.799999999999992, 26.399999999999984, 55.19999999999997, 110.39999999999993]\n\n        We can also set the window_length and gap using offset alias strings.\n\n        
>>> rolling_trend = RollingTrend(window_length=\"4D\", gap=\"1D\")\n        >>> rolling_trend(times, [1, 2, 4, 8, 16, 24, 48, 96, 192, 384]).tolist()\n        [nan, nan, nan, 1.4999999999999998, 2.299999999999999, 4.599999999999998, 6.799999999999996, 12.799999999999992, 26.399999999999984, 55.19999999999997]\n    \"\"\"\n\n    name = \"rolling_trend\"\n    input_types = [\n        ColumnSchema(logical_type=Datetime, semantic_tags={\"time_index\"}),\n        ColumnSchema(semantic_tags={\"numeric\"}),\n    ]\n    return_type = ColumnSchema(logical_type=Double, semantic_tags={\"numeric\"})\n    uses_full_dataframe = True\n\n    def __init__(self, window_length=3, gap=1, min_periods=0):\n        self.window_length = window_length\n        self.gap = gap\n        self.min_periods = min_periods\n\n    def get_function(self):\n        def rolling_trend(datetime, numeric):\n            x = pd.Series(numeric.values, index=datetime.values)\n            return apply_rolling_agg_to_series(\n                x,\n                calculate_trend,\n                self.window_length,\n                self.gap,\n                self.min_periods,\n            )\n\n        return rolling_trend\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/time_series/utils.py",
    "content": "from typing import Callable, Optional, Union\n\nimport numpy as np\nimport pandas as pd\nfrom pandas import Series\nfrom pandas.core.window.rolling import Rolling\nfrom pandas.tseries.frequencies import to_offset\n\n\ndef roll_series_with_gap(\n    series: Series,\n    window_length: Union[int, str],\n    gap: Union[int, str],\n    min_periods: int,\n) -> Rolling:\n    \"\"\"Provide rolling window calculations where the windows are determined using both a gap parameter\n    that indicates the amount of time between each instance and its window and a window length parameter\n    that determines the amount of data in each window.\n\n    Args:\n        series (Series): The series over which rolling windows will be created. The series must have numeric values and a DatetimeIndex.\n        window_length (int, string): Specifies the amount of data included in each window.\n            If an integer is provided, it will correspond to a number of rows. For data with a uniform sampling frequency,\n            for example of one day, the window_length will correspond to a period of time, in this case,\n            7 days for a window_length of 7.\n            If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc),\n            and it will indicate a length of time that each window should span.\n            The list of available offset aliases can be found at\n            https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases\n        gap (int, string, optional): Specifies a gap backwards from each instance before the\n            window of usable data begins. 
If an integer is provided, it will correspond to a number of rows.\n            If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc),\n            and it will indicate a length of time between a target instance and the beginning of its window.\n            A gap of 0 will include the target instance in the window.\n        min_periods (int): Minimum number of observations required for performing calculations\n            over the window. Can only be as large as window_length when window_length is an integer.\n            When window_length is an offset alias string, this limitation does not exist, but care should be taken\n            to not choose a min_periods that will always be larger than the number of observations in a window.\n\n    Returns:\n        pandas.core.window.rolling.Rolling: The Rolling object for the series passed in.\n    \"\"\"\n    _check_window_length(window_length)\n    _check_gap(window_length, gap)\n\n    functional_window_length = window_length\n    if isinstance(gap, str):\n        # Add the window_length and gap so that the rolling operation correctly takes gap into account.\n        # That way, we can later remove the gap rows in order to apply the primitive function\n        # to the correct window\n        functional_window_length = to_offset(window_length) + to_offset(gap)\n    elif gap > 0:\n        # When gap is numeric, we can apply a shift to incorporate gap right now\n        # since the gap will be the same number of rows for the whole dataset\n        series = series.shift(gap)\n\n    return series.rolling(functional_window_length, min_periods)\n\n\ndef _get_rolled_series_without_gap(window: Series, gap_offset: str) -> Series:\n    \"\"\"Applies the gap offset_string to the rolled window, returning a window\n    that is the correct length of time away from the original instance.\n\n    Args:\n        window (Series): A rolling window that 
includes both the window length and gap spans of time.\n        gap_offset (string): The pandas offset alias that determines how much time at the end of the window\n            should be removed.\n\n    Returns:\n        Series: The window with gap rows removed\n    \"\"\"\n    if not len(window):\n        return window\n\n    window_start_date = window.index[0]\n    window_end_date = window.index[-1]\n\n    gap_bound = window_end_date - to_offset(gap_offset)\n\n    # If the gap is larger than the series, no rows are left in the window\n    if gap_bound < window_start_date:\n        return Series(dtype=\"float64\")\n\n    # Only return the rows that are within the offset's bounds\n    return window[window.index <= gap_bound]\n\n\ndef apply_roll_with_offset_gap(\n    window: Series,\n    gap_offset: str,\n    reducer_fn: Callable[[Series], float],\n    min_periods: int,\n) -> float:\n    \"\"\"Takes in a series to which an offset gap will be applied, removing however many\n    rows fall under the gap before applying the reducing function.\n\n    Args:\n        window (Series):  A rolling window that includes both the window length and gap spans of time.\n        gap_offset (string): The pandas offset alias that determines how much time at the end of the window\n            should be removed.\n        reducer_fn (callable[Series -> float]): The function to be applied to the window in order to produce\n            the aggregate that will be included in the resulting feature.\n        min_periods (int): Minimum number of observations required for performing calculations\n            over the window.\n\n    Returns:\n        float: The aggregate value to be used as a feature value.\n    \"\"\"\n    window = _get_rolled_series_without_gap(window, gap_offset)\n\n    if min_periods is None:\n        min_periods = 1\n\n    if len(window) < min_periods or not len(window):\n        return np.nan\n\n    return reducer_fn(window)\n\n\ndef _check_window_length(window_length: 
Union[int, str]) -> None:\n    # Window length must either be a valid offset alias\n    if isinstance(window_length, str):\n        try:\n            to_offset(window_length)\n        except ValueError:\n            raise ValueError(\n                f\"Cannot roll series. The specified window length, {window_length}, is not a valid offset alias.\",\n            )\n    # Or an integer greater than zero\n    elif isinstance(window_length, int):\n        if window_length <= 0:\n            raise ValueError(\"Window length must be greater than zero.\")\n    else:\n        raise TypeError(\"Window length must be either an offset string or an integer.\")\n\n\ndef _check_gap(window_length: Union[int, str], gap: Union[int, str]) -> None:\n    # Gap must either be a valid offset string that also has an offset string window length\n    if isinstance(gap, str):\n        if not isinstance(window_length, str):\n            raise TypeError(\n                f\"Cannot roll series with offset gap, {gap}, and numeric window length, {window_length}. \"\n                \"If an offset alias is used for gap, the window length must also be defined as an offset alias. \"\n                \"Please either change gap to be numeric or change window length to be an offset alias.\",\n            )\n        try:\n            to_offset(gap)\n        except ValueError:\n            raise ValueError(\n                f\"Cannot roll series. 
The specified gap, {gap}, is not a valid offset alias.\",\n            )\n    # Or an integer greater than or equal to zero\n    elif isinstance(gap, int):\n        if gap < 0:\n            raise ValueError(\"Gap must be greater than or equal to zero.\")\n    else:\n        raise TypeError(\"Gap must be either an offset string or an integer.\")\n\n\ndef apply_rolling_agg_to_series(\n    series: Series,\n    agg_func: Callable[[Series], float],\n    window_length: Union[int, str],\n    gap: Union[int, str] = 0,\n    min_periods: int = 1,\n    ignore_window_nans: bool = False,\n) -> np.ndarray:\n    \"\"\"Applies a given aggregation function to a rolled series.\n\n    Args:\n        series (Series): The series over which rolling windows will be created. The series must have numeric values and a DatetimeIndex.\n        agg_func (callable[Series -> float]): The aggregation function to apply to a rolled series.\n        window_length (int, string): Specifies the amount of data included in each window.\n            If an integer is provided, it will correspond to a number of rows. For data with a uniform sampling frequency,\n            for example of one day, the window_length will correspond to a period of time, in this case,\n            7 days for a window_length of 7.\n            If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc),\n            and it will indicate a length of time that each window should span.\n            The list of available offset aliases can be found at\n            https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases\n        gap (int, string, optional): Specifies a gap backwards from each instance before the\n            window of usable data begins. 
If an integer is provided, it will correspond to a number of rows.\n            If a string is provided, it must be one of pandas' offset alias strings ('1D', '1H', etc),\n            and it will indicate a length of time between a target instance and the beginning of its window.\n            Defaults to 0, which will include the target instance in the window.\n        min_periods (int, optional): Minimum number of observations required for performing calculations\n            over the window. Can only be as large as window_length when window_length is an integer.\n            When window_length is an offset alias string, this limitation does not exist, but care should be taken\n            to not choose a min_periods that will always be larger than the number of observations in a window.\n            Defaults to 1.\n        ignore_window_nans (bool, optional): Whether or not NaNs in the rolling window should be included in the rolling calculation.\n            NaNs by default get counted towards min_periods. When set to True,\n            all partial values calculated by `agg_func` in the rolling window get replaced with NaN.\n            Defaults to False.\n\n    Returns:\n        numpy.ndarray: The array of rolling calculated values.\n\n    Note:\n        Certain operations, like `pandas.core.window.rolling.Rolling.count` that can be performed\n        on the Rolling object returned here may treat NaNs as periods to include in window calculations.\n        So a window [NaN, 1, 3] when `min_periods=3` will proceed with count, saying there are three periods\n        but only two values and would return count=2. The calculation `max` on the other hand,\n        would not recognize NaN as a valid period, and would therefore return `max=NaN` as the window has\n        fewer valid periods (two, in this case) than `min_periods` (three, in this case).\n        Most rolling calculations act this way. 
The implication of that here is that in order to\n        achieve the gap, we insert NaNs at the beginning of the series, which would cause `count` to calculate\n        on windows that technically should not have the correct number of periods. Any primitive that uses this function\n        should determine whether `ignore_window_nans` should be set to `True`.\n\n    Note:\n        Only offset aliases with fixed frequencies can be used when defining gap and window_length.\n        This means that aliases such as `M` or `W` cannot be used, as they can indicate different\n        numbers of days. ('M', because different months have different numbers of days;\n        'W' because week will indicate a certain day of the week, like W-Wed, so that will\n        indicate a different number of days depending on the anchoring date.)\n\n    Note:\n        When using an offset alias to define `gap`, an offset alias must also be used to define `window_length`.\n        This limitation does not exist when using an offset alias to define `window_length`. 
In fact,\n        if the data has a uniform sampling frequency, it is preferable to use a numeric `gap` as it is more\n        efficient.\"\"\"\n    rolled_series = roll_series_with_gap(series, window_length, gap, min_periods)\n    if isinstance(gap, str):\n        additional_args = (gap, agg_func, min_periods)\n        return rolled_series.apply(\n            apply_roll_with_offset_gap,\n            args=additional_args,\n        ).values\n    applied_rolled_series = rolled_series.apply(agg_func)\n\n    if ignore_window_nans:\n        if not min_periods:\n            # when min periods is 0 or None it's treated the same as if it's 1\n            num_nans = gap\n        else:\n            num_nans = min_periods - 1 + gap\n        applied_rolled_series.iloc[range(num_nans)] = np.nan\n    return applied_rolled_series.values\n\n\ndef _apply_gap_for_expanding_primitives(\n    x: Union[Series, pd.Index],\n    gap: Union[int, str],\n) -> Optional[Series]:\n    if not isinstance(gap, int):\n        raise TypeError(\n            \"String offsets are not supported for the gap parameter in Expanding primitives\",\n        )\n    if isinstance(x, pd.Index):\n        return x.to_series().shift(gap)\n    return x.shift(gap)\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/url/__init__.py",
    "content": "from featuretools.primitives.standard.transform.url.url_to_domain import URLToDomain\nfrom featuretools.primitives.standard.transform.url.url_to_protocol import URLToProtocol\nfrom featuretools.primitives.standard.transform.url.url_to_tld import URLToTLD\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/url/url_to_domain.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import URL, Categorical\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass URLToDomain(TransformPrimitive):\n    \"\"\"Determines the domain of a url.\n\n    Description:\n        Calculates the label to identify the network domain of a URL. Supports\n        urls with or without protocol as well as international country domains.\n\n    Examples:\n        >>> url_to_domain = URLToDomain()\n        >>> urls =  ['https://play.google.com',\n        ...          'http://www.google.co.in',\n        ...          'www.facebook.com']\n        >>> url_to_domain(urls).tolist()\n        ['play.google.com', 'google.co.in', 'facebook.com']\n    \"\"\"\n\n    name = \"url_to_domain\"\n    input_types = [ColumnSchema(logical_type=URL)]\n    return_type = ColumnSchema(logical_type=Categorical, semantic_tags={\"category\"})\n\n    def get_function(self):\n        def url_to_domain(x):\n            p = r\"^(?:https?:\\/\\/)?(?:[^@\\/\\n]+@)?(?:www\\.)?([^:\\/?\\n]+)\"\n            return x.str.extract(p, expand=False)\n\n        return url_to_domain\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/url/url_to_protocol.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import URL, Categorical\n\nfrom featuretools.primitives.base import TransformPrimitive\n\n\nclass URLToProtocol(TransformPrimitive):\n    \"\"\"Determines the protocol (http or https) of a url.\n\n    Description:\n        Extract the protocol of a url using regex.\n        It will be either https or http. Returns nan if\n        the url doesn't contain a protocol.\n\n    Examples:\n        >>> url_to_protocol = URLToProtocol()\n        >>> urls =  ['https://play.google.com',\n        ...          'http://www.google.co.in',\n        ...          'www.facebook.com']\n        >>> url_to_protocol(urls).to_list()\n        ['https', 'http', nan]\n    \"\"\"\n\n    name = \"url_to_protocol\"\n    input_types = [ColumnSchema(logical_type=URL)]\n    return_type = ColumnSchema(logical_type=Categorical, semantic_tags={\"category\"})\n\n    def get_function(self):\n        def url_to_protocol(x):\n            p = r\"^(https|http)(?:\\:)\"\n            return x.str.extract(p, expand=False)\n\n        return url_to_protocol\n"
  },
  {
    "path": "featuretools/primitives/standard/transform/url/url_to_tld.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import URL, Categorical\n\nfrom featuretools.primitives.base import TransformPrimitive\nfrom featuretools.utils.common_tld_utils import COMMON_TLDS\n\n\nclass URLToTLD(TransformPrimitive):\n    \"\"\"Determines the top level domain of a url.\n\n    Description:\n        Extract the top level domain of a url, using regex,\n        and a list of common top level domains. Returns nan if\n        the url is invalid or null.\n        Common top level domains were pulled from this list:\n        https://www.hayksaakian.com/most-popular-tlds/\n\n    Examples:\n        >>> url_to_tld = URLToTLD()\n        >>> urls = ['https://www.google.com', 'http://www.google.co.in',\n        ...         'www.facebook.com']\n        >>> url_to_tld(urls).to_list()\n        ['com', 'in', 'com']\n    \"\"\"\n\n    name = \"url_to_tld\"\n    input_types = [ColumnSchema(logical_type=URL)]\n    return_type = ColumnSchema(logical_type=Categorical, semantic_tags={\"category\"})\n\n    def get_function(self):\n        self.tlds_pattern = r\"(?:\\.({}))\".format(\"|\".join(COMMON_TLDS))\n\n        def url_to_domain(x):\n            p = r\"^(?:https?:\\/\\/)?(?:[^@\\/\\n]+@)?(?:www\\.)?([^:\\/?\\n]+)\"\n            return x.str.extract(p, expand=False)\n\n        def url_to_tld(x):\n            domains = url_to_domain(x)\n            df = domains.str.extractall(self.tlds_pattern)\n            matches = df.groupby(level=0).last()[0]\n            return matches.reindex(x.index)\n\n        return url_to_tld\n"
  },
  {
    "path": "featuretools/primitives/utils.py",
    "content": "import importlib.util\nimport os\nfrom inspect import getfullargspec, getsource, isclass\nfrom typing import Dict, List\n\nimport pandas as pd\nfrom woodwork import list_logical_types, list_semantic_tags, type_system\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import NaturalLanguage\n\nimport featuretools\nfrom featuretools.primitives import NumberOfCommonWords\nfrom featuretools.primitives.base import (\n    AggregationPrimitive,\n    PrimitiveBase,\n    TransformPrimitive,\n)\nfrom featuretools.utils.gen_utils import find_descendents\n\n\ndef _get_primitives(primitive_kind):\n    \"\"\"Helper function that selects all primitives\n    that are instances of `primitive_kind`\n    \"\"\"\n    primitives = set()\n    for attribute_string in dir(featuretools.primitives):\n        attribute = getattr(featuretools.primitives, attribute_string)\n        if isclass(attribute):\n            if issubclass(attribute, primitive_kind) and attribute.name:\n                primitives.add(attribute)\n    return {prim.name.lower(): prim for prim in primitives}\n\n\ndef get_aggregation_primitives():\n    \"\"\"Returns all aggregation primitives\"\"\"\n    return _get_primitives(featuretools.primitives.AggregationPrimitive)\n\n\ndef get_transform_primitives():\n    \"\"\"Returns all transform primitives\"\"\"\n    return _get_primitives(featuretools.primitives.TransformPrimitive)\n\n\ndef get_all_primitives():\n    \"\"\"Helper function to return all primitives\"\"\"\n    primitives = set()\n    for attribute_string in dir(featuretools.primitives):\n        attribute = getattr(featuretools.primitives, attribute_string)\n        if isclass(attribute):\n            if issubclass(attribute, PrimitiveBase) and attribute.name:\n                primitives.add(attribute)\n    return {prim.__name__: prim for prim in primitives}\n\n\ndef _get_natural_language_primitives():\n    \"\"\"Returns all Natural Language transform primitives\"\"\"\n    
transform_primitives = get_transform_primitives()\n\n    def _natural_language_in_input_type(primitive):\n        for input_type in primitive.input_types:\n            if isinstance(input_type, list):\n                if any(\n                    isinstance(column_schema.logical_type, NaturalLanguage)\n                    for column_schema in input_type\n                ):\n                    return True\n            else:\n                if isinstance(input_type.logical_type, NaturalLanguage):\n                    return True\n        return False\n\n    return {\n        name: primitive\n        for name, primitive in transform_primitives.items()\n        if _natural_language_in_input_type(primitive)\n    }\n\n\ndef list_primitives():\n    \"\"\"Returns a DataFrame that lists and describes each built-in primitive.\"\"\"\n    trans_names, trans_primitives, valid_inputs, return_type = _get_names_primitives(\n        get_transform_primitives,\n    )\n    transform_df = pd.DataFrame(\n        {\n            \"name\": trans_names,\n            \"description\": _get_descriptions(trans_primitives),\n            \"valid_inputs\": valid_inputs,\n            \"return_type\": return_type,\n        },\n    )\n    transform_df[\"type\"] = \"transform\"\n\n    agg_names, agg_primitives, valid_inputs, return_type = _get_names_primitives(\n        get_aggregation_primitives,\n    )\n    agg_df = pd.DataFrame(\n        {\n            \"name\": agg_names,\n            \"description\": _get_descriptions(agg_primitives),\n            \"valid_inputs\": valid_inputs,\n            \"return_type\": return_type,\n        },\n    )\n    agg_df[\"type\"] = \"aggregation\"\n\n    columns = [\n        \"name\",\n        \"type\",\n        \"description\",\n        \"valid_inputs\",\n        \"return_type\",\n    ]\n    return pd.concat([agg_df, transform_df], ignore_index=True)[columns]\n\n\ndef summarize_primitives() -> pd.DataFrame:\n    \"\"\"Returns a metrics summary DataFrame of all 
primitives found in list_primitives.\"\"\"\n    (\n        trans_names,\n        trans_primitives,\n        trans_valid_inputs,\n        trans_return_type,\n    ) = _get_names_primitives(get_transform_primitives)\n\n    (\n        agg_names,\n        agg_primitives,\n        agg_valid_inputs,\n        agg_return_type,\n    ) = _get_names_primitives(get_aggregation_primitives)\n\n    tot_trans = len(trans_names)\n    tot_agg = len(agg_names)\n    tot_prims = tot_trans + tot_agg\n    all_primitives = trans_primitives + agg_primitives\n    primitives_summary = _get_summary_primitives(all_primitives)\n    summary_dict = {\n        \"total_primitives\": tot_prims,\n        \"aggregation_primitives\": tot_agg,\n        \"transform_primitives\": tot_trans,\n        **primitives_summary[\"general_metrics\"],\n    }\n    summary_dict.update(\n        {\n            f\"uses_{ltype}_input\": count\n            for ltype, count in primitives_summary[\"logical_type_input_metrics\"].items()\n        },\n    )\n    summary_dict.update(\n        {\n            f\"uses_{tag}_tag_input\": count\n            for tag, count in primitives_summary[\"semantic_tag_metrics\"].items()\n        },\n    )\n    summary_df = pd.DataFrame(\n        [{\"Metric\": k, \"Count\": v} for k, v in summary_dict.items()],\n    )\n    return summary_df\n\n\ndef get_default_aggregation_primitives():\n    agg_primitives = [\n        featuretools.primitives.Sum,\n        featuretools.primitives.Std,\n        featuretools.primitives.Max,\n        featuretools.primitives.Skew,\n        featuretools.primitives.Min,\n        featuretools.primitives.Mean,\n        featuretools.primitives.Count,\n        featuretools.primitives.PercentTrue,\n        featuretools.primitives.NumUnique,\n        featuretools.primitives.Mode,\n    ]\n    return agg_primitives\n\n\ndef get_default_transform_primitives():\n    # featuretools.primitives.TimeSince\n    trans_primitives = [\n        featuretools.primitives.Age,\n        
featuretools.primitives.Day,\n        featuretools.primitives.Year,\n        featuretools.primitives.Month,\n        featuretools.primitives.Weekday,\n        featuretools.primitives.Haversine,\n        featuretools.primitives.NumWords,\n        featuretools.primitives.NumCharacters,\n    ]\n    return trans_primitives\n\n\ndef _get_descriptions(primitives):\n    descriptions = []\n    for prim in primitives:\n        description = \"\"\n        if prim.__doc__ is not None:\n            # Break on the empty line between the docstring description and the remainder of the docstring\n            description = prim.__doc__.split(\"\\n\\n\")[0]\n            # remove any excess whitespace from line breaks\n            description = \" \".join(description.split())\n        descriptions.append(description)\n    return descriptions\n\n\ndef _get_summary_primitives(primitives: List) -> Dict[str, int]:\n    \"\"\"Provides metrics for a list of primitives.\"\"\"\n    unique_input_types = set()\n    unique_output_types = set()\n    uses_multi_input = 0\n    uses_multi_output = 0\n    uses_external_data = 0\n    are_controllable = 0\n    logical_type_metrics = {\n        log_type: 0 for log_type in list(list_logical_types()[\"type_string\"])\n    }\n    semantic_tag_metrics = {\n        sem_tag: 0 for sem_tag in list(list_semantic_tags()[\"name\"])\n    }\n    semantic_tag_metrics.update(\n        {\"foreign_key\": 0},\n    )  # not currently in list_semantic_tags()\n\n    for prim in primitives:\n        log_in_type_checks = set()\n        sem_tag_type_checks = set()\n        input_types = prim.flatten_nested_input_types(prim.input_types)\n        _check_input_types(\n            input_types,\n            log_in_type_checks,\n            sem_tag_type_checks,\n            unique_input_types,\n        )\n        for ltype in list(log_in_type_checks):\n            logical_type_metrics[ltype] += 1\n\n        for sem_tag in list(sem_tag_type_checks):\n            
semantic_tag_metrics[sem_tag] += 1\n\n        if len(prim.input_types) > 1:\n            uses_multi_input += 1\n\n        # checks if number_output_features is set as an instance variable or set as a constant\n        if (\n            \"self.number_output_features =\" in getsource(prim.__init__)\n            or prim.number_output_features > 1\n        ):\n            uses_multi_output += 1\n        unique_output_types.add(str(prim.return_type))\n\n        if hasattr(prim, \"filename\"):\n            uses_external_data += 1\n\n        if len(getfullargspec(prim.__init__).args) > 1:\n            are_controllable += 1\n\n    return {\n        \"general_metrics\": {\n            \"unique_input_types\": len(unique_input_types),\n            \"unique_output_types\": len(unique_output_types),\n            \"uses_multi_input\": uses_multi_input,\n            \"uses_multi_output\": uses_multi_output,\n            \"uses_external_data\": uses_external_data,\n            \"are_controllable\": are_controllable,\n        },\n        \"logical_type_input_metrics\": logical_type_metrics,\n        \"semantic_tag_metrics\": semantic_tag_metrics,\n    }\n\n\ndef _check_input_types(\n    input_types: List[ColumnSchema],\n    log_in_type_checks: set,\n    sem_tag_type_checks: set,\n    unique_input_types: set,\n):\n    \"\"\"Checks if any logical types or semantic tags occur in a list of Woodwork input types and keeps track of unique input types.\"\"\"\n    for in_type in input_types:\n        if in_type.semantic_tags:\n            for sem_tag in in_type.semantic_tags:\n                sem_tag_type_checks.add(sem_tag)\n        if in_type.logical_type:\n            log_in_type_checks.add(in_type.logical_type.type_string)\n        unique_input_types.add(str(in_type))\n\n\ndef _get_names_primitives(primitive_func):\n    names = []\n    primitives = []\n    valid_inputs = []\n    return_type = []\n    for name, primitive in primitive_func().items():\n        names.append(name)\n        
primitives.append(primitive)\n        input_types = _get_unique_input_types(primitive.input_types)\n        valid_inputs.append(\", \".join(input_types))\n        return_type.append(\n            str(primitive.return_type),\n        ) if primitive.return_type is not None else return_type.append(None)\n    return names, primitives, valid_inputs, return_type\n\n\ndef _get_unique_input_types(input_types):\n    types = set()\n    for input_type in input_types:\n        if isinstance(input_type, list):\n            types |= _get_unique_input_types(input_type)\n        else:\n            types.add(str(input_type))\n    return types\n\n\ndef list_primitive_files(directory):\n    \"\"\"returns list of files in directory that might contain primitives\"\"\"\n    files = os.listdir(directory)\n    keep = []\n    for path in files:\n        if not check_valid_primitive_path(path):\n            continue\n        keep.append(os.path.join(directory, path))\n    return keep\n\n\ndef check_valid_primitive_path(path):\n    if os.path.isdir(path):\n        return False\n\n    filename = os.path.basename(path)\n\n    if filename[:2] == \"__\" or filename[0] == \".\" or filename[-3:] != \".py\":\n        return False\n\n    return True\n\n\ndef load_primitive_from_file(filepath):\n    \"\"\"load primitive objects in a file\"\"\"\n    module = os.path.basename(filepath)[:-3]\n    # TODO: what is the first argument\"?\n    spec = importlib.util.spec_from_file_location(module, filepath)\n    module = importlib.util.module_from_spec(spec)\n    spec.loader.exec_module(module)\n\n    primitives = []\n    for primitive_name in vars(module):\n        primitive_class = getattr(module, primitive_name)\n        if (\n            isclass(primitive_class)\n            and issubclass(primitive_class, PrimitiveBase)\n            and primitive_class not in (AggregationPrimitive, TransformPrimitive)\n        ):\n            primitives.append((primitive_name, primitive_class))\n\n    if len(primitives) 
== 0:\n        raise RuntimeError(\"No primitive defined in file %s\" % filepath)\n    elif len(primitives) > 1:\n        raise RuntimeError(\"More than one primitive defined in file %s\" % filepath)\n\n    return primitives[0]\n\n\ndef serialize_primitive(primitive: PrimitiveBase):\n    \"\"\"build a dictionary with the data necessary to construct the given primitive\"\"\"\n    args_dict = {name: val for name, val in primitive.get_arguments()}\n    cls = type(primitive)\n    if cls == NumberOfCommonWords and \"word_set\" in args_dict:\n        args_dict[\"word_set\"] = list(args_dict[\"word_set\"])\n    return {\n        \"type\": cls.__name__,\n        \"module\": cls.__module__,\n        \"arguments\": args_dict,\n    }\n\n\nclass PrimitivesDeserializer(object):\n    \"\"\"\n    This class wraps a cache and a generator which iterates over all primitive\n    classes. When deserializing a primitive if it is not in the cache then we\n    iterate until it is found, adding every seen class to the cache. When\n    deserializing the next primitive the iteration resumes where it left off. 
This\n    means that we never visit a class more than once.\n    \"\"\"\n\n    def __init__(self):\n        # Cache to avoid repeatedly searching for primitive class\n        # (class_name, module_name) -> class\n        self.class_cache = {}\n\n        self.primitive_classes = find_descendents(PrimitiveBase)\n\n    def deserialize_primitive(self, primitive_dict):\n        \"\"\"\n        Construct a primitive from the given dictionary (output from\n        serialize_primitive).\n        \"\"\"\n        class_name = primitive_dict[\"type\"]\n        module_name = primitive_dict[\"module\"]\n        class_cache_key = (class_name, module_name.split(\".\")[0])\n\n        if class_cache_key in self.class_cache:\n            cls = self.class_cache[class_cache_key]\n        else:\n            cls = self._find_class_in_descendants(class_cache_key)\n\n        if not cls:\n            raise RuntimeError(\n                'Primitive \"%s\" in module \"%s\" not found' % (class_name, module_name),\n            )\n        arguments = primitive_dict[\"arguments\"]\n        if cls == NumberOfCommonWords and \"word_set\" in arguments:\n            # We converted word_set from a set to a list to make it serializable,\n            # we should convert it back now.\n            arguments[\"word_set\"] = set(arguments[\"word_set\"])\n        primitive_instance = cls(**arguments)\n\n        return primitive_instance\n\n    def _find_class_in_descendants(self, search_key):\n        for cls in self.primitive_classes:\n            cls_key = (cls.__name__, cls.__module__.split(\".\")[0])\n            self.class_cache[cls_key] = cls\n\n            if cls_key == search_key:\n                return cls\n\n\ndef get_all_logical_type_names():\n    \"\"\"Helper function that returns all registered woodwork logical types\"\"\"\n    return {lt.__name__: lt for lt in type_system.registered_types}\n"
  },
  {
    "path": "featuretools/selection/__init__.py",
    "content": "# flake8: noqa\nfrom featuretools.selection.api import *\n"
  },
  {
    "path": "featuretools/selection/api.py",
    "content": "# flake8: noqa\nfrom featuretools.selection.selection import *\n"
  },
  {
    "path": "featuretools/selection/selection.py",
    "content": "import pandas as pd\nfrom woodwork.logical_types import Boolean, BooleanNullable\n\n\ndef remove_low_information_features(feature_matrix, features=None):\n    \"\"\"Select features that have at least 2 unique values and that are not all null\n\n    Args:\n        feature_matrix (:class:`pd.DataFrame`): DataFrame whose columns are feature names and rows are instances\n        features (list[:class:`featuretools.FeatureBase`] or list[str], optional): List of features to select\n\n    Returns:\n        (feature_matrix, features)\n\n    \"\"\"\n    keep = [\n        c\n        for c in feature_matrix\n        if (\n            feature_matrix[c].nunique(dropna=False) > 1\n            and feature_matrix[c].dropna().shape[0] > 0\n        )\n    ]\n    feature_matrix = feature_matrix[keep]\n    if features is not None:\n        features = [f for f in features if f.get_name() in feature_matrix.columns]\n        return feature_matrix, features\n    return feature_matrix\n\n\ndef remove_highly_null_features(feature_matrix, features=None, pct_null_threshold=0.95):\n    \"\"\"\n    Removes columns from a feature matrix that have higher than a set threshold\n    of null values.\n\n    Args:\n        feature_matrix (:class:`pd.DataFrame`): DataFrame whose columns are feature names and rows are instances.\n        features (list[:class:`featuretools.FeatureBase`] or list[str], optional): List of features to select.\n        pct_null_threshold (float): If the percentage of NaN values in an input feature exceeds this amount,\n                that feature will be considered highly-null. Defaults to 0.95.\n\n    Returns:\n        pd.DataFrame, list[:class:`.FeatureBase`]:\n            The feature matrix and the list of generated feature definitions. 
Matches dfs output.\n            If no feature list is provided as input, the feature list will not be returned.\n    \"\"\"\n    if pct_null_threshold < 0 or pct_null_threshold > 1:\n        raise ValueError(\n            \"pct_null_threshold must be a float between 0 and 1, inclusive.\",\n        )\n\n    percent_null_by_col = (feature_matrix.isnull().mean()).to_dict()\n\n    if pct_null_threshold == 0.0:\n        keep = [\n            f_name\n            for f_name, pct_null in percent_null_by_col.items()\n            if pct_null <= pct_null_threshold\n        ]\n    else:\n        keep = [\n            f_name\n            for f_name, pct_null in percent_null_by_col.items()\n            if pct_null < pct_null_threshold\n        ]\n\n    return _apply_feature_selection(keep, feature_matrix, features)\n\n\ndef remove_single_value_features(\n    feature_matrix,\n    features=None,\n    count_nan_as_value=False,\n):\n    \"\"\"Removes columns in feature matrix where all the values are the same.\n\n    Args:\n        feature_matrix (:class:`pd.DataFrame`): DataFrame whose columns are feature names and rows are instances.\n        features (list[:class:`featuretools.FeatureBase`] or list[str], optional): List of features to select.\n        count_nan_as_value (bool): If True, missing values will be counted as their own unique value.\n                    If set to False, a feature that has one unique value and all other\n                    data missing will be removed from the feature matrix. 
Defaults to False.\n\n     Returns:\n        pd.DataFrame, list[:class:`.FeatureBase`]:\n            The feature matrix and the list of generated feature definitions.\n            Matches dfs output.\n            If no feature list is provided as input, the feature list will not be returned.\n    \"\"\"\n    unique_counts_by_col = feature_matrix.nunique(\n        dropna=not count_nan_as_value,\n    ).to_dict()\n\n    keep = [\n        f_name\n        for f_name, unique_count in unique_counts_by_col.items()\n        if unique_count > 1\n    ]\n    return _apply_feature_selection(keep, feature_matrix, features)\n\n\ndef remove_highly_correlated_features(\n    feature_matrix,\n    features=None,\n    pct_corr_threshold=0.95,\n    features_to_check=None,\n    features_to_keep=None,\n):\n    \"\"\"Removes columns in feature matrix that are highly correlated with another column.\n\n    Note:\n        We make the assumption that, for a pair of features, the feature that is further\n        right in the feature matrix produced by ``dfs`` is the more complex one.\n        The assumption does not hold if the order of columns in the feature\n        matrix has changed from what ``dfs`` produces.\n\n    Args:\n        feature_matrix (:class:`pd.DataFrame`): DataFrame whose columns are feature\n                    names and rows are instances. If Woodwork is not initialized, will\n                    perform Woodwork initialization, which may result in slightly different\n                    types than those in the original feature matrix created by Featuretools.\n        features (list[:class:`featuretools.FeatureBase`] or list[str], optional):\n                    List of features to select.\n        pct_corr_threshold (float): The correlation threshold to be considered highly\n                    correlated. Defaults to 0.95.\n        features_to_check (list[str], optional): List of column names to check\n                    whether any pairs are highly correlated. 
Will not check any\n                    other columns, meaning the only columns that can be removed\n                    are in this list. If None, defaults to checking all columns.\n        features_to_keep (list[str], optional): List of column names to keep even\n                    if correlated to another column. If None, all columns will be\n                    candidates for removal.\n\n    Returns:\n        pd.DataFrame, list[:class:`.FeatureBase`]:\n            The feature matrix and the list of generated feature definitions.\n            Matches dfs output. If no feature list is provided as input,\n            the feature list will not be returned. For consistent results,\n            do not change the order of features output by dfs.\n    \"\"\"\n    if feature_matrix.ww.schema is None:\n        feature_matrix.ww.init()\n\n    if pct_corr_threshold < 0 or pct_corr_threshold > 1:\n        raise ValueError(\n            \"pct_corr_threshold must be a float between 0 and 1, inclusive.\",\n        )\n\n    if features_to_check is None:\n        features_to_check = list(feature_matrix.columns)\n    else:\n        for f_name in features_to_check:\n            assert (\n                f_name in feature_matrix.columns\n            ), \"feature named {} is not in feature matrix\".format(f_name)\n\n    if features_to_keep is None:\n        features_to_keep = []\n\n    to_select = [\"numeric\", Boolean, BooleanNullable]\n    fm = feature_matrix.ww[features_to_check]\n    fm_to_check = fm.ww.select(include=to_select)\n\n    dropped = set()\n    columns_to_check = fm_to_check.columns\n    # When two features are found to be highly correlated,\n    # we drop the more complex feature\n    # Columns produced later in dfs are more complex\n    for i in range(len(columns_to_check) - 1, 0, -1):\n        more_complex_name = columns_to_check[i]\n        more_complex_col = fm_to_check[more_complex_name]\n\n        # Convert boolean or Int64 column to be float64\n        if 
pd.api.types.is_bool_dtype(more_complex_col) or isinstance(\n            more_complex_col.dtype,\n            pd.Int64Dtype,\n        ):\n            more_complex_col = more_complex_col.astype(\"float64\")\n\n        for j in range(i - 1, -1, -1):\n            less_complex_name = columns_to_check[j]\n            less_complex_col = fm_to_check[less_complex_name]\n\n            # Convert boolean or Int64 column to be float64\n            if pd.api.types.is_bool_dtype(less_complex_col) or isinstance(\n                less_complex_col.dtype,\n                pd.Int64Dtype,\n            ):\n                less_complex_col = less_complex_col.astype(\"float64\")\n\n            if abs(more_complex_col.corr(less_complex_col)) >= pct_corr_threshold:\n                dropped.add(more_complex_name)\n                break\n\n    keep = [\n        f_name\n        for f_name in feature_matrix.columns\n        if (f_name in features_to_keep or f_name not in dropped)\n    ]\n    return _apply_feature_selection(keep, feature_matrix, features)\n\n\ndef _apply_feature_selection(keep, feature_matrix, features=None):\n    new_matrix = feature_matrix[keep]\n    new_feature_names = set(new_matrix.columns)\n\n    if features is not None:\n        new_features = []\n        for f in features:\n            if f.number_output_features > 1:\n                slices = [\n                    f[i]\n                    for i in range(f.number_output_features)\n                    if f[i].get_name() in new_feature_names\n                ]\n                if len(slices) == f.number_output_features:\n                    new_features.append(f)\n                else:\n                    new_features.extend(slices)\n            else:\n                if f.get_name() in new_feature_names:\n                    new_features.append(f)\n\n        return new_matrix, new_features\n    return new_matrix\n"
  },
  {
    "path": "featuretools/synthesis/__init__.py",
    "content": "# flake8: noqa\nfrom featuretools.synthesis.api import *\n"
  },
  {
    "path": "featuretools/synthesis/api.py",
    "content": "# flake8: noqa\nfrom featuretools.synthesis.deep_feature_synthesis import DeepFeatureSynthesis\nfrom featuretools.synthesis.dfs import dfs\nfrom featuretools.synthesis.encode_features import encode_features\nfrom featuretools.synthesis.get_valid_primitives import get_valid_primitives\n"
  },
  {
    "path": "featuretools/synthesis/deep_feature_synthesis.py",
    "content": "import functools\nimport logging\nimport operator\nimport warnings\nfrom collections import defaultdict\nfrom typing import Any, DefaultDict, Dict, List, Tuple, Type\n\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Boolean, BooleanNullable\n\nfrom featuretools import primitives\nfrom featuretools.entityset.entityset import LTI_COLUMN_NAME\nfrom featuretools.entityset.relationship import RelationshipPath\nfrom featuretools.feature_base import (\n    AggregationFeature,\n    DirectFeature,\n    FeatureBase,\n    GroupByTransformFeature,\n    IdentityFeature,\n    TransformFeature,\n)\nfrom featuretools.feature_base.cache import CacheType, feature_cache\nfrom featuretools.feature_base.utils import is_valid_input\nfrom featuretools.primitives.base import (\n    AggregationPrimitive,\n    PrimitiveBase,\n    TransformPrimitive,\n)\nfrom featuretools.primitives.options_utils import (\n    filter_groupby_matches_by_options,\n    filter_matches_by_options,\n    generate_all_primitive_options,\n    ignore_dataframe_for_primitive,\n)\nfrom featuretools.utils.gen_utils import camel_and_title_to_snake\n\nlogger = logging.getLogger(\"featuretools\")\n\n\nclass DeepFeatureSynthesis(object):\n    \"\"\"Automatically produce features for a target dataframe in an Entityset.\n\n    Args:\n        target_dataframe_name (str): Name of dataframe for which to build features.\n\n        entityset (EntitySet): Entityset for which to build features.\n\n        agg_primitives (list[str or :class:`.primitives.`], optional):\n            list of Aggregation Feature types to apply.\n\n            Default: [\"sum\", \"std\", \"max\", \"skew\", \"min\", \"mean\", \"count\", \"percent_true\", \"num_unique\", \"mode\"]\n\n        trans_primitives (list[str or :class:`.primitives.TransformPrimitive`], optional):\n            list of Transform primitives to use.\n\n            Default: [\"day\", \"year\", \"month\", \"weekday\", \"haversine\", 
\"num_words\", \"num_characters\"]\n\n        where_primitives (list[str or :class:`.primitives.PrimitiveBase`], optional):\n            only add where clauses to these types of Primitives\n\n            Default:\n\n                [\"count\"]\n\n        groupby_trans_primitives (list[str or :class:`.primitives.TransformPrimitive`], optional):\n            list of Transform primitives to make GroupByTransformFeatures with\n\n        max_depth (int, optional) : maximum allowed depth of features.\n            Default: 2. If -1, no limit.\n\n        max_features (int, optional) : Cap the number of generated features to\n            this number. If -1, no limit.\n\n        allowed_paths (list[list[str]], optional): Allowed dataframe paths to make\n            features for. If None, use all paths.\n\n        ignore_dataframes (list[str], optional): List of dataframes to\n            blacklist when creating features. If None, use all dataframes.\n\n        ignore_columns (dict[str -> list[str]], optional): List of specific\n            columns within each dataframe to blacklist when creating features.\n            If None, use all columns.\n\n        seed_features (list[:class:`.FeatureBase`], optional): List of manually\n            defined features to use.\n\n        drop_contains (list[str], optional): Drop features\n            that contains these strings in name.\n\n        drop_exact (list[str], optional): Drop features that\n            exactly match these strings in name.\n\n        where_stacking_limit (int, optional): Cap the depth of the where features.\n            Default: 1\n\n        primitive_options (dict[str or tuple[str] or PrimitiveBase -> dict or list[dict]], optional):\n            Specify options for a single primitive or a group of primitives.\n            Lists of option dicts are used to specify options per input for primitives\n            with multiple inputs. 
Each option ``dict`` can have the following keys:\n\n\n            ``\"include_dataframes\"``\n                List of dataframes to be included when creating features for\n                the primitive(s). All other dataframes will be ignored\n                (list[str]).\n            ``\"ignore_dataframes\"``\n                List of dataframes to be blacklisted when creating features\n                for the primitive(s) (list[str]).\n            ``\"include_columns\"``\n                List of specific columns within each dataframe to include when\n                creating features for the primitive(s). All other columns\n                in a given dataframe will be ignored (dict[str -> list[str]]).\n            ``\"ignore_columns\"``\n                List of specific columns within each dataframe to blacklist\n                when creating features for the primitive(s) (dict[str ->\n                list[str]]).\n            ``\"include_groupby_dataframes\"``\n                List of dataframes to be included when finding groupbys. All\n                other dataframes will be ignored (list[str]).\n            ``\"ignore_groupby_dataframes\"``\n                List of dataframes to blacklist when finding groupbys\n                (list[str]).\n            ``\"include_groupby_columns\"``\n                List of specific columns within each dataframe to include as\n                groupbys, if applicable. 
All other columns in each\n                dataframe will be ignored (dict[str -> list[str]]).\n            ``\"ignore_groupby_columns\"``\n                List of specific columns within each dataframe to blacklist\n                as groupbys (dict[str -> list[str]]).\n    \"\"\"\n\n    def __init__(\n        self,\n        target_dataframe_name,\n        entityset,\n        agg_primitives=None,\n        trans_primitives=None,\n        where_primitives=None,\n        groupby_trans_primitives=None,\n        max_depth=2,\n        max_features=-1,\n        allowed_paths=None,\n        ignore_dataframes=None,\n        ignore_columns=None,\n        primitive_options=None,\n        seed_features=None,\n        drop_contains=None,\n        drop_exact=None,\n        where_stacking_limit=1,\n    ):\n        if target_dataframe_name not in entityset.dataframe_dict:\n            es_name = entityset.id or \"entity set\"\n            msg = \"Provided target dataframe %s does not exist in %s\" % (\n                target_dataframe_name,\n                es_name,\n            )\n            raise KeyError(msg)\n\n        # Multiple calls to dfs() should start with a fresh cache\n        feature_cache.clear_all()\n        feature_cache.enabled = True\n\n        # need to change max_depth to None because DFs terminates when  <0\n        if max_depth == -1:\n            max_depth = None\n\n        # if just one dataframe, set max depth to 1 (transform stacking rule)\n        if len(entityset.dataframe_dict) == 1 and (max_depth is None or max_depth > 1):\n            warnings.warn(\n                \"Only one dataframe in entityset, changing max_depth to \"\n                \"1 since deeper features cannot be created\",\n            )\n            max_depth = 1\n\n        self.max_depth = max_depth\n\n        self.max_features = max_features\n\n        self.allowed_paths = allowed_paths\n        if self.allowed_paths:\n            self.allowed_paths = set()\n            for path in 
allowed_paths:\n                self.allowed_paths.add(tuple(path))\n\n        if ignore_dataframes is None:\n            self.ignore_dataframes = set()\n        else:\n            if not isinstance(ignore_dataframes, list):\n                raise TypeError(\"ignore_dataframes must be a list\")\n            assert (\n                target_dataframe_name not in ignore_dataframes\n            ), \"Can't ignore target_dataframe!\"\n            self.ignore_dataframes = set(ignore_dataframes)\n\n        self.ignore_columns = _build_ignore_columns(ignore_columns)\n        self.target_dataframe_name = target_dataframe_name\n        self.es = entityset\n\n        aggregation_primitive_dict = primitives.get_aggregation_primitives()\n        transform_primitive_dict = primitives.get_transform_primitives()\n        if agg_primitives is None:\n            agg_primitives = primitives.get_default_aggregation_primitives()\n        self.agg_primitives = sorted(\n            [\n                check_primitive(\n                    p,\n                    \"aggregation\",\n                    aggregation_primitive_dict,\n                    transform_primitive_dict,\n                )\n                for p in agg_primitives\n            ],\n        )\n\n        if trans_primitives is None:\n            trans_primitives = primitives.get_default_transform_primitives()\n\n        self.trans_primitives = sorted(\n            [\n                check_primitive(\n                    p,\n                    \"transform\",\n                    aggregation_primitive_dict,\n                    transform_primitive_dict,\n                )\n                for p in trans_primitives\n            ],\n        )\n\n        if where_primitives is None:\n            where_primitives = [primitives.Count]\n        self.where_primitives = sorted(\n            [\n                check_primitive(\n                    p,\n                    \"where\",\n                    aggregation_primitive_dict,\n   
                 transform_primitive_dict,\n                )\n                for p in where_primitives\n            ],\n        )\n\n        if groupby_trans_primitives is None:\n            groupby_trans_primitives = []\n        self.groupby_trans_primitives = sorted(\n            [\n                check_primitive(\n                    p,\n                    \"groupby transform\",\n                    aggregation_primitive_dict,\n                    transform_primitive_dict,\n                )\n                for p in groupby_trans_primitives\n            ],\n        )\n\n        if primitive_options is None:\n            primitive_options = {}\n        all_primitives = (\n            self.trans_primitives\n            + self.agg_primitives\n            + self.where_primitives\n            + self.groupby_trans_primitives\n        )\n\n        (\n            self.primitive_options,\n            self.ignore_dataframes,\n            self.ignore_columns,\n        ) = generate_all_primitive_options(\n            all_primitives,\n            primitive_options,\n            self.ignore_dataframes,\n            self.ignore_columns,\n            self.es,\n        )\n        self.seed_features = sorted(seed_features or [], key=lambda f: f.unique_name())\n        self.drop_exact = drop_exact or []\n        self.drop_contains = drop_contains or []\n        self.where_stacking_limit = where_stacking_limit\n\n    def build_features(self, return_types=None, verbose=False):\n        \"\"\"Automatically builds feature definitions for target\n            dataframe using Deep Feature Synthesis algorithm\n\n        Args:\n            return_types (list[woodwork.ColumnSchema] or str, optional):\n                List of ColumnSchemas defining the types of\n                columns to return. If None, defaults to returning all\n                numeric, categorical and boolean types. 
If given as\n                the string 'all', use all available return types.\n\n            verbose (bool, optional): If True, print progress.\n\n        Returns:\n            list[BaseFeature]: Returns a list of\n                features for target dataframe, sorted by feature depth\n                (shallow first).\n        \"\"\"\n        all_features = {}\n\n        self.where_clauses = defaultdict(set)\n\n        if return_types is None:\n            return_types = [\n                ColumnSchema(semantic_tags=[\"numeric\"]),\n                ColumnSchema(semantic_tags=[\"category\"]),\n                ColumnSchema(logical_type=Boolean),\n                ColumnSchema(logical_type=BooleanNullable),\n            ]\n        elif return_types == \"all\":\n            pass\n        else:\n            msg = \"return_types must be a list, or 'all'\"\n            assert isinstance(return_types, list), msg\n\n        self._run_dfs(\n            self.es[self.target_dataframe_name],\n            RelationshipPath([]),\n            all_features,\n            max_depth=self.max_depth,\n        )\n\n        new_features = list(all_features[self.target_dataframe_name].values())\n\n        def filt(f):\n            # remove identity features of the ID field of the target dataframe\n            if (\n                isinstance(f, IdentityFeature)\n                and f.dataframe_name == self.target_dataframe_name\n                and f.column_name == self.es[self.target_dataframe_name].ww.index\n            ):\n                return False\n\n            return True\n\n        # filter out features with undesired return types\n        if return_types != \"all\":\n            new_features = [\n                f\n                for f in new_features\n                if any(\n                    True\n                    for schema in return_types\n                    if is_valid_input(f.column_schema, schema)\n                )\n            ]\n        new_features = 
list(filter(filt, new_features))\n\n        new_features.sort(key=lambda f: f.get_depth())\n\n        new_features = self._filter_features(new_features)\n\n        if self.max_features > 0:\n            new_features = new_features[: self.max_features]\n\n        if verbose:\n            print(\"Built {} features\".format(len(new_features)))\n            verbose = None\n        return new_features\n\n    def _filter_features(self, features):\n        assert isinstance(self.drop_exact, list), \"drop_exact must be a list\"\n        assert isinstance(self.drop_contains, list), \"drop_contains must be a list\"\n        f_keep = []\n        for f in features:\n            keep = True\n            for contains in self.drop_contains:\n                if contains in f.get_name():\n                    keep = False\n                    break\n\n            if f.get_name() in self.drop_exact:\n                keep = False\n\n            if keep:\n                f_keep.append(f)\n\n        return f_keep\n\n    def _run_dfs(self, dataframe, relationship_path, all_features, max_depth):\n        \"\"\"\n        Create features for the provided dataframe\n\n        Args:\n            dataframe (DataFrame): Dataframe for which to create features.\n            relationship_path (RelationshipPath): The path to this dataframe.\n            all_features (dict[dataframe name -> dict[str -> BaseFeature]]):\n                Dict containing a dict for each dataframe. 
Each nested dict\n                has features as values with their ids as keys.\n            max_depth (int) : Maximum allowed depth of features.\n        \"\"\"\n        if max_depth is not None and max_depth < 0:\n            return\n\n        all_features[dataframe.ww.name] = {}\n\n        \"\"\"\n        Step 1 - Create identity features\n        \"\"\"\n        self._add_identity_features(all_features, dataframe)\n\n        \"\"\"\n        Step 2 - Recursively build features for each dataframe in a backward relationship\n        \"\"\"\n\n        backward_dataframes = self.es.get_backward_dataframes(dataframe.ww.name)\n        for b_dataframe_id, sub_relationship_path in backward_dataframes:\n            # Skip if we've already created features for this dataframe.\n            if b_dataframe_id in all_features:\n                continue\n\n            if b_dataframe_id in self.ignore_dataframes:\n                continue\n\n            new_path = relationship_path + sub_relationship_path\n            if (\n                self.allowed_paths\n                and tuple(new_path.dataframes()) not in self.allowed_paths\n            ):\n                continue\n\n            new_max_depth = None\n            if max_depth is not None:\n                new_max_depth = max_depth - 1\n            self._run_dfs(\n                dataframe=self.es[b_dataframe_id],\n                relationship_path=new_path,\n                all_features=all_features,\n                max_depth=new_max_depth,\n            )\n\n        \"\"\"\n        Step 3 - Create aggregation features for all deep backward relationships\n        \"\"\"\n\n        backward_dataframes = self.es.get_backward_dataframes(\n            dataframe.ww.name,\n            deep=True,\n        )\n        for b_dataframe_id, sub_relationship_path in backward_dataframes:\n            if b_dataframe_id in self.ignore_dataframes:\n                continue\n\n            new_path = relationship_path + 
sub_relationship_path\n            if (\n                self.allowed_paths\n                and tuple(new_path.dataframes()) not in self.allowed_paths\n            ):\n                continue\n\n            self._build_agg_features(\n                parent_dataframe=self.es[dataframe.ww.name],\n                child_dataframe=self.es[b_dataframe_id],\n                all_features=all_features,\n                max_depth=max_depth,\n                relationship_path=sub_relationship_path,\n            )\n\n        \"\"\"\n        Step 4 - Create transform features of identity and aggregation features\n        \"\"\"\n\n        self._build_transform_features(all_features, dataframe, max_depth=max_depth)\n\n        \"\"\"\n        Step 5 - Recursively build features for each dataframe in a forward relationship\n        \"\"\"\n\n        forward_dataframes = self.es.get_forward_dataframes(dataframe.ww.name)\n        for f_dataframe_id, sub_relationship_path in forward_dataframes:\n            # Skip if we've already created features for this dataframe.\n            if f_dataframe_id in all_features:\n                continue\n\n            if f_dataframe_id in self.ignore_dataframes:\n                continue\n\n            new_path = relationship_path + sub_relationship_path\n            if (\n                self.allowed_paths\n                and tuple(new_path.dataframes()) not in self.allowed_paths\n            ):\n                continue\n\n            new_max_depth = None\n            if max_depth is not None:\n                new_max_depth = max_depth - 1\n            self._run_dfs(\n                dataframe=self.es[f_dataframe_id],\n                relationship_path=new_path,\n                all_features=all_features,\n                max_depth=new_max_depth,\n            )\n\n        \"\"\"\n        Step 6 - Create direct features for forward relationships\n        \"\"\"\n\n        forward_dataframes = self.es.get_forward_dataframes(dataframe.ww.name)\n 
       for f_dataframe_id, sub_relationship_path in forward_dataframes:\n            if f_dataframe_id in self.ignore_dataframes:\n                continue\n\n            new_path = relationship_path + sub_relationship_path\n            if (\n                self.allowed_paths\n                and tuple(new_path.dataframes()) not in self.allowed_paths\n            ):\n                continue\n\n            self._build_forward_features(\n                all_features=all_features,\n                relationship_path=sub_relationship_path,\n                max_depth=max_depth,\n            )\n\n        \"\"\"\n        Step 7 - Create transform features of direct features\n        \"\"\"\n\n        self._build_transform_features(\n            all_features,\n            dataframe,\n            max_depth=max_depth,\n            require_direct_input=True,\n        )\n\n        # now that all  features are added, build where clauses\n        self._build_where_clauses(all_features, dataframe)\n\n    def _handle_new_feature(self, new_feature, all_features):\n        \"\"\"Adds new feature to the dict\n\n        Args:\n            new_feature (:class:`.FeatureBase`): New feature being\n                checked.\n            all_features (dict[dataframe name -> dict[str -> BaseFeature]]):\n                Dict containing a dict for each dataframe. 
Each nested dict\n                has features as values with their ids as keys.\n\n        Returns:\n            dict[PrimitiveBase -> dict[feature id -> feature]]: Dict of\n                features with any new features.\n\n        Raises:\n            Exception: Attempted to add a single feature multiple times\n        \"\"\"\n        dataframe_name = new_feature.dataframe_name\n        name = new_feature.unique_name()\n\n        # Warn if this feature is already present, and it is not a seed feature.\n        # It is expected that a seed feature could also be generated by dfs.\n        if name in all_features[dataframe_name] and name not in (\n            f.unique_name() for f in self.seed_features\n        ):\n            logger.warning(\n                \"Attempting to add feature %s which is already \"\n                \"present. This is likely a bug.\" % new_feature,\n            )\n            return\n\n        all_features[dataframe_name][name] = new_feature\n\n    def _add_identity_features(self, all_features, dataframe):\n        \"\"\"converts all columns from the given dataframe into features\n\n        Args:\n            all_features (dict[dataframe name -> dict[str -> BaseFeature]]):\n                Dict containing a dict for each dataframe. 
Each nested dict\n                has features as values with their ids as keys.\n            dataframe (DataFrame): DataFrame to calculate features for.\n        \"\"\"\n        for col in dataframe.columns:\n            if col in self.ignore_columns[dataframe.ww.name] or col == LTI_COLUMN_NAME:\n                continue\n            new_f = IdentityFeature(self.es[dataframe.ww.name].ww[col])\n            self._handle_new_feature(all_features=all_features, new_feature=new_f)\n\n        # add seed features, if any, for dfs to build on top of\n        # if there are any multi output features, this will build on\n        # top of each output of the feature.\n        for f in self.seed_features:\n            if f.dataframe_name == dataframe.ww.name:\n                self._handle_new_feature(all_features=all_features, new_feature=f)\n\n    def _build_where_clauses(self, all_features, dataframe):\n        \"\"\"Traverses all identity features and creates a Compare for\n            each one, based on some heuristics\n\n        Args:\n            all_features (dict[dataframe name -> dict[str -> BaseFeature]]):\n                Dict containing a dict for each dataframe. 
Each nested dict\n                has features as values with their ids as keys.\n          dataframe (DataFrame): DataFrame to calculate features for.\n        \"\"\"\n\n        def is_valid_feature(f):\n            if isinstance(f, IdentityFeature):\n                return True\n            if isinstance(f, DirectFeature) and getattr(\n                f.base_features[0],\n                \"column_name\",\n                None,\n            ):\n                return True\n            return False\n\n        for feat in [\n            f for f in all_features[dataframe.ww.name].values() if is_valid_feature(f)\n        ]:\n            # Get interesting_values from the EntitySet that was passed, which\n            # is assumed to be the most recent version of the EntitySet.\n            # Features can contain a stale EntitySet reference without\n            # interesting_values\n            if isinstance(feat, DirectFeature):\n                df = feat.base_features[0].dataframe_name\n                col = feat.base_features[0].column_name\n            else:\n                df = feat.dataframe_name\n                col = feat.column_name\n            metadata = self.es[df].ww.columns[col].metadata\n            interesting_values = metadata.get(\"interesting_values\")\n            if interesting_values:\n                for val in interesting_values:\n                    self.where_clauses[dataframe.ww.name].add(feat == val)\n\n    def _build_transform_features(\n        self,\n        all_features,\n        dataframe,\n        max_depth=0,\n        require_direct_input=False,\n    ):\n        \"\"\"Creates trans_features for all the columns in a dataframe\n\n        Args:\n            all_features (dict[dataframe name: dict->[str->:class:`BaseFeature`]]):\n                Dict containing a dict for each dataframe. 
Each nested dict\n                has features as values with their ids as keys\n\n          dataframe (DataFrame): DataFrame to calculate features for.\n        \"\"\"\n\n        new_max_depth = None\n        if max_depth is not None:\n            new_max_depth = max_depth - 1\n\n        # Keep track of features to add until the end to avoid applying\n        # transform primitives to features that were also built by transform primitives\n        features_to_add = []\n\n        for trans_prim in self.trans_primitives:\n            current_options = self.primitive_options.get(\n                trans_prim,\n                self.primitive_options.get(trans_prim.name),\n            )\n            if ignore_dataframe_for_primitive(current_options, dataframe):\n                continue\n\n            input_types = trans_prim.input_types\n\n            matching_inputs = self._get_matching_inputs(\n                all_features,\n                dataframe,\n                new_max_depth,\n                input_types,\n                trans_prim,\n                current_options,\n                require_direct_input=require_direct_input,\n                feature_filter=not_a_transform_input,\n            )\n\n            for matching_input in matching_inputs:\n                if not can_stack_primitive_on_inputs(trans_prim, matching_input):\n                    continue\n                if not any(\n                    True for bf in matching_input if bf.number_output_features != 1\n                ):\n                    new_f = TransformFeature(matching_input, primitive=trans_prim)\n                    features_to_add.append(new_f)\n\n        for groupby_prim in self.groupby_trans_primitives:\n            current_options = self.primitive_options.get(\n                groupby_prim,\n                self.primitive_options.get(groupby_prim.name),\n            )\n            if ignore_dataframe_for_primitive(current_options, dataframe, groupby=True):\n                
continue\n            input_types = groupby_prim.input_types[:]\n            matching_inputs = self._get_matching_inputs(\n                all_features,\n                dataframe,\n                new_max_depth,\n                input_types,\n                groupby_prim,\n                current_options,\n                feature_filter=not_a_transform_input,\n            )\n\n            # get columns to use as groupbys, use IDs as default unless other groupbys specified\n            if any(\n                True\n                for option in current_options\n                if dataframe.ww.name in option.get(\"include_groupby_columns\", [])\n            ):\n                column_schemas = \"all\"\n            else:\n                column_schemas = [ColumnSchema(semantic_tags=[\"foreign_key\"])]\n            groupby_matches = self._features_by_type(\n                all_features=all_features,\n                dataframe=dataframe,\n                max_depth=new_max_depth,\n                column_schemas=column_schemas,\n            )\n            groupby_matches = filter_groupby_matches_by_options(\n                groupby_matches,\n                current_options,\n            )\n\n            for matching_input in matching_inputs:\n                if not can_stack_primitive_on_inputs(groupby_prim, matching_input):\n                    continue\n                if any(True for bf in matching_input if bf.number_output_features != 1):\n                    continue\n                if require_direct_input:\n                    if any_direct_in_matching_input := any(\n                        isinstance(bf, DirectFeature) for bf in matching_input\n                    ):\n                        all_direct_and_same_path_in_matching_input = (\n                            _all_direct_and_same_path(matching_input)\n                        )\n                for groupby in groupby_matches:\n                    if require_direct_input:\n                        # If 
require_direct_input, require a DirectFeature in input or as a\n                        # groupby, and don't create features of inputs/groupbys which are\n                        # all direct features with the same relationship path\n                        #\n                        # If we require_direct_input, we skip Feature generation\n                        # in the following two cases:\n                        # (1) --> There are no DirectFeatures in the matching input,\n                        #         and groupby is not a DirectFeature\n                        # (2) --> All of the matching input and groupby are DirectFeatures\n                        #         with the same relationship path\n                        groupby_is_direct = isinstance(groupby[0], DirectFeature)\n                        # Checks case (1)\n                        if not any_direct_in_matching_input:\n                            if not groupby_is_direct:\n                                continue\n                        elif all_direct_and_same_path_in_matching_input:\n                            # Checks case (2)\n                            if (\n                                groupby_is_direct\n                                and groupby[0].relationship_path\n                                == matching_input[0].relationship_path\n                            ):\n                                continue\n                    new_f = GroupByTransformFeature(\n                        list(matching_input),\n                        groupby=groupby[0],\n                        primitive=groupby_prim,\n                    )\n                    features_to_add.append(new_f)\n        for new_f in features_to_add:\n            self._handle_new_feature(all_features=all_features, new_feature=new_f)\n\n    def _build_forward_features(self, all_features, relationship_path, max_depth=0):\n        _, relationship = relationship_path[0]\n\n        child_dataframe_name = 
relationship.child_dataframe.ww.name\n        parent_dataframe = relationship.parent_dataframe\n\n        features = self._features_by_type(\n            all_features=all_features,\n            dataframe=parent_dataframe,\n            max_depth=max_depth,\n            column_schemas=\"all\",\n        )\n\n        for f in features:\n            if self._feature_in_relationship_path(relationship_path, f):\n                continue\n\n            # limits allowing direct features of agg_feats with where clauses\n            if isinstance(f, AggregationFeature):\n                deep_base_features = [f] + f.get_dependencies(deep=True)\n                for feat in deep_base_features:\n                    if isinstance(feat, AggregationFeature) and feat.where is not None:\n                        continue\n\n            new_f = DirectFeature(f, child_dataframe_name, relationship=relationship)\n\n            self._handle_new_feature(all_features=all_features, new_feature=new_f)\n\n    def _build_agg_features(\n        self,\n        all_features,\n        parent_dataframe,\n        child_dataframe,\n        max_depth,\n        relationship_path,\n    ):\n        new_max_depth = None\n        if max_depth is not None:\n            new_max_depth = max_depth - 1\n        for agg_prim in self.agg_primitives:\n            current_options = self.primitive_options.get(\n                agg_prim,\n                self.primitive_options.get(agg_prim.name),\n            )\n\n            if ignore_dataframe_for_primitive(current_options, child_dataframe):\n                continue\n\n            def feature_filter(f):\n                # Remove direct features of parent dataframe and features in relationship path.\n                return (\n                    not _direct_of_dataframe(f, parent_dataframe)\n                ) and not self._feature_in_relationship_path(relationship_path, f)\n\n            input_types = agg_prim.input_types\n            matching_inputs = 
self._get_matching_inputs(\n                all_features,\n                child_dataframe,\n                new_max_depth,\n                input_types,\n                agg_prim,\n                current_options,\n                feature_filter=feature_filter,\n            )\n\n            matching_inputs = filter_matches_by_options(\n                matching_inputs,\n                current_options,\n            )\n            wheres = list(self.where_clauses[child_dataframe.ww.name])\n\n            for matching_input in matching_inputs:\n                if not can_stack_primitive_on_inputs(agg_prim, matching_input):\n                    continue\n                new_f = AggregationFeature(\n                    matching_input,\n                    parent_dataframe_name=parent_dataframe.ww.name,\n                    relationship_path=relationship_path,\n                    primitive=agg_prim,\n                )\n\n                self._handle_new_feature(new_f, all_features)\n\n                # limit the stacking of where features\n                # count up the number of where features\n                # in this feature and its dependencies\n                feat_wheres = []\n                for f in matching_input:\n                    if isinstance(f, AggregationFeature) and f.where is not None:\n                        feat_wheres.append(f)\n                    for feat in f.get_dependencies(deep=True):\n                        if (\n                            isinstance(feat, AggregationFeature)\n                            and feat.where is not None\n                        ):\n                            feat_wheres.append(feat)\n\n                if len(feat_wheres) >= self.where_stacking_limit:\n                    continue\n\n                # limits the aggregation feature by the given allowed feature types.\n                if not any(\n                    True\n                    for primitive in self.where_primitives\n                    if 
issubclass(type(agg_prim), type(primitive))\n                ):\n                    continue\n\n                for where in wheres:\n                    # limits the where feats so they are different than base feats\n                    base_names = [f.unique_name() for f in new_f.base_features]\n                    if any(\n                        True\n                        for base_feat in where.base_features\n                        if base_feat.unique_name() in base_names\n                    ):\n                        continue\n\n                    new_f = AggregationFeature(\n                        matching_input,\n                        parent_dataframe_name=parent_dataframe.ww.name,\n                        relationship_path=relationship_path,\n                        where=where,\n                        primitive=agg_prim,\n                    )\n                    self._handle_new_feature(new_f, all_features)\n\n    def _features_by_type(\n        self,\n        all_features,\n        dataframe,\n        max_depth,\n        column_schemas=None,\n    ):\n        if max_depth is not None and max_depth < 0:\n            return []\n\n        if dataframe.ww.name not in all_features:\n            return []\n\n        def expand_features(feature) -> List[Any]:\n            \"\"\"Internal method to return either the single feature\n                or the output features\n\n            Args:\n                feature (Feature): Feature instance\n\n            Returns:\n                List[Any]: list of features\n            \"\"\"\n            outputs = feature.number_output_features\n            if outputs > 1:\n                return [feature[i] for i in range(outputs)]\n            return [feature]\n\n        # Build the complete list of features prior to processing\n        selected_features = [\n            expand_features(feature)\n            for feature in all_features[dataframe.ww.name].values()\n        ]\n        selected_features = 
functools.reduce(operator.iconcat, selected_features, [])\n\n        column_schemas = column_schemas if column_schemas else set()\n\n        if max_depth is None and column_schemas == \"all\":\n            return selected_features\n\n        # assigning seed_features locally adds a slight performance benefit by not having to look\n        # up the property for each round of the comprehension\n        seed_features = self.seed_features\n        if max_depth is not None:\n            selected_features = [\n                feature\n                for feature in selected_features\n                if get_feature_depth(feature, stop_at=seed_features) <= max_depth\n            ]\n\n        def valid_input(column_schema) -> bool:\n            \"\"\"Helper method to validate the feature schema\n               to the allowed column_schemas\n\n            Args:\n                column_schema (ColumnSchema): feature column schema\n\n            Returns:\n                bool: True if valid\n            \"\"\"\n            return any(\n                True\n                for schema in column_schemas\n                if is_valid_input(column_schema, schema)\n            )\n\n        if column_schemas and column_schemas != \"all\":\n            selected_features = [\n                feature\n                for feature in selected_features\n                if valid_input(feature.column_schema)\n            ]\n\n        return selected_features\n\n    def _feature_in_relationship_path(self, relationship_path, feature):\n        # must be identity feature to be in the relationship path\n        if not isinstance(feature, IdentityFeature):\n            return False\n\n        for _, relationship in relationship_path:\n            if (\n                relationship.child_name == feature.dataframe_name\n                and relationship._child_column_name == feature.column_name\n            ):\n                return True\n\n            if (\n                relationship.parent_name 
== feature.dataframe_name\n                and relationship._parent_column_name == feature.column_name\n            ):\n                return True\n\n        return False\n\n    def _get_matching_inputs(\n        self,\n        all_features,\n        dataframe,\n        max_depth,\n        input_types,\n        primitive,\n        primitive_options,\n        require_direct_input=False,\n        feature_filter=None,\n    ):\n        if not isinstance(input_types[0], list):\n            input_types = [input_types]\n        matching_inputs = []\n\n        for input_type in input_types:\n            features = self._features_by_type(\n                all_features=all_features,\n                dataframe=dataframe,\n                max_depth=max_depth,\n                column_schemas=list(input_type),\n            )\n            if not features:\n                continue\n\n            if feature_filter:\n                features = [f for f in features if feature_filter(f)]\n\n            matches = match(\n                input_type,\n                features,\n                commutative=primitive.commutative,\n                require_direct_input=require_direct_input,\n            )\n\n            matching_inputs.extend(matches)\n\n        # everything following depends on populated matching_inputs\n        if not matching_inputs:\n            return matching_inputs\n\n        if require_direct_input:\n            # Don't create trans features of inputs which are all direct\n            # features with the same relationship_path.\n            matching_inputs = {\n                inputs\n                for inputs in matching_inputs\n                if not _all_direct_and_same_path(inputs)\n            }\n        matching_inputs = filter_matches_by_options(\n            matching_inputs,\n            primitive_options,\n            commutative=primitive.commutative,\n        )\n\n        # Don't build features on numeric foreign key columns\n        matching_inputs = 
[\n            match\n            for match in matching_inputs\n            if not _match_contains_numeric_foreign_key(match)\n        ]\n\n        return matching_inputs\n\n\ndef _match_contains_numeric_foreign_key(match):\n    match_schema = ColumnSchema(semantic_tags={\"foreign_key\", \"numeric\"})\n    return any(True for f in match if is_valid_input(f.column_schema, match_schema))\n\n\ndef not_a_transform_input(feature):\n    \"\"\"\n    Verifies transform inputs are not transform features or direct features of transform features\n    Returns True if a transform primitive can stack on the feature, and False if it cannot.\n    \"\"\"\n    primitive = _find_root_primitive(feature)\n    return not isinstance(primitive, TransformPrimitive)\n\n\ndef _find_root_primitive(feature):\n    \"\"\"\n    If a feature is a DirectFeature, finds the primitive of\n    the \"original\" base feature.\n    \"\"\"\n    if isinstance(feature, DirectFeature):\n        return _find_root_primitive(feature.base_features[0])\n    return feature.primitive\n\n\ndef _check_if_stacking_is_prohibited(\n    feature: FeatureBase,\n    f_primitive: PrimitiveBase,\n    primitive: PrimitiveBase,\n    primitive_class: Type[PrimitiveBase],\n    primitive_stack_on_self: bool,\n    tuple_primitive_stack_on_exclude: Tuple[Type[PrimitiveBase]],\n):\n    if not primitive_stack_on_self and isinstance(f_primitive, primitive_class):\n        return True\n\n    if isinstance(f_primitive, tuple_primitive_stack_on_exclude):\n        return True\n\n    if feature.number_output_features > 1:\n        return True\n\n    if f_primitive.base_of_exclude is not None and isinstance(\n        primitive,\n        tuple(f_primitive.base_of_exclude),\n    ):\n        return True\n    return False\n\n\ndef _check_if_stacking_is_permitted(\n    f_primitive: PrimitiveBase,\n    primitive_class: Type[PrimitiveBase],\n    primitive_stack_on_self: bool,\n    tuple_primitive_stack_on: Tuple[Type[PrimitiveBase]],\n):\n    if 
primitive_stack_on_self and isinstance(f_primitive, primitive_class):\n        return True\n    if tuple_primitive_stack_on is None or isinstance(\n        f_primitive,\n        tuple_primitive_stack_on,\n    ):\n        return True\n    if f_primitive.base_of is None:\n        return True\n    if primitive_class in f_primitive.base_of:\n        return True\n    return False\n\n\ndef can_stack_primitive_on_inputs(primitive: PrimitiveBase, inputs: List[FeatureBase]):\n    \"\"\"\n    Checks if features in inputs can be used with supplied primitive\n    using the stacking rules.\n    Returns True if stacking is possible, and False if not.\n    \"\"\"\n\n    primitive_class = primitive.__class__\n    tuple_primitive_stack_on = (\n        tuple(primitive.stack_on) if primitive.stack_on is not None else None\n    )\n    tuple_primitive_stack_on_exclude = (\n        tuple(primitive.stack_on_exclude)\n        if primitive.stack_on_exclude is not None\n        else tuple()\n    )\n    primitive_stack_on_self: bool = primitive.stack_on_self\n\n    for feature in inputs:\n        # In the case that the feature is a DirectFeature, the feature's primitive will be a PrimitiveBase object.\n        # However, we want to check stacking rules with the primitive the DirectFeature is based on.\n        f_primitive = _find_root_primitive(feature)\n\n        # check if stacking is prohibited\n        if _check_if_stacking_is_prohibited(\n            feature,\n            f_primitive,\n            primitive,\n            primitive_class,\n            primitive_stack_on_self,\n            tuple_primitive_stack_on_exclude,\n        ):\n            return False\n\n        # we permit stacking only if it is not prohibited and meets the criterion to be permitted\n        if not _check_if_stacking_is_permitted(\n            f_primitive,\n            primitive_class,\n            primitive_stack_on_self,\n            tuple_primitive_stack_on,\n        ):\n            return False\n\n    # if 
we reach this line nothing is prohibited and stacking is permitted for all inputs\n    return True\n\n\ndef match_by_schema(features, column_schema):\n    return [f for f in features if is_valid_input(f.column_schema, column_schema)]\n\n\ndef match(\n    input_types,\n    features,\n    replace=False,\n    commutative=False,\n    require_direct_input=False,\n):\n    to_match = input_types[0]\n\n    matches = match_by_schema(features, to_match)\n\n    if len(input_types) == 1:\n        return [\n            (m,)\n            for m in matches\n            if (not require_direct_input or isinstance(m, DirectFeature))\n        ]\n\n    matching_inputs = set()\n\n    for m in matches:\n        copy = features[:]\n\n        if not replace:\n            copy = [c for c in copy if c.unique_name() != m.unique_name()]\n\n        # If we need a DirectFeature and this is not a DirectFeature then one of the rest must be.\n        still_require_direct_input = require_direct_input and not isinstance(\n            m,\n            DirectFeature,\n        )\n        rest = match(\n            input_types[1:],\n            copy,\n            replace,\n            require_direct_input=still_require_direct_input,\n        )\n\n        for r in rest:\n            new_match = [m] + list(r)\n\n            # commutative uses frozenset instead of tuple because it doesn't\n            # want multiple orderings of the same input\n            if commutative:\n                new_match = frozenset(new_match)\n            else:\n                new_match = tuple(new_match)\n            matching_inputs.add(new_match)\n\n    if commutative:\n        matching_inputs = {\n            tuple(sorted(s, key=lambda x: x.get_name().lower()))\n            for s in matching_inputs\n        }\n\n    return matching_inputs\n\n\ndef handle_primitive(primitive):\n    if not isinstance(primitive, PrimitiveBase):\n        primitive = primitive()\n    assert isinstance(primitive, PrimitiveBase), \"must be a 
primitive\"\n    return primitive\n\n\ndef check_primitive(\n    primitive,\n    prim_type,\n    aggregation_primitive_dict,\n    transform_primitive_dict,\n):\n    if prim_type in (\"transform\", \"groupby transform\"):\n        prim_dict = transform_primitive_dict\n        supertype = TransformPrimitive\n        arg_name = (\n            \"trans_primitives\"\n            if prim_type == \"transform\"\n            else \"groupby_trans_primitives\"\n        )\n        s = \"a transform\"\n    if prim_type in (\"aggregation\", \"where\"):\n        prim_dict = aggregation_primitive_dict\n        supertype = AggregationPrimitive\n        arg_name = (\n            \"agg_primitives\" if prim_type == \"aggregation\" else \"where_primitives\"\n        )\n        s = \"an aggregation\"\n\n    if isinstance(primitive, str):\n        prim_string = camel_and_title_to_snake(primitive)\n        if prim_string not in prim_dict:\n            raise ValueError(\n                \"Unknown {} primitive {}. 
\"\n                \"Call ft.primitives.list_primitives() to get\"\n                \" a list of available primitives\".format(prim_type, prim_string),\n            )\n        primitive = prim_dict[prim_string]\n    primitive = handle_primitive(primitive)\n    if not isinstance(primitive, supertype):\n        raise ValueError(\n            \"Primitive {} in {} is not {} \" \"primitive\".format(\n                type(primitive),\n                arg_name,\n                s,\n            ),\n        )\n    return primitive\n\n\ndef _all_direct_and_same_path(input_features: List[FeatureBase]) -> bool:\n    \"\"\"Given a list of features, returns True if they are all\n    DirectFeatures with the same relationship_path, and False if not\n    \"\"\"\n    path = input_features[0].relationship_path\n    for f in input_features:\n        if not isinstance(f, DirectFeature) or f.relationship_path != path:\n            return False\n    return True\n\n\ndef _build_ignore_columns(input_dict: Dict[str, List[str]]) -> DefaultDict[str, set]:\n    \"\"\"Iterates over the input dictionary to build the ignore_columns defaultdict.\n    Expects the input_dict's keys to be strings, and values to be lists of strings.\n    Throws a TypeError if they are not.\n    \"\"\"\n    ignore_columns = defaultdict(set)\n    if input_dict is not None:\n        for df_name, cols in input_dict.items():\n            if not isinstance(df_name, str) or not isinstance(cols, list):\n                raise TypeError(\"ignore_columns should be dict[str -> list]\")\n            elif not all(isinstance(c, str) for c in cols):\n                raise TypeError(\"list in ignore_columns must only have string values\")\n            ignore_columns[df_name] = set(cols)\n    return ignore_columns\n\n\ndef _direct_of_dataframe(feature, parent_dataframe):\n    return (\n        isinstance(feature, DirectFeature)\n        and feature.parent_dataframe_name == parent_dataframe.ww.name\n    )\n\n\ndef 
get_feature_depth(feature, stop_at=None):\n    \"\"\"Helper method to allow caching of feature.get_depth()\n    Why here and not in FeatureBase?  This keeps the caching\n    local to DFS.\n    \"\"\"\n    hash_key = hash(f\"{feature.get_name()}{feature.dataframe_name}{stop_at}\")\n    if cached_depth := feature_cache.get(CacheType.DEPTH, hash_key):\n        return cached_depth\n    depth = feature.get_depth(stop_at=stop_at)\n    feature_cache.add(CacheType.DEPTH, hash_key, depth)\n    return depth\n"
  },
  {
    "path": "featuretools/synthesis/dfs.py",
    "content": "import warnings\n\nfrom featuretools.computational_backends import calculate_feature_matrix\nfrom featuretools.entityset import EntitySet\nfrom featuretools.exceptions import UnusedPrimitiveWarning\nfrom featuretools.synthesis.deep_feature_synthesis import DeepFeatureSynthesis\nfrom featuretools.synthesis.utils import _categorize_features, get_unused_primitives\nfrom featuretools.utils import entry_point\n\n\n@entry_point(\"featuretools_dfs\")\ndef dfs(\n    dataframes=None,\n    relationships=None,\n    entityset=None,\n    target_dataframe_name=None,\n    cutoff_time=None,\n    instance_ids=None,\n    agg_primitives=None,\n    trans_primitives=None,\n    groupby_trans_primitives=None,\n    allowed_paths=None,\n    max_depth=2,\n    ignore_dataframes=None,\n    ignore_columns=None,\n    primitive_options=None,\n    seed_features=None,\n    drop_contains=None,\n    drop_exact=None,\n    where_primitives=None,\n    max_features=-1,\n    cutoff_time_in_index=False,\n    save_progress=None,\n    features_only=False,\n    training_window=None,\n    approximate=None,\n    chunk_size=None,\n    n_jobs=1,\n    dask_kwargs=None,\n    verbose=False,\n    return_types=None,\n    progress_callback=None,\n    include_cutoff_time=True,\n):\n    \"\"\"Calculates a feature matrix and features given a dictionary of dataframes\n    and a list of relationships.\n\n\n    Args:\n        dataframes (dict[str -> tuple(DataFrame, str, str, dict[str -> str/Woodwork.LogicalType], dict[str->str/set], boolean)]):\n            Dictionary of DataFrames. Entries take the format\n            {dataframe name -> (dataframe, index column, time_index, logical_types, semantic_tags, make_index)}.\n            Note that only the dataframe is required. If a Woodwork DataFrame is supplied, any other parameters\n            will be ignored.\n\n        relationships (list[(str, str, str, str)]): List of relationships\n            between dataframes. 
List items are a tuple with the format\n            (parent dataframe name, parent column, child dataframe name, child column).\n\n        entityset (EntitySet): An already initialized entityset. Required if\n            dataframes and relationships are not defined.\n\n        target_dataframe_name (str): Name of dataframe on which to make predictions.\n\n        cutoff_time (pd.DataFrame or Datetime or str): Specifies times at which to calculate\n            the features for each instance. The resulting feature matrix will use data\n            up to and including the cutoff_time. Can either be a DataFrame, a single\n            value, or a string that can be parsed into a datetime. If a DataFrame is passed\n            the instance ids for which to calculate features must be in a column with the\n            same name as the target dataframe index or a column named `instance_id`.\n            The cutoff time values in the DataFrame must be in a column with the same name as\n            the target dataframe time index or a column named `time`. If the DataFrame has more\n            than two columns, any additional columns will be added to the resulting feature\n            matrix. If a single value is passed, this value will be used for all instances.\n\n        instance_ids (list): List of instances on which to calculate features. 
Only\n            used if cutoff_time is a single datetime.\n\n        agg_primitives (list[str or AggregationPrimitive], optional): List of Aggregation\n            Feature types to apply.\n\n                Default: [\"sum\", \"std\", \"max\", \"skew\", \"min\", \"mean\", \"count\", \"percent_true\", \"num_unique\", \"mode\"]\n\n        trans_primitives (list[str or TransformPrimitive], optional):\n            List of Transform Feature functions to apply.\n\n                Default: [\"day\", \"year\", \"month\", \"weekday\", \"haversine\", \"num_words\", \"num_characters\"]\n\n        groupby_trans_primitives (list[str or TransformPrimitive], optional):\n            list of Transform primitives to make GroupByTransformFeatures with\n\n        allowed_paths (list[list[str]]): Allowed dataframe paths on which to make\n            features.\n\n        max_depth (int) : Maximum allowed depth of features.\n\n        ignore_dataframes (list[str], optional): List of dataframes to\n            blacklist when creating features.\n\n        ignore_columns (dict[str -> list[str]], optional): List of specific\n            columns within each dataframe to blacklist when creating features.\n\n        primitive_options (list[dict[str or tuple[str] -> dict] or dict[str or tuple[str] -> dict, optional]):\n            Specify options for a single primitive or a group of primitives.\n            Lists of option dicts are used to specify options per input for primitives\n            with multiple inputs. Each option ``dict`` can have the following keys:\n\n            ``\"include_dataframes\"``\n                List of dataframes to be included when creating features for\n                the primitive(s). 
All other dataframes will be ignored\n                (list[str]).\n            ``\"ignore_dataframes\"``\n                List of dataframes to be blacklisted when creating features\n                for the primitive(s) (list[str]).\n            ``\"include_columns\"``\n                List of specific columns within each dataframe to include when\n                creating features for the primitive(s). All other columns\n                in a given dataframe will be ignored (dict[str -> list[str]]).\n            ``\"ignore_columns\"``\n                List of specific columns within each dataframe to blacklist\n                when creating features for the primitive(s) (dict[str ->\n                list[str]]).\n            ``\"include_groupby_dataframes\"``\n                List of dataframes to be included when finding groupbys. All\n                other dataframes will be ignored (list[str]).\n            ``\"ignore_groupby_dataframes\"``\n                List of dataframes to blacklist when finding groupbys\n                (list[str]).\n            ``\"include_groupby_columns\"``\n                List of specific columns within each dataframe to include as\n                groupbys, if applicable. 
All other columns in each\n                dataframe will be ignored (dict[str -> list[str]]).\n            ``\"ignore_groupby_columns\"``\n                List of specific columns within each dataframe to blacklist\n                as groupbys (dict[str -> list[str]]).\n\n        seed_features (list[:class:`.FeatureBase`]): List of manually defined\n            features to use.\n\n        drop_contains (list[str], optional): Drop features\n            that contain these strings in name.\n\n        drop_exact (list[str], optional): Drop features that\n            exactly match these strings in name.\n\n        where_primitives (list[str or PrimitiveBase], optional):\n            List of primitive names (or types) to apply with where clauses.\n\n                Default:\n\n                    [\"count\"]\n\n        max_features (int, optional) : Cap the number of generated features to\n                this number. If -1, no limit.\n\n        features_only (bool, optional): If True, returns the list of\n            features without calculating the feature matrix.\n\n        cutoff_time_in_index (bool): If True, return a DataFrame with a MultiIndex\n            where the second index is the cutoff time (first is instance id).\n            DataFrame will be sorted by (time, instance_id).\n\n        training_window (Timedelta or str, optional):\n            Window defining how much time before the cutoff time data\n            can be used when calculating features. If ``None``, all data\n            before cutoff time is used. Defaults to ``None``. Month and year\n            units are not relative when Pandas Timedeltas are used. Relative\n            units should be passed as a Featuretools Timedelta or a string.\n\n        approximate (Timedelta): Bucket size to group instances with similar\n            cutoff times by for features with costly calculations. 
For example,\n            if bucket is 24 hours, all instances with cutoff times on the same\n            day will use the same calculation for expensive features.\n\n        save_progress (str, optional): Path to save intermediate computational results.\n\n        n_jobs (int, optional): number of parallel processes to use when\n            calculating feature matrix\n\n        chunk_size (int or float or None or \"cutoff time\", optional): Number\n            of rows of output feature matrix to calculate at a time. If passed an\n            integer greater than 0, will try to use that many rows per chunk.\n            If passed a float value between 0 and 1, sets the chunk size to that\n            percentage of all instances. If passed the string \"cutoff time\",\n            rows are split per cutoff time.\n\n        dask_kwargs (dict, optional): Dictionary of keyword arguments to be\n            passed when creating the dask client and scheduler. Even if n_jobs\n            is not set, using `dask_kwargs` will enable multiprocessing.\n            Main parameters:\n\n            cluster (str or dask.distributed.LocalCluster):\n                cluster or address of cluster to send tasks to. If unspecified,\n                a cluster will be created.\n            diagnostics port (int):\n                port number to use for web dashboard.  If left unspecified, web\n                interface will not be enabled.\n\n            Valid keyword arguments for LocalCluster will also be accepted.\n\n        return_types (list[woodwork.ColumnSchema] or str, optional):\n            List of ColumnSchemas defining the types of\n            columns to return. If None, defaults to returning all\n            numeric, categorical and boolean types. 
If given as\n            the string 'all', returns all available types.\n\n        progress_callback (callable): function to be called with incremental progress updates.\n            Has the following parameters:\n\n                update: percentage change (float between 0 and 100) in progress since last call\n                progress_percent: percentage (float between 0 and 100) of total computation completed\n                time_elapsed: total time in seconds that has elapsed since start of call\n\n        include_cutoff_time (bool): Include data at cutoff times in feature calculations. Defaults to ``True``.\n\n    Returns:\n        pd.DataFrame, list[:class:`.FeatureBase`]:\n            The feature matrix and the list of generated feature definitions.\n            If ``features_only`` is ``True``, the feature matrix will not be generated\n            and only the list of features is returned.\n\n    Examples:\n        .. code-block:: python\n\n            from featuretools.primitives import Mean\n            # cutoff times per instance\n            dataframes = {\n                \"sessions\" : (session_df, \"id\"),\n                \"transactions\" : (transactions_df, \"id\", \"transaction_time\")\n            }\n            relationships = [(\"sessions\", \"id\", \"transactions\", \"session_id\")]\n            feature_matrix, features = dfs(dataframes=dataframes,\n                                           relationships=relationships,\n                                           target_dataframe_name=\"transactions\",\n                                           cutoff_time=cutoff_times)\n            feature_matrix\n\n            features = dfs(dataframes=dataframes,\n                           relationships=relationships,\n                           target_dataframe_name=\"transactions\",\n                           features_only=True)\n    \"\"\"\n    if not isinstance(entityset, EntitySet):\n        entityset = EntitySet(\"dfs\", dataframes, relationships)\n\n    dfs_object = DeepFeatureSynthesis(\n        
target_dataframe_name,\n        entityset,\n        agg_primitives=agg_primitives,\n        trans_primitives=trans_primitives,\n        groupby_trans_primitives=groupby_trans_primitives,\n        max_depth=max_depth,\n        where_primitives=where_primitives,\n        allowed_paths=allowed_paths,\n        drop_exact=drop_exact,\n        drop_contains=drop_contains,\n        ignore_dataframes=ignore_dataframes,\n        ignore_columns=ignore_columns,\n        primitive_options=primitive_options,\n        max_features=max_features,\n        seed_features=seed_features,\n    )\n\n    features = dfs_object.build_features(verbose=verbose, return_types=return_types)\n\n    trans, agg, groupby, where = _categorize_features(features)\n\n    trans_unused = get_unused_primitives(trans_primitives, trans)\n    agg_unused = get_unused_primitives(agg_primitives, agg)\n    groupby_unused = get_unused_primitives(groupby_trans_primitives, groupby)\n    where_unused = get_unused_primitives(where_primitives, where)\n\n    unused_primitives = [trans_unused, agg_unused, groupby_unused, where_unused]\n    if any(unused_primitives):\n        warn_unused_primitives(unused_primitives)\n\n    if features_only:\n        return features\n\n    assert (\n        features != []\n    ), \"No features can be generated from the specified primitives. 
Please make sure the primitives you are using are compatible with the variable types in your data.\"\n\n    feature_matrix = calculate_feature_matrix(\n        features,\n        entityset=entityset,\n        cutoff_time=cutoff_time,\n        instance_ids=instance_ids,\n        training_window=training_window,\n        approximate=approximate,\n        cutoff_time_in_index=cutoff_time_in_index,\n        save_progress=save_progress,\n        chunk_size=chunk_size,\n        n_jobs=n_jobs,\n        dask_kwargs=dask_kwargs,\n        verbose=verbose,\n        progress_callback=progress_callback,\n        include_cutoff_time=include_cutoff_time,\n    )\n    return feature_matrix, features\n\n\ndef warn_unused_primitives(unused_primitives):\n    messages = [\n        \"  trans_primitives: {}\\n\",\n        \"  agg_primitives: {}\\n\",\n        \"  groupby_trans_primitives: {}\\n\",\n        \"  where_primitives: {}\\n\",\n    ]\n    unused_string = \"\"\n    for primitives, message in zip(unused_primitives, messages):\n        if primitives:\n            unused_string += message.format(primitives)\n\n    warning_msg = (\n        \"Some specified primitives were not used during DFS:\\n{}\".format(unused_string)\n        + \"This may be caused by a using a value of max_depth that is too small, not setting interesting values, \"\n        + \"or it may indicate no compatible columns for the primitive were found in the data. If the DFS call \"\n        + \"contained multiple instances of a primitive in the list above, none of them were used.\"\n    )\n\n    warnings.warn(warning_msg, UnusedPrimitiveWarning)\n"
  },
  {
    "path": "featuretools/synthesis/encode_features.py",
    "content": "import logging\n\nimport pandas as pd\n\nfrom featuretools.computational_backends.utils import get_ww_types_from_features\nfrom featuretools.utils.gen_utils import make_tqdm_iterator\n\nlogger = logging.getLogger(\"featuretools\")\n\nDEFAULT_TOP_N = 10\n\n\ndef encode_features(\n    feature_matrix,\n    features,\n    top_n=DEFAULT_TOP_N,\n    include_unknown=True,\n    to_encode=None,\n    inplace=False,\n    drop_first=False,\n    verbose=False,\n):\n    \"\"\"Encode categorical features\n\n    Args:\n        feature_matrix (pd.DataFrame): Dataframe of features.\n        features (list[PrimitiveBase]): Feature definitions in feature_matrix.\n        top_n (int or dict[string -> int]): Number of top values to include.\n            If dict[string -> int] is used, key is feature name and value is\n            the number of top values to include for that feature.\n            If a feature's name is not in dictionary, a default value of 10 is used.\n        include_unknown (pd.DataFrame): Add feature encoding an unknown class.\n            defaults to True\n        to_encode (list[str]): List of feature names to encode.\n            features not in this list are unencoded in the output matrix\n            defaults to encode all necessary features.\n        inplace (bool): Encode feature_matrix in place. Defaults to False.\n        drop_first (bool): Whether to get k-1 dummies out of k categorical\n                levels by removing the first level.\n                defaults to False\n        verbose (str): Print progress info.\n\n    Returns:\n        (pd.Dataframe, list) : encoded feature_matrix, encoded features\n\n    Example:\n        .. ipython:: python\n            :suppress:\n\n            from featuretools.tests.testing_utils import make_ecommerce_entityset\n            import featuretools as ft\n            es = make_ecommerce_entityset()\n\n        .. 
ipython:: python\n\n            f1 = ft.Feature(es[\"log\"].ww[\"product_id\"])\n            f2 = ft.Feature(es[\"log\"].ww[\"purchased\"])\n            f3 = ft.Feature(es[\"log\"].ww[\"value\"])\n\n            features = [f1, f2, f3]\n            ids = [0, 1, 2, 3, 4, 5]\n            feature_matrix = ft.calculate_feature_matrix(features, es,\n                                                         instance_ids=ids)\n\n            fm_encoded, f_encoded = ft.encode_features(feature_matrix,\n                                                       features)\n            f_encoded\n\n            fm_encoded, f_encoded = ft.encode_features(feature_matrix,\n                                                       features, top_n=2)\n            f_encoded\n\n            fm_encoded, f_encoded = ft.encode_features(feature_matrix, features,\n                                                       include_unknown=False)\n            f_encoded\n\n            fm_encoded, f_encoded = ft.encode_features(feature_matrix, features,\n                                                       to_encode=['purchased'])\n            f_encoded\n\n            fm_encoded, f_encoded = ft.encode_features(feature_matrix, features,\n                                                       drop_first=True)\n            f_encoded\n    \"\"\"\n    if inplace:\n        X = feature_matrix\n    else:\n        X = feature_matrix.copy()\n\n    old_feature_names = set()\n    for feature in features:\n        for fname in feature.get_feature_names():\n            assert fname in X.columns, \"Feature %s not found in feature matrix\" % (\n                fname\n            )\n            old_feature_names.add(fname)\n\n    pass_through = [col for col in X.columns if col not in old_feature_names]\n\n    if verbose:\n        iterator = make_tqdm_iterator(\n            iterable=features,\n            total=len(features),\n            desc=\"Encoding pass 1\",\n            unit=\"feature\",\n        )\n    else:\n       
 iterator = features\n\n    new_feature_list = []\n    kept_columns = []\n    encoded_columns = []\n    columns_info = feature_matrix.ww.columns\n\n    for f in iterator:\n        # TODO: features with multiple columns are not encoded by this method,\n        # which can cause an \"encoded\" matrix with non-numeric values\n        is_discrete = {\"category\", \"foreign_key\"}.intersection(\n            f.column_schema.semantic_tags,\n        )\n        if f.number_output_features > 1 or not is_discrete:\n            if f.number_output_features > 1:\n                logger.warning(\n                    \"Feature %s has multiple columns and will not \"\n                    \"be encoded.  This may result in a matrix with\"\n                    \" non-numeric values.\" % (f),\n                )\n            new_feature_list.append(f)\n            kept_columns.extend(f.get_feature_names())\n            continue\n\n        if to_encode is not None and f.get_name() not in to_encode:\n            new_feature_list.append(f)\n            kept_columns.extend(f.get_feature_names())\n            continue\n\n        val_counts = X[f.get_name()].value_counts()\n        # Remove 0 count category values\n        val_counts = val_counts[val_counts > 0].to_frame()\n        index_name = val_counts.index.name\n        val_counts = val_counts.rename(columns={val_counts.columns[0]: \"count\"})\n        if index_name is None:\n            if \"index\" in val_counts.columns:\n                index_name = \"level_0\"\n            else:\n                index_name = \"index\"\n        val_counts.reset_index(inplace=True)\n        val_counts = val_counts.sort_values([\"count\", index_name], ascending=False)\n        val_counts.set_index(index_name, inplace=True)\n        select_n = top_n\n        if isinstance(top_n, dict):\n            select_n = top_n.get(f.get_name(), DEFAULT_TOP_N)\n        if drop_first:\n            # Use the resolved per-feature count; top_n itself may be a dict\n            select_n = min(len(val_counts), select_n)\n            select_n = 
max(select_n - 1, 1)\n        unique = val_counts.head(select_n).index.tolist()\n        for label in unique:\n            add = f == label\n            add_name = add.get_name()\n            new_feature_list.append(add)\n            new_col = X[f.get_name()] == label\n            new_col.rename(add_name, inplace=True)\n            encoded_columns.append(new_col)\n\n        if include_unknown:\n            unknown = f.isin(unique).NOT().rename(f.get_name() + \" is unknown\")\n            unknown_name = unknown.get_name()\n            new_feature_list.append(unknown)\n            new_col = ~X[f.get_name()].isin(unique)\n            new_col.rename(unknown_name, inplace=True)\n            encoded_columns.append(new_col)\n\n        if inplace:\n            X.drop(f.get_name(), axis=1, inplace=True)\n\n    kept_columns.extend(pass_through)\n\n    if inplace:\n        for encoded_column in encoded_columns:\n            X[encoded_column.name] = encoded_column\n    else:\n        X = pd.concat([X[kept_columns]] + encoded_columns, axis=1)\n\n    entityset = new_feature_list[0].entityset\n    ww_init_kwargs = get_ww_types_from_features(new_feature_list, entityset)\n\n    # Grab ww metadata from feature matrix since it may be more exact\n    for column in kept_columns:\n        ww_init_kwargs[\"logical_types\"][column] = columns_info[column].logical_type\n        ww_init_kwargs[\"semantic_tags\"][column] = columns_info[column].semantic_tags\n        ww_init_kwargs[\"column_origins\"][column] = columns_info[column].origin\n\n    X.ww.init(**ww_init_kwargs)\n    return X, new_feature_list\n"
  },
  {
    "path": "featuretools/synthesis/get_valid_primitives.py",
    "content": "from featuretools.primitives import AggregationPrimitive, TransformPrimitive\nfrom featuretools.primitives.utils import (\n    get_aggregation_primitives,\n    get_transform_primitives,\n)\nfrom featuretools.synthesis.deep_feature_synthesis import DeepFeatureSynthesis\nfrom featuretools.synthesis.utils import _categorize_features, get_unused_primitives\n\n\ndef get_valid_primitives(\n    entityset,\n    target_dataframe_name,\n    max_depth=2,\n    selected_primitives=None,\n    **dfs_kwargs,\n):\n    \"\"\"\n    Returns two lists of primitives (transform and aggregation) containing\n    primitives that can be applied to the specific target dataframe to create\n    features.  If the optional 'selected_primitives' parameter is not used,\n    all discoverable primitives will be considered.\n\n    Note:\n        When using a ``max_depth`` greater than 1, some primitives returned by\n        this function may not create any features if passed to DFS alone.  These\n        primitives relied on features created by other primitives as input\n        (primitive stacking).\n\n    Args:\n        entityset (EntitySet): An already initialized entityset\n        target_dataframe_name (str): Name of dataframe to create features for.\n        max_depth (int, optional): Maximum allowed depth of features.\n        selected_primitives(list[str or AggregationPrimitive/TransformPrimitive], optional):\n            list of primitives to consider when looking for valid primitives.\n            If None, all primitives will be considered\n        dfs_kwargs (keywords): Additional keyword arguments to pass as keyword arguments to\n            the DeepFeatureSynthesis object. 
Should not include ``max_depth``, ``agg_primitives``,\n            or ``trans_primitives``, as those are passed in explicitly.\n    Returns:\n       list[AggregationPrimitive], list[TransformPrimitive]:\n           The list of valid aggregation primitives and the list of valid\n           transform primitives.\n    \"\"\"\n    agg_primitives = []\n    trans_primitives = []\n    available_aggs = get_aggregation_primitives()\n    available_trans = get_transform_primitives()\n\n    if selected_primitives:\n        for prim in selected_primitives:\n            if not isinstance(prim, str):\n                if issubclass(prim, AggregationPrimitive):\n                    prim_list = agg_primitives\n                elif issubclass(prim, TransformPrimitive):\n                    prim_list = trans_primitives\n                else:\n                    raise ValueError(\n                        f\"Selected primitive {prim} is not an \"\n                        \"AggregationPrimitive, TransformPrimitive, or str\",\n                    )\n            elif prim in available_aggs:\n                prim = available_aggs[prim]\n                prim_list = agg_primitives\n            elif prim in available_trans:\n                prim = available_trans[prim]\n                prim_list = trans_primitives\n            else:\n                raise ValueError(f\"'{prim}' is not a recognized primitive name\")\n            prim_list.append(prim)\n    else:\n        agg_primitives = [agg for agg in available_aggs.values()]\n        trans_primitives = [trans for trans in available_trans.values()]\n\n    dfs_object = DeepFeatureSynthesis(\n        target_dataframe_name,\n        entityset,\n        agg_primitives=agg_primitives,\n        trans_primitives=trans_primitives,\n        max_depth=max_depth,\n        **dfs_kwargs,\n    )\n\n    features = dfs_object.build_features()\n\n    trans, agg, _, _ = _categorize_features(features)\n\n    trans_unused = 
get_unused_primitives(trans_primitives, trans)\n    agg_unused = get_unused_primitives(agg_primitives, agg)\n\n    # switch from str to class\n    agg_unused = [available_aggs[name] for name in agg_unused]\n    trans_unused = [available_trans[name] for name in trans_unused]\n\n    used_agg_prims = set(agg_primitives).difference(set(agg_unused))\n    used_trans_prims = set(trans_primitives).difference(set(trans_unused))\n    return list(used_agg_prims), list(used_trans_prims)\n"
  },
  {
    "path": "featuretools/synthesis/utils.py",
    "content": "from featuretools.feature_base import (\n    AggregationFeature,\n    FeatureOutputSlice,\n    GroupByTransformFeature,\n    TransformFeature,\n)\nfrom featuretools.utils.gen_utils import camel_and_title_to_snake\n\n\ndef _categorize_features(features):\n    \"\"\"Categorize each feature by its primitive type in a set of primitives along with any dependencies\"\"\"\n    transform = set()\n    agg = set()\n    groupby = set()\n    where = set()\n    explored = set()\n\n    def get_feature_data(feature):\n        if feature.get_name() in explored:\n            return\n\n        dependencies = []\n\n        if isinstance(feature, FeatureOutputSlice):\n            feature = feature.base_feature\n\n        if isinstance(feature, AggregationFeature):\n            if feature.where:\n                where.add(feature.primitive.name)\n            else:\n                agg.add(feature.primitive.name)\n        elif isinstance(feature, GroupByTransformFeature):\n            groupby.add(feature.primitive.name)\n        elif isinstance(feature, TransformFeature):\n            transform.add(feature.primitive.name)\n\n        feature_deps = feature.get_dependencies()\n        if feature_deps:\n            dependencies.extend(feature_deps)\n\n        explored.add(feature.get_name())\n\n        for dep in dependencies:\n            get_feature_data(dep)\n\n    for feature in features:\n        get_feature_data(feature)\n\n    return transform, agg, groupby, where\n\n\ndef get_unused_primitives(specified, used):\n    \"\"\"Get a list of unused primitives based on a list of specified primitives and a list of output features\"\"\"\n    if not specified:\n        return []\n    specified = {\n        camel_and_title_to_snake(primitive)\n        if isinstance(primitive, str)\n        else primitive.name\n        for primitive in specified\n    }\n    return sorted(specified.difference(used))\n"
  },
  {
    "path": "featuretools/tests/__init__.py",
    "content": ""
  },
  {
    "path": "featuretools/tests/computational_backend/__init__.py",
    "content": ""
  },
  {
    "path": "featuretools/tests/computational_backend/test_calculate_feature_matrix.py",
    "content": "import logging\nimport os\nimport re\nimport shutil\nfrom datetime import datetime\nfrom itertools import combinations\nfrom random import randint\n\nimport numpy as np\nimport pandas as pd\nimport psutil\nimport pytest\nfrom tqdm import tqdm\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import (\n    Age,\n    AgeNullable,\n    Boolean,\n    BooleanNullable,\n    Integer,\n    IntegerNullable,\n)\n\nfrom featuretools import (\n    EntitySet,\n    Feature,\n    GroupByTransformFeature,\n    Timedelta,\n    calculate_feature_matrix,\n    dfs,\n)\nfrom featuretools.computational_backends import utils\nfrom featuretools.computational_backends.calculate_feature_matrix import (\n    FEATURE_CALCULATION_PERCENTAGE,\n    _chunk_dataframe_groups,\n    _handle_chunk_size,\n    scatter_warning,\n)\nfrom featuretools.computational_backends.utils import (\n    bin_cutoff_times,\n    create_client_and_cluster,\n    n_jobs_to_workers,\n)\nfrom featuretools.feature_base import (\n    AggregationFeature,\n    DirectFeature,\n    FeatureOutputSlice,\n    IdentityFeature,\n)\nfrom featuretools.primitives import (\n    Count,\n    Max,\n    Min,\n    Negate,\n    NMostCommon,\n    Percentile,\n    Sum,\n    TransformPrimitive,\n)\nfrom featuretools.tests.testing_utils import (\n    backward_path,\n    get_mock_client_cluster,\n)\n\n\ndef test_scatter_warning(caplog):\n    logger = logging.getLogger(\"featuretools\")\n    match = \"EntitySet was only scattered to {} out of {} workers\"\n    warning_message = match.format(1, 2)\n    logger.propagate = True\n    scatter_warning(1, 2)\n    logger.propagate = False\n    assert warning_message in caplog.text\n\n\ndef test_calc_feature_matrix(es):\n    times = list(\n        [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)]\n        + [datetime(2011, 4, 9, 10, 31, i * 9) for i in range(4)]\n        + [datetime(2011, 4, 9, 10, 40, 0)]\n        + [datetime(2011, 4, 10, 10, 40, i) for i in 
range(2)]\n        + [datetime(2011, 4, 10, 10, 41, i * 3) for i in range(3)]\n        + [datetime(2011, 4, 10, 11, 10, i * 3) for i in range(2)],\n    )\n    instances = range(17)\n    cutoff_time = pd.DataFrame({\"time\": times, es[\"log\"].ww.index: instances})\n    labels = [False] * 3 + [True] * 2 + [False] * 9 + [True] + [False] * 2\n\n    property_feature = Feature(es[\"log\"].ww[\"value\"]) > 10\n\n    feature_matrix = calculate_feature_matrix(\n        [property_feature],\n        es,\n        cutoff_time=cutoff_time,\n        verbose=True,\n    )\n\n    assert (feature_matrix[property_feature.get_name()] == labels).values.all()\n\n    error_text = \"features must be a non-empty list of features\"\n    with pytest.raises(AssertionError, match=error_text):\n        feature_matrix = calculate_feature_matrix(\n            \"features\",\n            es,\n            cutoff_time=cutoff_time,\n        )\n\n    with pytest.raises(AssertionError, match=error_text):\n        feature_matrix = calculate_feature_matrix([], es, cutoff_time=cutoff_time)\n\n    with pytest.raises(AssertionError, match=error_text):\n        feature_matrix = calculate_feature_matrix(\n            [1, 2, 3],\n            es,\n            cutoff_time=cutoff_time,\n        )\n\n    error_text = (\n        \"cutoff_time times must be datetime type: try casting via \"\n        \"pd\\\\.to_datetime\\\\(\\\\)\"\n    )\n    with pytest.raises(TypeError, match=error_text):\n        calculate_feature_matrix(\n            [property_feature],\n            es,\n            instance_ids=range(17),\n            cutoff_time=17,\n        )\n\n    error_text = \"cutoff_time must be a single value or DataFrame\"\n    with pytest.raises(TypeError, match=error_text):\n        calculate_feature_matrix(\n            [property_feature],\n            es,\n            instance_ids=range(17),\n            cutoff_time=times,\n        )\n\n    cutoff_times_dup = pd.DataFrame(\n        {\n            \"time\": 
[datetime(2018, 3, 1), datetime(2018, 3, 1)],\n            es[\"log\"].ww.index: [1, 1],\n        },\n    )\n\n    error_text = \"Duplicated rows in cutoff time dataframe.\"\n    with pytest.raises(AssertionError, match=error_text):\n        feature_matrix = calculate_feature_matrix(\n            [property_feature],\n            entityset=es,\n            cutoff_time=cutoff_times_dup,\n        )\n\n    cutoff_reordered = cutoff_time.iloc[[-1, 10, 1]]  # 3 ids not ordered by cutoff time\n    feature_matrix = calculate_feature_matrix(\n        [property_feature],\n        es,\n        cutoff_time=cutoff_reordered,\n        verbose=True,\n    )\n\n    assert all(feature_matrix.index == cutoff_reordered[\"id\"].values)\n\n\ndef test_cfm_compose(es, lt):\n    property_feature = Feature(es[\"log\"].ww[\"value\"]) > 10\n\n    feature_matrix = calculate_feature_matrix(\n        [property_feature],\n        es,\n        cutoff_time=lt,\n        verbose=True,\n    )\n\n    assert (\n        feature_matrix[property_feature.get_name()] == feature_matrix[\"label_func\"]\n    ).values.all()\n\n\ndef test_cfm_compose_approximate(es, lt):\n    property_feature = Feature(es[\"log\"].ww[\"value\"]) > 10\n\n    feature_matrix = calculate_feature_matrix(\n        [property_feature],\n        es,\n        cutoff_time=lt,\n        approximate=\"1s\",\n        verbose=True,\n    )\n    assert type(feature_matrix) == pd.core.frame.DataFrame\n\n    assert (\n        feature_matrix[property_feature.get_name()] == feature_matrix[\"label_func\"]\n    ).values.all()\n\n\ndef test_cfm_approximate_correct_ordering():\n    trips = {\n        \"trip_id\": [i for i in range(1000)],\n        \"flight_time\": [datetime(1998, 4, 2) for i in range(350)]\n        + [datetime(1997, 4, 3) for i in range(650)],\n        \"flight_id\": [randint(1, 25) for i in range(1000)],\n        \"trip_duration\": [randint(1, 999) for i in range(1000)],\n    }\n    df = pd.DataFrame.from_dict(trips)\n    es = 
EntitySet(\"flights\")\n    es.add_dataframe(\n        dataframe_name=\"trips\",\n        dataframe=df,\n        index=\"trip_id\",\n        time_index=\"flight_time\",\n    )\n    es.normalize_dataframe(\n        base_dataframe_name=\"trips\",\n        new_dataframe_name=\"flights\",\n        index=\"flight_id\",\n        make_time_index=True,\n    )\n    features = dfs(entityset=es, target_dataframe_name=\"trips\", features_only=True)\n    flight_features = [\n        feature\n        for feature in features\n        if isinstance(feature, DirectFeature)\n        and isinstance(feature.base_features[0], AggregationFeature)\n    ]\n    property_feature = IdentityFeature(es[\"trips\"].ww[\"trip_id\"])\n\n    cutoff_time = pd.DataFrame.from_dict(\n        {\"instance_id\": df[\"trip_id\"], \"time\": df[\"flight_time\"]},\n    )\n    time_feature = IdentityFeature(es[\"trips\"].ww[\"flight_time\"])\n    feature_matrix = calculate_feature_matrix(\n        flight_features + [property_feature, time_feature],\n        es,\n        cutoff_time_in_index=True,\n        cutoff_time=cutoff_time,\n    )\n    feature_matrix.index.names = [\"instance\", \"time\"]\n    assert np.all(\n        feature_matrix.reset_index(\"time\").reset_index()[[\"instance\", \"time\"]].values\n        == feature_matrix[[\"trip_id\", \"flight_time\"]].values,\n    )\n    feature_matrix_2 = calculate_feature_matrix(\n        flight_features + [property_feature, time_feature],\n        es,\n        cutoff_time=cutoff_time,\n        cutoff_time_in_index=True,\n        approximate=Timedelta(2, \"d\"),\n    )\n    feature_matrix_2.index.names = [\"instance\", \"time\"]\n    assert np.all(\n        feature_matrix_2.reset_index(\"time\").reset_index()[[\"instance\", \"time\"]].values\n        == feature_matrix_2[[\"trip_id\", \"flight_time\"]].values,\n    )\n    for column in feature_matrix:\n        for x, y in zip(feature_matrix[column], feature_matrix_2[column]):\n            assert (pd.isnull(x) and 
pd.isnull(y)) or (x == y)\n\n\ndef test_cfm_no_cutoff_time_index(es):\n    agg_feat = Feature(\n        es[\"log\"].ww[\"id\"],\n        parent_dataframe_name=\"sessions\",\n        primitive=Count,\n    )\n    agg_feat4 = Feature(agg_feat, parent_dataframe_name=\"customers\", primitive=Sum)\n    dfeat = DirectFeature(agg_feat4, \"sessions\")\n    cutoff_time = pd.DataFrame(\n        {\n            \"time\": [datetime(2013, 4, 9, 10, 31, 19), datetime(2013, 4, 9, 11, 0, 0)],\n            \"instance_id\": [0, 2],\n        },\n    )\n    feature_matrix = calculate_feature_matrix(\n        [dfeat, agg_feat],\n        es,\n        cutoff_time_in_index=False,\n        approximate=Timedelta(12, \"s\"),\n        cutoff_time=cutoff_time,\n    )\n    assert feature_matrix.index.name == \"id\"\n    assert feature_matrix.index.tolist() == [0, 2]\n    assert feature_matrix[dfeat.get_name()].tolist() == [10, 10]\n    assert feature_matrix[agg_feat.get_name()].tolist() == [5, 1]\n\n    cutoff_time = pd.DataFrame(\n        {\n            \"time\": [datetime(2011, 4, 9, 10, 31, 19), datetime(2011, 4, 9, 11, 0, 0)],\n            \"instance_id\": [0, 2],\n        },\n    )\n    feature_matrix_2 = calculate_feature_matrix(\n        [dfeat, agg_feat],\n        es,\n        cutoff_time_in_index=False,\n        approximate=Timedelta(10, \"s\"),\n        cutoff_time=cutoff_time,\n    )\n    assert feature_matrix_2.index.name == \"id\"\n    assert feature_matrix_2.index.tolist() == [0, 2]\n    assert feature_matrix_2[dfeat.get_name()].tolist() == [7, 10]\n    assert feature_matrix_2[agg_feat.get_name()].tolist() == [5, 1]\n\n\ndef test_cfm_duplicated_index_in_cutoff_time(es):\n    times = [\n        datetime(2011, 4, 1),\n        datetime(2011, 5, 1),\n        datetime(2011, 4, 1),\n        datetime(2011, 5, 1),\n    ]\n\n    instances = [1, 1, 2, 2]\n    property_feature = Feature(es[\"log\"].ww[\"value\"]) > 10\n    cutoff_time = pd.DataFrame({\"id\": instances, \"time\": times}, 
index=[1, 1, 1, 1])\n\n    feature_matrix = calculate_feature_matrix(\n        [property_feature],\n        es,\n        cutoff_time=cutoff_time,\n        chunk_size=1,\n    )\n    assert feature_matrix.shape[0] == cutoff_time.shape[0]\n\n\ndef test_saveprogress(es, tmp_path):\n    times = list(\n        [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)]\n        + [datetime(2011, 4, 9, 10, 31, i * 9) for i in range(4)]\n        + [datetime(2011, 4, 9, 10, 40, 0)]\n        + [datetime(2011, 4, 10, 10, 40, i) for i in range(2)]\n        + [datetime(2011, 4, 10, 10, 41, i * 3) for i in range(3)]\n        + [datetime(2011, 4, 10, 11, 10, i * 3) for i in range(2)],\n    )\n    cutoff_time = pd.DataFrame({\"time\": times, \"instance_id\": range(17)})\n    property_feature = Feature(es[\"log\"].ww[\"value\"]) > 10\n    save_progress = str(tmp_path)\n    fm_save = calculate_feature_matrix(\n        [property_feature],\n        es,\n        cutoff_time=cutoff_time,\n        save_progress=save_progress,\n    )\n    _, _, files = next(os.walk(save_progress))\n    files = [os.path.join(save_progress, file) for file in files]\n    # there are 17 datetime files created above\n    assert len(files) == 17\n    list_df = []\n    for file_ in files:\n        df = pd.read_csv(file_, index_col=\"id\", header=0)\n        list_df.append(df)\n    merged_df = pd.concat(list_df)\n    merged_df.set_index(pd.DatetimeIndex(times), inplace=True, append=True)\n    fm_no_save = calculate_feature_matrix(\n        [property_feature],\n        es,\n        cutoff_time=cutoff_time,\n    )\n    assert np.all((merged_df.sort_index().values) == (fm_save.sort_index().values))\n    assert np.all((fm_no_save.sort_index().values) == (fm_save.sort_index().values))\n    assert np.all((fm_no_save.sort_index().values) == (merged_df.sort_index().values))\n    shutil.rmtree(save_progress)\n\n\ndef test_cutoff_time_correctly(es):\n    property_feature = Feature(\n        es[\"log\"].ww[\"id\"],\n        
parent_dataframe_name=\"customers\",\n        primitive=Count,\n    )\n    times = [datetime(2011, 4, 10), datetime(2011, 4, 11), datetime(2011, 4, 7)]\n    cutoff_time = pd.DataFrame({\"time\": times, \"instance_id\": [0, 1, 2]})\n    feature_matrix = calculate_feature_matrix(\n        [property_feature],\n        es,\n        cutoff_time=cutoff_time,\n    )\n    labels = [10, 5, 0]\n    assert (feature_matrix[property_feature.get_name()] == labels).values.all()\n\n\ndef test_cutoff_time_binning():\n    cutoff_time = pd.DataFrame(\n        {\n            \"time\": [\n                datetime(2011, 4, 9, 12, 31),\n                datetime(2011, 4, 10, 11),\n                datetime(2011, 4, 10, 13, 10, 1),\n            ],\n            \"instance_id\": [1, 2, 3],\n        },\n    )\n    cutoff_time.ww.init()\n    binned_cutoff_times = bin_cutoff_times(cutoff_time, Timedelta(4, \"h\"))\n    labels = [\n        datetime(2011, 4, 9, 12),\n        datetime(2011, 4, 10, 8),\n        datetime(2011, 4, 10, 12),\n    ]\n    for i in binned_cutoff_times.index:\n        assert binned_cutoff_times[\"time\"][i] == labels[i]\n\n    binned_cutoff_times = bin_cutoff_times(cutoff_time, Timedelta(25, \"h\"))\n    labels = [\n        datetime(2011, 4, 8, 22),\n        datetime(2011, 4, 9, 23),\n        datetime(2011, 4, 9, 23),\n    ]\n    for i in binned_cutoff_times.index:\n        assert binned_cutoff_times[\"time\"][i] == labels[i]\n\n    error_text = \"Unit is relative\"\n    with pytest.raises(ValueError, match=error_text):\n        binned_cutoff_times = bin_cutoff_times(cutoff_time, Timedelta(1, \"mo\"))\n\n\ndef test_cutoff_time_columns_order(es):\n    property_feature = Feature(\n        es[\"log\"].ww[\"id\"],\n        parent_dataframe_name=\"customers\",\n        primitive=Count,\n    )\n    times = [datetime(2011, 4, 10), datetime(2011, 4, 11), datetime(2011, 4, 7)]\n    id_col_names = [\"instance_id\", es[\"customers\"].ww.index]\n    time_col_names = [\"time\", 
es[\"customers\"].ww.time_index]\n    for id_col in id_col_names:\n        for time_col in time_col_names:\n            cutoff_time = pd.DataFrame(\n                {\n                    \"dummy_col_1\": [1, 2, 3],\n                    id_col: [0, 1, 2],\n                    \"dummy_col_2\": [True, False, False],\n                    time_col: times,\n                },\n            )\n            feature_matrix = calculate_feature_matrix(\n                [property_feature],\n                es,\n                cutoff_time=cutoff_time,\n            )\n\n            labels = [10, 5, 0]\n            assert (feature_matrix[property_feature.get_name()] == labels).values.all()\n\n\ndef test_cutoff_time_df_redundant_column_names(es):\n    property_feature = Feature(\n        es[\"log\"].ww[\"id\"],\n        parent_dataframe_name=\"customers\",\n        primitive=Count,\n    )\n    times = [datetime(2011, 4, 10), datetime(2011, 4, 11), datetime(2011, 4, 7)]\n\n    cutoff_time = pd.DataFrame(\n        {\n            es[\"customers\"].ww.index: [0, 1, 2],\n            \"instance_id\": [0, 1, 2],\n            \"dummy_col\": [True, False, False],\n            \"time\": times,\n        },\n    )\n    err_msg = (\n        'Cutoff time DataFrame cannot contain both a column named \"instance_id\" and a column'\n        \" with the same name as the target dataframe index\"\n    )\n    with pytest.raises(AttributeError, match=err_msg):\n        calculate_feature_matrix([property_feature], es, cutoff_time=cutoff_time)\n\n    cutoff_time = pd.DataFrame(\n        {\n            es[\"customers\"].ww.time_index: [0, 1, 2],\n            \"instance_id\": [0, 1, 2],\n            \"dummy_col\": [True, False, False],\n            \"time\": times,\n        },\n    )\n    err_msg = (\n        'Cutoff time DataFrame cannot contain both a column named \"time\" and a column'\n        \" with the same name as the target dataframe time index\"\n    )\n    with pytest.raises(AttributeError, 
match=err_msg):\n        calculate_feature_matrix([property_feature], es, cutoff_time=cutoff_time)\n\n\ndef test_training_window(es):\n    property_feature = Feature(\n        es[\"log\"].ww[\"id\"],\n        parent_dataframe_name=\"customers\",\n        primitive=Count,\n    )\n    top_level_agg = Feature(\n        es[\"customers\"].ww[\"id\"],\n        parent_dataframe_name=\"régions\",\n        primitive=Count,\n    )\n\n    # include a direct feature of a higher-level aggregation\n    # so we have multiple \"filter eids\" in get_pandas_data_slice,\n    # and we go through the loop to pull data with a training_window param more than once\n    dagg = DirectFeature(top_level_agg, \"customers\")\n\n    # for now, warns if last_time_index not present\n    times = [\n        datetime(2011, 4, 9, 12, 31),\n        datetime(2011, 4, 10, 11),\n        datetime(2011, 4, 10, 13, 10),\n    ]\n    cutoff_time = pd.DataFrame({\"time\": times, \"instance_id\": [0, 1, 2]})\n    warn_text = (\n        \"Using training_window but last_time_index is not set for dataframe customers\"\n    )\n    with pytest.warns(UserWarning, match=warn_text):\n        feature_matrix = calculate_feature_matrix(\n            [property_feature, dagg],\n            es,\n            cutoff_time=cutoff_time,\n            training_window=\"2 hours\",\n        )\n\n    es.add_last_time_indexes()\n\n    error_text = \"Training window cannot be in observations\"\n    with pytest.raises(AssertionError, match=error_text):\n        feature_matrix = calculate_feature_matrix(\n            [property_feature],\n            es,\n            cutoff_time=cutoff_time,\n            training_window=Timedelta(2, \"observations\"),\n        )\n\n    # Case1. 
include_cutoff_time = True\n    feature_matrix = calculate_feature_matrix(\n        [property_feature, dagg],\n        es,\n        cutoff_time=cutoff_time,\n        training_window=\"2 hours\",\n        include_cutoff_time=True,\n    )\n    prop_values = [4, 5, 1]\n    dagg_values = [3, 2, 1]\n    assert (feature_matrix[property_feature.get_name()] == prop_values).values.all()\n    assert (feature_matrix[dagg.get_name()] == dagg_values).values.all()\n\n    # Case2. include_cutoff_time = False\n    feature_matrix = calculate_feature_matrix(\n        [property_feature, dagg],\n        es,\n        cutoff_time=cutoff_time,\n        training_window=\"2 hours\",\n        include_cutoff_time=False,\n    )\n    prop_values = [5, 5, 2]\n    dagg_values = [3, 2, 1]\n\n    assert (feature_matrix[property_feature.get_name()] == prop_values).values.all()\n    assert (feature_matrix[dagg.get_name()] == dagg_values).values.all()\n\n    # Case3. include_cutoff_time = False with single cutoff time value\n    feature_matrix = calculate_feature_matrix(\n        [property_feature, dagg],\n        es,\n        cutoff_time=pd.to_datetime(\"2011-04-09 10:40:00\"),\n        training_window=\"9 minutes\",\n        include_cutoff_time=False,\n    )\n    prop_values = [0, 4, 0]\n    dagg_values = [3, 3, 3]\n    assert (feature_matrix[property_feature.get_name()] == prop_values).values.all()\n    assert (feature_matrix[dagg.get_name()] == dagg_values).values.all()\n\n    # Case4. 
include_cutoff_time = True with single cutoff time value\n    feature_matrix = calculate_feature_matrix(\n        [property_feature, dagg],\n        es,\n        cutoff_time=pd.to_datetime(\"2011-04-10 10:40:00\"),\n        training_window=\"2 days\",\n        include_cutoff_time=True,\n    )\n    prop_values = [0, 10, 1]\n    dagg_values = [3, 3, 3]\n    assert (feature_matrix[property_feature.get_name()] == prop_values).values.all()\n    assert (feature_matrix[dagg.get_name()] == dagg_values).values.all()\n\n\ndef test_training_window_overlap(es):\n    es.add_last_time_indexes()\n\n    count_log = Feature(\n        Feature(es[\"log\"].ww[\"id\"]),\n        parent_dataframe_name=\"customers\",\n        primitive=Count,\n    )\n\n    cutoff_time = pd.DataFrame(\n        {\n            \"id\": [0, 0],\n            \"time\": [\"2011-04-09 10:30:00\", \"2011-04-09 10:40:00\"],\n        },\n    ).astype({\"time\": \"datetime64[ns]\"})\n\n    # Case1. include_cutoff_time = True\n    actual = calculate_feature_matrix(\n        features=[count_log],\n        entityset=es,\n        cutoff_time=cutoff_time,\n        cutoff_time_in_index=True,\n        training_window=\"10 minutes\",\n        include_cutoff_time=True,\n    )\n    actual = actual[\"COUNT(log)\"]\n    np.testing.assert_array_equal(actual.values, [1, 9])\n\n    # Case2. 
include_cutoff_time = False\n    actual = calculate_feature_matrix(\n        features=[count_log],\n        entityset=es,\n        cutoff_time=cutoff_time,\n        cutoff_time_in_index=True,\n        training_window=\"10 minutes\",\n        include_cutoff_time=False,\n    )\n    actual = actual[\"COUNT(log)\"]\n    np.testing.assert_array_equal(actual.values, [0, 9])\n\n\ndef test_include_cutoff_time_without_training_window(es):\n    es.add_last_time_indexes()\n\n    count_log = Feature(\n        base=Feature(es[\"log\"].ww[\"id\"]),\n        parent_dataframe_name=\"customers\",\n        primitive=Count,\n    )\n\n    cutoff_time = pd.DataFrame(\n        {\n            \"id\": [0, 0],\n            \"time\": [\"2011-04-09 10:30:00\", \"2011-04-09 10:31:00\"],\n        },\n    ).astype({\"time\": \"datetime64[ns]\"})\n\n    # Case1. include_cutoff_time = True\n    actual = calculate_feature_matrix(\n        features=[count_log],\n        entityset=es,\n        cutoff_time=cutoff_time,\n        cutoff_time_in_index=True,\n        include_cutoff_time=True,\n    )\n    actual = actual[\"COUNT(log)\"]\n    np.testing.assert_array_equal(actual.values, [1, 6])\n\n    # Case2. include_cutoff_time = False\n    actual = calculate_feature_matrix(\n        features=[count_log],\n        entityset=es,\n        cutoff_time=cutoff_time,\n        cutoff_time_in_index=True,\n        include_cutoff_time=False,\n    )\n    actual = actual[\"COUNT(log)\"]\n    np.testing.assert_array_equal(actual.values, [0, 5])\n\n    # Case3. include_cutoff_time = True with single cutoff time value\n    actual = calculate_feature_matrix(\n        features=[count_log],\n        entityset=es,\n        cutoff_time=pd.to_datetime(\"2011-04-09 10:31:00\"),\n        instance_ids=[0],\n        cutoff_time_in_index=True,\n        include_cutoff_time=True,\n    )\n    actual = actual[\"COUNT(log)\"]\n    np.testing.assert_array_equal(actual.values, [6])\n\n    # Case4. 
include_cutoff_time = False with single cutoff time value\n    actual = calculate_feature_matrix(\n        features=[count_log],\n        entityset=es,\n        cutoff_time=pd.to_datetime(\"2011-04-09 10:31:00\"),\n        instance_ids=[0],\n        cutoff_time_in_index=True,\n        include_cutoff_time=False,\n    )\n    actual = actual[\"COUNT(log)\"]\n    np.testing.assert_array_equal(actual.values, [5])\n\n\ndef test_approximate_dfeat_of_agg_on_target_include_cutoff_time(es):\n    agg_feat = Feature(\n        es[\"log\"].ww[\"id\"],\n        parent_dataframe_name=\"sessions\",\n        primitive=Count,\n    )\n    agg_feat2 = Feature(agg_feat, parent_dataframe_name=\"customers\", primitive=Sum)\n    dfeat = DirectFeature(agg_feat2, \"sessions\")\n\n    cutoff_time = pd.DataFrame(\n        {\"time\": [datetime(2011, 4, 9, 10, 31, 19)], \"instance_id\": [0]},\n    )\n    feature_matrix = calculate_feature_matrix(\n        [dfeat, agg_feat2, agg_feat],\n        es,\n        approximate=Timedelta(20, \"s\"),\n        cutoff_time=cutoff_time,\n        include_cutoff_time=False,\n    )\n\n    # binned cutoff_time will be datetime(2011, 4, 9, 10, 31, 0) and\n    # log event 5 at datetime(2011, 4, 9, 10, 31, 0) will be\n    # excluded due to approximate cutoff time point\n    assert feature_matrix[dfeat.get_name()].tolist() == [5]\n    assert feature_matrix[agg_feat.get_name()].tolist() == [5]\n\n    feature_matrix = calculate_feature_matrix(\n        [dfeat, agg_feat],\n        es,\n        approximate=Timedelta(20, \"s\"),\n        cutoff_time=cutoff_time,\n        include_cutoff_time=True,\n    )\n\n    # binned cutoff_time will be datetime(2011, 4, 9, 10, 31, 0) and\n    # log event 5 at datetime(2011, 4, 9, 10, 31, 0) will be\n    # included due to approximate cutoff time point\n    assert feature_matrix[dfeat.get_name()].tolist() == [6]\n    assert feature_matrix[agg_feat.get_name()].tolist() == [5]\n\n\ndef test_training_window_recent_time_index(es):\n    # 
customer with no sessions\n    row = {\n        \"id\": [3],\n        \"age\": [73],\n        \"région_id\": [\"United States\"],\n        \"cohort\": [1],\n        \"cancel_reason\": [\"Lost interest\"],\n        \"loves_ice_cream\": [True],\n        \"favorite_quote\": [\"Don't look back. Something might be gaining on you.\"],\n        \"signup_date\": [datetime(2011, 4, 10)],\n        \"upgrade_date\": [datetime(2011, 4, 12)],\n        \"cancel_date\": [datetime(2011, 5, 13)],\n        \"birthday\": [datetime(1938, 2, 1)],\n        \"engagement_level\": [2],\n    }\n    to_add_df = pd.DataFrame(row)\n    to_add_df.index = range(3, 4)\n\n    # have to convert category to int in order to concat\n    old_df = es[\"customers\"]\n    old_df.index = old_df.index.astype(\"int\")\n    old_df[\"id\"] = old_df[\"id\"].astype(int)\n\n    df = pd.concat([old_df, to_add_df], sort=True)\n\n    # convert back after\n    df.index = df.index.astype(\"category\")\n    df[\"id\"] = df[\"id\"].astype(\"category\")\n\n    es.replace_dataframe(\n        dataframe_name=\"customers\",\n        df=df,\n        recalculate_last_time_indexes=False,\n    )\n    es.add_last_time_indexes()\n\n    property_feature = Feature(\n        es[\"log\"].ww[\"id\"],\n        parent_dataframe_name=\"customers\",\n        primitive=Count,\n    )\n    top_level_agg = Feature(\n        es[\"customers\"].ww[\"id\"],\n        parent_dataframe_name=\"régions\",\n        primitive=Count,\n    )\n    dagg = DirectFeature(top_level_agg, \"customers\")\n    instance_ids = [0, 1, 2, 3]\n    times = [\n        datetime(2011, 4, 9, 12, 31),\n        datetime(2011, 4, 10, 11),\n        datetime(2011, 4, 10, 13, 10, 1),\n        datetime(2011, 4, 10, 1, 59, 59),\n    ]\n    cutoff_time = pd.DataFrame({\"time\": times, \"instance_id\": instance_ids})\n\n    # Case1. 
include_cutoff_time = True\n    feature_matrix = calculate_feature_matrix(\n        [property_feature, dagg],\n        es,\n        cutoff_time=cutoff_time,\n        training_window=\"2 hours\",\n        include_cutoff_time=True,\n    )\n    prop_values = [4, 5, 1, 0]\n    assert (feature_matrix[property_feature.get_name()] == prop_values).values.all()\n\n    dagg_values = [3, 2, 1, 3]\n    feature_matrix.sort_index(inplace=True)\n    assert (feature_matrix[dagg.get_name()] == dagg_values).values.all()\n\n    # Case2. include_cutoff_time = False\n    feature_matrix = calculate_feature_matrix(\n        [property_feature, dagg],\n        es,\n        cutoff_time=cutoff_time,\n        training_window=\"2 hours\",\n        include_cutoff_time=False,\n    )\n    prop_values = [5, 5, 1, 0]\n    assert (feature_matrix[property_feature.get_name()] == prop_values).values.all()\n\n    dagg_values = [3, 2, 1, 3]\n    feature_matrix.sort_index(inplace=True)\n    assert (feature_matrix[dagg.get_name()] == dagg_values).values.all()\n\n\ndef test_approximate_multiple_instances_per_cutoff_time(es):\n    agg_feat = Feature(\n        es[\"log\"].ww[\"id\"],\n        parent_dataframe_name=\"sessions\",\n        primitive=Count,\n    )\n    agg_feat2 = Feature(agg_feat, parent_dataframe_name=\"customers\", primitive=Sum)\n    dfeat = DirectFeature(agg_feat2, \"sessions\")\n    times = [datetime(2011, 4, 9, 10, 31, 19), datetime(2011, 4, 9, 11, 0, 0)]\n    cutoff_time = pd.DataFrame({\"time\": times, \"instance_id\": [0, 2]})\n    feature_matrix = calculate_feature_matrix(\n        [dfeat, agg_feat],\n        es,\n        approximate=Timedelta(1, \"week\"),\n        cutoff_time=cutoff_time,\n    )\n    assert feature_matrix.shape[0] == 2\n    assert feature_matrix[agg_feat.get_name()].tolist() == [5, 1]\n\n\ndef test_approximate_with_multiple_paths(diamond_es):\n    es = diamond_es\n    path = backward_path(es, [\"regions\", \"customers\", \"transactions\"])\n    agg_feat = 
AggregationFeature(\n        Feature(es[\"transactions\"].ww[\"id\"]),\n        parent_dataframe_name=\"regions\",\n        relationship_path=path,\n        primitive=Count,\n    )\n    dfeat = DirectFeature(agg_feat, \"customers\")\n    times = [datetime(2011, 4, 9, 10, 31, 19), datetime(2011, 4, 9, 11, 0, 0)]\n    cutoff_time = pd.DataFrame({\"time\": times, \"instance_id\": [0, 2]})\n    feature_matrix = calculate_feature_matrix(\n        [dfeat],\n        es,\n        approximate=Timedelta(1, \"week\"),\n        cutoff_time=cutoff_time,\n    )\n    assert feature_matrix[dfeat.get_name()].tolist() == [6, 2]\n\n\ndef test_approximate_dfeat_of_agg_on_target(es):\n    agg_feat = Feature(\n        es[\"log\"].ww[\"id\"],\n        parent_dataframe_name=\"sessions\",\n        primitive=Count,\n    )\n    agg_feat2 = Feature(agg_feat, parent_dataframe_name=\"customers\", primitive=Sum)\n    dfeat = DirectFeature(agg_feat2, \"sessions\")\n    times = [datetime(2011, 4, 9, 10, 31, 19), datetime(2011, 4, 9, 11, 0, 0)]\n    cutoff_time = pd.DataFrame({\"time\": times, \"instance_id\": [0, 2]})\n    feature_matrix = calculate_feature_matrix(\n        [dfeat, agg_feat],\n        es,\n        approximate=Timedelta(10, \"s\"),\n        cutoff_time=cutoff_time,\n    )\n    assert feature_matrix[dfeat.get_name()].tolist() == [7, 10]\n    assert feature_matrix[agg_feat.get_name()].tolist() == [5, 1]\n\n\ndef test_approximate_dfeat_of_need_all_values(es):\n    p = Feature(es[\"log\"].ww[\"value\"], primitive=Percentile)\n    agg_feat = Feature(p, parent_dataframe_name=\"sessions\", primitive=Sum)\n    agg_feat2 = Feature(agg_feat, parent_dataframe_name=\"customers\", primitive=Sum)\n    dfeat = DirectFeature(agg_feat2, \"sessions\")\n    times = [datetime(2011, 4, 9, 10, 31, 19), datetime(2011, 4, 9, 11, 0, 0)]\n    cutoff_time = pd.DataFrame({\"time\": times, \"instance_id\": [0, 2]})\n    feature_matrix = calculate_feature_matrix(\n        [dfeat, agg_feat],\n        es,\n       
 approximate=Timedelta(10, \"s\"),\n        cutoff_time_in_index=True,\n        cutoff_time=cutoff_time,\n    )\n    log_df = es[\"log\"]\n    instances = [0, 2]\n    cutoffs = [pd.Timestamp(\"2011-04-09 10:31:19\"), pd.Timestamp(\"2011-04-09 11:00:00\")]\n    approxes = [\n        pd.Timestamp(\"2011-04-09 10:31:10\"),\n        pd.Timestamp(\"2011-04-09 11:00:00\"),\n    ]\n    true_vals = []\n    true_vals_approx = []\n    for instance, cutoff, approx in zip(instances, cutoffs, approxes):\n        log_data_cutoff = log_df[log_df[\"datetime\"] < cutoff]\n        log_data_cutoff[\"percentile\"] = log_data_cutoff[\"value\"].rank(pct=True)\n        true_agg = (\n            log_data_cutoff.loc[log_data_cutoff[\"session_id\"] == instance, \"percentile\"]\n            .fillna(0)\n            .sum()\n        )\n        true_vals.append(round(true_agg, 3))\n\n        log_data_approx = log_df[log_df[\"datetime\"] < approx]\n        log_data_approx[\"percentile\"] = log_data_approx[\"value\"].rank(pct=True)\n        true_agg_approx = (\n            log_data_approx.loc[\n                log_data_approx[\"session_id\"].isin([0, 1, 2]),\n                \"percentile\",\n            ]\n            .fillna(0)\n            .sum()\n        )\n        true_vals_approx.append(round(true_agg_approx, 3))\n    lapprox = [round(x, 3) for x in feature_matrix[dfeat.get_name()].tolist()]\n    test_list = [round(x, 3) for x in feature_matrix[agg_feat.get_name()].tolist()]\n    assert lapprox == true_vals_approx\n    assert test_list == true_vals\n\n\ndef test_uses_full_dataframe_feat_of_approximate(es):\n    agg_feat = Feature(\n        es[\"log\"].ww[\"value\"],\n        parent_dataframe_name=\"sessions\",\n        primitive=Sum,\n    )\n    agg_feat2 = Feature(agg_feat, parent_dataframe_name=\"customers\", primitive=Sum)\n    agg_feat3 = Feature(agg_feat, parent_dataframe_name=\"customers\", primitive=Max)\n    dfeat = DirectFeature(agg_feat2, \"sessions\")\n    dfeat2 = 
DirectFeature(agg_feat3, \"sessions\")\n    p = Feature(dfeat, primitive=Percentile)\n    times = [datetime(2011, 4, 9, 10, 31, 19), datetime(2011, 4, 9, 11, 0, 0)]\n    cutoff_time = pd.DataFrame({\"time\": times, \"instance_id\": [0, 2]})\n    # only dfeat2 should be approximated\n    # because Percentile needs all values\n\n    feature_matrix_only_dfeat2 = calculate_feature_matrix(\n        [dfeat2],\n        es,\n        approximate=Timedelta(10, \"s\"),\n        cutoff_time_in_index=True,\n        cutoff_time=cutoff_time,\n    )\n    assert feature_matrix_only_dfeat2[dfeat2.get_name()].tolist() == [50, 50]\n\n    feature_matrix_approx = calculate_feature_matrix(\n        [p, dfeat, dfeat2, agg_feat],\n        es,\n        approximate=Timedelta(10, \"s\"),\n        cutoff_time_in_index=True,\n        cutoff_time=cutoff_time,\n    )\n    assert (\n        feature_matrix_only_dfeat2[dfeat2.get_name()].tolist()\n        == feature_matrix_approx[dfeat2.get_name()].tolist()\n    )\n\n    feature_matrix_small_approx = calculate_feature_matrix(\n        [p, dfeat, dfeat2, agg_feat],\n        es,\n        approximate=Timedelta(10, \"ms\"),\n        cutoff_time_in_index=True,\n        cutoff_time=cutoff_time,\n    )\n\n    feature_matrix_no_approx = calculate_feature_matrix(\n        [p, dfeat, dfeat2, agg_feat],\n        es,\n        cutoff_time_in_index=True,\n        cutoff_time=cutoff_time,\n    )\n    for f in [p, dfeat, agg_feat]:\n        for fm1, fm2 in combinations(\n            [\n                feature_matrix_approx,\n                feature_matrix_small_approx,\n                feature_matrix_no_approx,\n            ],\n            2,\n        ):\n            assert fm1[f.get_name()].tolist() == fm2[f.get_name()].tolist()\n\n\ndef test_approximate_dfeat_of_dfeat_of_agg_on_target(es):\n    agg_feat = Feature(\n        es[\"log\"].ww[\"id\"],\n        parent_dataframe_name=\"sessions\",\n        primitive=Count,\n    )\n    agg_feat2 = Feature(agg_feat, 
parent_dataframe_name=\"customers\", primitive=Sum)\n    dfeat = DirectFeature(Feature(agg_feat2, \"sessions\"), \"log\")\n    times = [datetime(2011, 4, 9, 10, 31, 19), datetime(2011, 4, 9, 11, 0, 0)]\n    cutoff_time = pd.DataFrame({\"time\": times, \"instance_id\": [0, 2]})\n    feature_matrix = calculate_feature_matrix(\n        [dfeat],\n        es,\n        approximate=Timedelta(10, \"s\"),\n        cutoff_time=cutoff_time,\n    )\n    assert feature_matrix[dfeat.get_name()].tolist() == [7, 10]\n\n\ndef test_empty_path_approximate_full(es):\n    es[\"sessions\"].ww[\"customer_id\"] = pd.Series(\n        [np.nan, np.nan, np.nan, 1, 1, 2],\n        dtype=\"category\",\n    )\n    # Need to reassign the `foreign_key` tag as the column reassignment above removes it\n    es[\"sessions\"].ww.set_types(semantic_tags={\"customer_id\": \"foreign_key\"})\n    agg_feat = Feature(\n        es[\"log\"].ww[\"id\"],\n        parent_dataframe_name=\"sessions\",\n        primitive=Count,\n    )\n    agg_feat2 = Feature(agg_feat, parent_dataframe_name=\"customers\", primitive=Sum)\n    dfeat = DirectFeature(agg_feat2, \"sessions\")\n    times = [datetime(2011, 4, 9, 10, 31, 19), datetime(2011, 4, 9, 11, 0, 0)]\n    cutoff_time = pd.DataFrame({\"time\": times, \"instance_id\": [0, 2]})\n    feature_matrix = calculate_feature_matrix(\n        [dfeat, agg_feat],\n        es,\n        approximate=Timedelta(10, \"s\"),\n        cutoff_time=cutoff_time,\n    )\n    vals1 = feature_matrix[dfeat.get_name()].tolist()\n\n    assert vals1[0] == 0\n    assert vals1[1] == 0\n    assert feature_matrix[agg_feat.get_name()].tolist() == [5, 1]\n\n\ndef test_approx_base_feature_is_also_first_class_feature(es):\n    log_to_products = DirectFeature(Feature(es[\"products\"].ww[\"rating\"]), \"log\")\n    # This should still be computed properly\n    agg_feat = Feature(log_to_products, parent_dataframe_name=\"sessions\", primitive=Min)\n    customer_agg_feat = Feature(\n        agg_feat,\n        
parent_dataframe_name=\"customers\",\n        primitive=Sum,\n    )\n    # This is to be approximated\n    sess_to_cust = DirectFeature(customer_agg_feat, \"sessions\")\n    times = [datetime(2011, 4, 9, 10, 31, 19), datetime(2011, 4, 9, 11, 0, 0)]\n    cutoff_time = pd.DataFrame({\"time\": times, \"instance_id\": [0, 2]})\n    feature_matrix = calculate_feature_matrix(\n        [sess_to_cust, agg_feat],\n        es,\n        approximate=Timedelta(10, \"s\"),\n        cutoff_time=cutoff_time,\n    )\n\n    vals1 = feature_matrix[sess_to_cust.get_name()].tolist()\n    assert vals1 == [8.5, 7]\n    vals2 = feature_matrix[agg_feat.get_name()].tolist()\n    assert vals2 == [4, 1.5]\n\n\ndef test_approximate_time_split_returns_the_same_result(es):\n    agg_feat = Feature(\n        es[\"log\"].ww[\"id\"],\n        parent_dataframe_name=\"sessions\",\n        primitive=Count,\n    )\n    agg_feat2 = Feature(agg_feat, parent_dataframe_name=\"customers\", primitive=Sum)\n    dfeat = DirectFeature(agg_feat2, \"sessions\")\n\n    cutoff_df = pd.DataFrame(\n        {\n            \"time\": [\n                pd.Timestamp(\"2011-04-09 10:07:30\"),\n                pd.Timestamp(\"2011-04-09 10:07:40\"),\n            ],\n            \"instance_id\": [0, 0],\n        },\n    )\n\n    feature_matrix_at_once = calculate_feature_matrix(\n        [dfeat, agg_feat],\n        es,\n        approximate=Timedelta(10, \"s\"),\n        cutoff_time=cutoff_df,\n    )\n    divided_matrices = []\n    separate_cutoff = [cutoff_df.iloc[0:1], cutoff_df.iloc[1:]]\n    # Make sure indexes are different\n    # Note that this step is unnecessary and done to showcase the issue here\n    separate_cutoff[0].index = [0]\n    separate_cutoff[1].index = [1]\n    for ct in separate_cutoff:\n        fm = calculate_feature_matrix(\n            [dfeat, agg_feat],\n            es,\n            approximate=Timedelta(10, \"s\"),\n            cutoff_time=ct,\n        )\n        divided_matrices.append(fm)\n    
feature_matrix_from_split = pd.concat(divided_matrices)\n    assert feature_matrix_from_split.shape == feature_matrix_at_once.shape\n    for i1, i2 in zip(feature_matrix_at_once.index, feature_matrix_from_split.index):\n        assert (pd.isnull(i1) and pd.isnull(i2)) or (i1 == i2)\n    for c in feature_matrix_from_split:\n        for i1, i2 in zip(feature_matrix_at_once[c], feature_matrix_from_split[c]):\n            assert (pd.isnull(i1) and pd.isnull(i2)) or (i1 == i2)\n\n\ndef test_approximate_returns_correct_empty_default_values(es):\n    agg_feat = Feature(\n        es[\"log\"].ww[\"id\"],\n        parent_dataframe_name=\"customers\",\n        primitive=Count,\n    )\n    dfeat = DirectFeature(agg_feat, \"sessions\")\n\n    cutoff_df = pd.DataFrame(\n        {\n            \"time\": [\n                pd.Timestamp(\"2011-04-08 11:00:00\"),\n                pd.Timestamp(\"2011-04-09 11:00:00\"),\n            ],\n            \"instance_id\": [0, 0],\n        },\n    )\n\n    fm = calculate_feature_matrix(\n        [dfeat],\n        es,\n        approximate=Timedelta(10, \"s\"),\n        cutoff_time=cutoff_df,\n    )\n    assert fm[dfeat.get_name()].tolist() == [0, 10]\n\n\ndef test_approximate_child_aggs_handled_correctly(es):\n    agg_feat = Feature(\n        es[\"customers\"].ww[\"id\"],\n        parent_dataframe_name=\"régions\",\n        primitive=Count,\n    )\n    dfeat = DirectFeature(agg_feat, \"customers\")\n    agg_feat_2 = Feature(\n        es[\"log\"].ww[\"value\"],\n        parent_dataframe_name=\"customers\",\n        primitive=Sum,\n    )\n    cutoff_df = pd.DataFrame(\n        {\n            \"time\": [\n                pd.Timestamp(\"2011-04-08 10:30:00\"),\n                pd.Timestamp(\"2011-04-09 10:30:06\"),\n            ],\n            \"instance_id\": [0, 0],\n        },\n    )\n\n    fm = calculate_feature_matrix(\n        [dfeat],\n        es,\n        approximate=Timedelta(10, \"s\"),\n        cutoff_time=cutoff_df,\n    )\n    fm_2 = 
calculate_feature_matrix(\n        [dfeat, agg_feat_2],\n        es,\n        approximate=Timedelta(10, \"s\"),\n        cutoff_time=cutoff_df,\n    )\n    assert fm[dfeat.get_name()].tolist() == [2, 3]\n    assert fm_2[agg_feat_2.get_name()].tolist() == [0, 5]\n\n\ndef test_cutoff_time_naming(es):\n    agg_feat = Feature(\n        es[\"customers\"].ww[\"id\"],\n        parent_dataframe_name=\"régions\",\n        primitive=Count,\n    )\n    dfeat = DirectFeature(agg_feat, \"customers\")\n    cutoff_df = pd.DataFrame(\n        {\n            \"time\": [\n                pd.Timestamp(\"2011-04-08 10:30:00\"),\n                pd.Timestamp(\"2011-04-09 10:30:06\"),\n            ],\n            \"instance_id\": [0, 0],\n        },\n    )\n    cutoff_df_index_name = cutoff_df.rename(columns={\"instance_id\": \"id\"})\n    cutoff_df_wrong_index_name = cutoff_df.rename(columns={\"instance_id\": \"wrong_id\"})\n    cutoff_df_wrong_time_name = cutoff_df.rename(columns={\"time\": \"cutoff_time\"})\n\n    fm1 = calculate_feature_matrix([dfeat], es, cutoff_time=cutoff_df)\n    fm2 = calculate_feature_matrix([dfeat], es, cutoff_time=cutoff_df_index_name)\n    assert all((fm1 == fm2.values).values)\n\n    error_text = (\n        \"Cutoff time DataFrame must contain a column with either the same name\"\n        ' as the target dataframe index or a column named \"instance_id\"'\n    )\n    with pytest.raises(AttributeError, match=error_text):\n        calculate_feature_matrix([dfeat], es, cutoff_time=cutoff_df_wrong_index_name)\n\n    time_error_text = (\n        \"Cutoff time DataFrame must contain a column with either the same name\"\n        ' as the target dataframe time_index or a column named \"time\"'\n    )\n    with pytest.raises(AttributeError, match=time_error_text):\n        calculate_feature_matrix([dfeat], es, cutoff_time=cutoff_df_wrong_time_name)\n\n\ndef test_cutoff_time_extra_columns(es):\n    agg_feat = Feature(\n        es[\"customers\"].ww[\"id\"],\n        
parent_dataframe_name=\"régions\",\n        primitive=Count,\n    )\n    dfeat = DirectFeature(agg_feat, \"customers\")\n\n    cutoff_df = pd.DataFrame(\n        {\n            \"time\": [\n                pd.Timestamp(\"2011-04-09 10:30:06\"),\n                pd.Timestamp(\"2011-04-09 10:30:03\"),\n                pd.Timestamp(\"2011-04-08 10:30:00\"),\n            ],\n            \"instance_id\": [0, 1, 0],\n            \"label\": [True, True, False],\n        },\n        columns=[\"time\", \"instance_id\", \"label\"],\n    )\n    fm = calculate_feature_matrix([dfeat], es, cutoff_time=cutoff_df)\n    # check column was added to end of matrix\n    assert \"label\" == fm.columns[-1]\n\n    assert (fm[\"label\"].values == cutoff_df[\"label\"].values).all()\n\n\ndef test_cutoff_time_extra_columns_approximate(es):\n    agg_feat = Feature(\n        es[\"customers\"].ww[\"id\"],\n        parent_dataframe_name=\"régions\",\n        primitive=Count,\n    )\n    dfeat = DirectFeature(agg_feat, \"customers\")\n\n    cutoff_df = pd.DataFrame(\n        {\n            \"time\": [\n                pd.Timestamp(\"2011-04-09 10:30:06\"),\n                pd.Timestamp(\"2011-04-09 10:30:03\"),\n                pd.Timestamp(\"2011-04-08 10:30:00\"),\n            ],\n            \"instance_id\": [0, 1, 0],\n            \"label\": [True, True, False],\n        },\n        columns=[\"time\", \"instance_id\", \"label\"],\n    )\n    fm = calculate_feature_matrix(\n        [dfeat],\n        es,\n        cutoff_time=cutoff_df,\n        approximate=\"2 days\",\n    )\n    # check column was added to end of matrix\n    assert \"label\" in fm.columns\n\n    assert (fm[\"label\"].values == cutoff_df[\"label\"].values).all()\n\n\ndef test_cutoff_time_extra_columns_same_name(es):\n    agg_feat = Feature(\n        es[\"customers\"].ww[\"id\"],\n        parent_dataframe_name=\"régions\",\n        primitive=Count,\n    )\n    dfeat = DirectFeature(agg_feat, \"customers\")\n\n    cutoff_df = 
pd.DataFrame(\n        {\n            \"time\": [\n                pd.Timestamp(\"2011-04-09 10:30:06\"),\n                pd.Timestamp(\"2011-04-09 10:30:03\"),\n                pd.Timestamp(\"2011-04-08 10:30:00\"),\n            ],\n            \"instance_id\": [0, 1, 0],\n            \"régions.COUNT(customers)\": [False, False, True],\n        },\n        columns=[\"time\", \"instance_id\", \"régions.COUNT(customers)\"],\n    )\n    fm = calculate_feature_matrix([dfeat], es, cutoff_time=cutoff_df)\n\n    assert (\n        fm[\"régions.COUNT(customers)\"].values\n        == cutoff_df[\"régions.COUNT(customers)\"].values\n    ).all()\n\n\ndef test_cutoff_time_extra_columns_same_name_approximate(es):\n    agg_feat = Feature(\n        es[\"customers\"].ww[\"id\"],\n        parent_dataframe_name=\"régions\",\n        primitive=Count,\n    )\n    dfeat = DirectFeature(agg_feat, \"customers\")\n\n    cutoff_df = pd.DataFrame(\n        {\n            \"time\": [\n                pd.Timestamp(\"2011-04-09 10:30:06\"),\n                pd.Timestamp(\"2011-04-09 10:30:03\"),\n                pd.Timestamp(\"2011-04-08 10:30:00\"),\n            ],\n            \"instance_id\": [0, 1, 0],\n            \"régions.COUNT(customers)\": [False, False, True],\n        },\n        columns=[\"time\", \"instance_id\", \"régions.COUNT(customers)\"],\n    )\n    fm = calculate_feature_matrix(\n        [dfeat],\n        es,\n        cutoff_time=cutoff_df,\n        approximate=\"2 days\",\n    )\n\n    assert (\n        fm[\"régions.COUNT(customers)\"].values\n        == cutoff_df[\"régions.COUNT(customers)\"].values\n    ).all()\n\n\ndef test_instances_after_cutoff_time_removed(es):\n    property_feature = Feature(\n        es[\"log\"].ww[\"id\"],\n        parent_dataframe_name=\"customers\",\n        primitive=Count,\n    )\n    cutoff_time = datetime(2011, 4, 8)\n    fm = calculate_feature_matrix(\n        [property_feature],\n        es,\n        cutoff_time=cutoff_time,\n        
cutoff_time_in_index=True,\n    )\n    actual_ids = (\n        [id for (id, _) in fm.index]\n        if isinstance(fm.index, pd.MultiIndex)\n        else fm.index\n    )\n\n    # Customer with id 1 should be removed\n    assert set(actual_ids) == set([2, 0])\n\n\ndef test_instances_with_id_kept_after_cutoff(es):\n    property_feature = Feature(\n        es[\"log\"].ww[\"id\"],\n        parent_dataframe_name=\"customers\",\n        primitive=Count,\n    )\n    cutoff_time = datetime(2011, 4, 8)\n    fm = calculate_feature_matrix(\n        [property_feature],\n        es,\n        instance_ids=[0, 1, 2],\n        cutoff_time=cutoff_time,\n        cutoff_time_in_index=True,\n    )\n\n    # Customer #1 is after cutoff, but since it is included in instance_ids it\n    # should be kept.\n    actual_ids = (\n        [id for (id, _) in fm.index]\n        if isinstance(fm.index, pd.MultiIndex)\n        else fm.index\n    )\n    assert set(actual_ids) == set([0, 1, 2])\n\n\ndef test_cfm_returns_original_time_indexes(es):\n    agg_feat = Feature(\n        es[\"customers\"].ww[\"id\"],\n        parent_dataframe_name=\"régions\",\n        primitive=Count,\n    )\n    dfeat = DirectFeature(agg_feat, \"customers\")\n    cutoff_df = pd.DataFrame(\n        {\n            \"time\": [\n                pd.Timestamp(\"2011-04-09 10:30:06\"),\n                pd.Timestamp(\"2011-04-09 10:30:03\"),\n                pd.Timestamp(\"2011-04-08 10:30:00\"),\n            ],\n            \"instance_id\": [0, 1, 0],\n        },\n    )\n\n    fm = calculate_feature_matrix(\n        [dfeat],\n        es,\n        cutoff_time=cutoff_df,\n        cutoff_time_in_index=True,\n    )\n\n    instance_level_vals = fm.index.get_level_values(0).values\n    time_level_vals = fm.index.get_level_values(1).values\n\n    assert (instance_level_vals == cutoff_df[\"instance_id\"].values).all()\n    assert (time_level_vals == cutoff_df[\"time\"].values).all()\n\n\ndef 
test_cfm_returns_original_time_indexes_approximate(es):\n    agg_feat = Feature(\n        es[\"customers\"].ww[\"id\"],\n        parent_dataframe_name=\"régions\",\n        primitive=Count,\n    )\n    dfeat = DirectFeature(agg_feat, \"customers\")\n    agg_feat_2 = Feature(\n        es[\"sessions\"].ww[\"id\"],\n        parent_dataframe_name=\"customers\",\n        primitive=Count,\n    )\n    cutoff_df = pd.DataFrame(\n        {\n            \"time\": [\n                pd.Timestamp(\"2011-04-09 10:30:06\"),\n                pd.Timestamp(\"2011-04-09 10:30:03\"),\n                pd.Timestamp(\"2011-04-08 10:30:00\"),\n            ],\n            \"instance_id\": [0, 1, 0],\n        },\n    )\n    # approximate, in different windows, no unapproximated aggs\n    fm = calculate_feature_matrix(\n        [dfeat],\n        es,\n        cutoff_time=cutoff_df,\n        cutoff_time_in_index=True,\n        approximate=\"1 m\",\n    )\n    instance_level_vals = fm.index.get_level_values(0).values\n    time_level_vals = fm.index.get_level_values(1).values\n    assert (instance_level_vals == cutoff_df[\"instance_id\"].values).all()\n    assert (time_level_vals == cutoff_df[\"time\"].values).all()\n\n    # approximate, in different windows, unapproximated aggs\n    fm = calculate_feature_matrix(\n        [dfeat, agg_feat_2],\n        es,\n        cutoff_time=cutoff_df,\n        cutoff_time_in_index=True,\n        approximate=\"1 m\",\n    )\n    instance_level_vals = fm.index.get_level_values(0).values\n    time_level_vals = fm.index.get_level_values(1).values\n    assert (instance_level_vals == cutoff_df[\"instance_id\"].values).all()\n    assert (time_level_vals == cutoff_df[\"time\"].values).all()\n\n    # approximate, in same window, no unapproximated aggs\n    fm2 = calculate_feature_matrix(\n        [dfeat],\n        es,\n        cutoff_time=cutoff_df,\n        cutoff_time_in_index=True,\n        approximate=\"2 d\",\n    )\n    instance_level_vals = 
fm2.index.get_level_values(0).values\n    time_level_vals = fm2.index.get_level_values(1).values\n    assert (instance_level_vals == cutoff_df[\"instance_id\"].values).all()\n    assert (time_level_vals == cutoff_df[\"time\"].values).all()\n\n    # approximate, in same window, unapproximated aggs\n    fm3 = calculate_feature_matrix(\n        [dfeat, agg_feat_2],\n        es,\n        cutoff_time=cutoff_df,\n        cutoff_time_in_index=True,\n        approximate=\"2 d\",\n    )\n    instance_level_vals = fm3.index.get_level_values(0).values\n    time_level_vals = fm3.index.get_level_values(1).values\n    assert (instance_level_vals == cutoff_df[\"instance_id\"].values).all()\n    assert (time_level_vals == cutoff_df[\"time\"].values).all()\n\n\ndef test_dask_kwargs(es, dask_cluster):\n    times = (\n        [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)]\n        + [datetime(2011, 4, 9, 10, 31, i * 9) for i in range(4)]\n        + [datetime(2011, 4, 9, 10, 40, 0)]\n        + [datetime(2011, 4, 10, 10, 40, i) for i in range(2)]\n        + [datetime(2011, 4, 10, 10, 41, i * 3) for i in range(3)]\n        + [datetime(2011, 4, 10, 11, 10, i * 3) for i in range(2)]\n    )\n    labels = [False] * 3 + [True] * 2 + [False] * 9 + [True] + [False] * 2\n    cutoff_time = pd.DataFrame({\"time\": times, \"instance_id\": range(17)})\n    property_feature = IdentityFeature(es[\"log\"].ww[\"value\"]) > 10\n\n    dkwargs = {\"cluster\": dask_cluster.scheduler.address}\n    feature_matrix = calculate_feature_matrix(\n        [property_feature],\n        entityset=es,\n        cutoff_time=cutoff_time,\n        verbose=True,\n        chunk_size=0.13,\n        dask_kwargs=dkwargs,\n        approximate=\"1 hour\",\n    )\n\n    assert (feature_matrix[property_feature.get_name()] == labels).values.all()\n\n\ndef test_dask_persisted_es(es, capsys, dask_cluster):\n    times = (\n        [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)]\n        + [datetime(2011, 4, 9, 10, 31, 
i * 9) for i in range(4)]\n        + [datetime(2011, 4, 9, 10, 40, 0)]\n        + [datetime(2011, 4, 10, 10, 40, i) for i in range(2)]\n        + [datetime(2011, 4, 10, 10, 41, i * 3) for i in range(3)]\n        + [datetime(2011, 4, 10, 11, 10, i * 3) for i in range(2)]\n    )\n    labels = [False] * 3 + [True] * 2 + [False] * 9 + [True] + [False] * 2\n    cutoff_time = pd.DataFrame({\"time\": times, \"instance_id\": range(17)})\n    property_feature = IdentityFeature(es[\"log\"].ww[\"value\"]) > 10\n\n    dkwargs = {\"cluster\": dask_cluster.scheduler.address}\n    feature_matrix = calculate_feature_matrix(\n        [property_feature],\n        entityset=es,\n        cutoff_time=cutoff_time,\n        verbose=True,\n        chunk_size=0.13,\n        dask_kwargs=dkwargs,\n        approximate=\"1 hour\",\n    )\n    assert (feature_matrix[property_feature.get_name()] == labels).values.all()\n    feature_matrix = calculate_feature_matrix(\n        [property_feature],\n        entityset=es,\n        cutoff_time=cutoff_time,\n        verbose=True,\n        chunk_size=0.13,\n        dask_kwargs=dkwargs,\n        approximate=\"1 hour\",\n    )\n    captured = capsys.readouterr()\n    assert \"Using EntitySet persisted on the cluster as dataset \" in captured[0]\n    assert (feature_matrix[property_feature.get_name()] == labels).values.all()\n\n\nclass TestCreateClientAndCluster(object):\n    def test_user_cluster_as_string(self, monkeypatch):\n        monkeypatch.setattr(utils, \"get_client_cluster\", get_mock_client_cluster)\n        # cluster in dask_kwargs case\n        client, cluster = create_client_and_cluster(\n            n_jobs=2,\n            dask_kwargs={\"cluster\": \"tcp://127.0.0.1:54321\"},\n            entityset_size=1,\n        )\n        assert cluster == \"tcp://127.0.0.1:54321\"\n\n    def test_cluster_creation(self, monkeypatch):\n        total_memory = psutil.virtual_memory().total\n        monkeypatch.setattr(utils, \"get_client_cluster\", 
get_mock_client_cluster)\n        try:\n            cpus = len(psutil.Process().cpu_affinity())\n        except AttributeError:  # pragma: no cover\n            cpus = psutil.cpu_count()\n\n        # jobs < tasks case\n        client, cluster = create_client_and_cluster(\n            n_jobs=2,\n            dask_kwargs={},\n            entityset_size=1,\n        )\n        num_workers = min(cpus, 2)\n        memory_limit = int(total_memory / float(num_workers))\n        assert cluster == (min(cpus, 2), 1, None, memory_limit)\n        # jobs > tasks case\n        match = r\".*workers requested, but only .* workers created\"\n        with pytest.warns(UserWarning, match=match) as record:\n            client, cluster = create_client_and_cluster(\n                n_jobs=1000,\n                dask_kwargs={\"diagnostics_port\": 8789},\n                entityset_size=1,\n            )\n        assert len(record) == 1\n\n        num_workers = cpus\n        memory_limit = int(total_memory / float(num_workers))\n        assert cluster == (num_workers, 1, 8789, memory_limit)\n\n        # dask_kwargs sets memory limit\n        client, cluster = create_client_and_cluster(\n            n_jobs=2,\n            dask_kwargs={\"diagnostics_port\": 8789, \"memory_limit\": 1000},\n            entityset_size=1,\n        )\n        num_workers = min(cpus, 2)\n        assert cluster == (num_workers, 1, 8789, 1000)\n\n    def test_not_enough_memory(self, monkeypatch):\n        total_memory = psutil.virtual_memory().total\n        monkeypatch.setattr(utils, \"get_client_cluster\", get_mock_client_cluster)\n        # errors if not enough memory for each worker to store the entityset\n        with pytest.raises(ValueError, match=\"\"):\n            create_client_and_cluster(\n                n_jobs=1,\n                dask_kwargs={},\n                entityset_size=total_memory * 2,\n            )\n\n        # does not error even if worker memory is less than 2x entityset size\n        
create_client_and_cluster(\n            n_jobs=1,\n            dask_kwargs={},\n            entityset_size=total_memory * 0.75,\n        )\n\n\ndef test_parallel_failure_raises_correct_error(es):\n    times = (\n        [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)]\n        + [datetime(2011, 4, 9, 10, 31, i * 9) for i in range(4)]\n        + [datetime(2011, 4, 9, 10, 40, 0)]\n        + [datetime(2011, 4, 10, 10, 40, i) for i in range(2)]\n        + [datetime(2011, 4, 10, 10, 41, i * 3) for i in range(3)]\n        + [datetime(2011, 4, 10, 11, 10, i * 3) for i in range(2)]\n    )\n    cutoff_time = pd.DataFrame({\"time\": times, \"instance_id\": range(17)})\n    property_feature = IdentityFeature(es[\"log\"].ww[\"value\"]) > 10\n\n    error_text = \"Need at least one worker\"\n    with pytest.raises(AssertionError, match=error_text):\n        calculate_feature_matrix(\n            [property_feature],\n            entityset=es,\n            cutoff_time=cutoff_time,\n            verbose=True,\n            chunk_size=0.13,\n            n_jobs=0,\n            approximate=\"1 hour\",\n        )\n\n\ndef test_warning_not_enough_chunks(\n    es,\n    capsys,\n    three_worker_dask_cluster,\n):  # pragma: no cover\n    property_feature = IdentityFeature(es[\"log\"].ww[\"value\"]) > 10\n\n    dkwargs = {\"cluster\": three_worker_dask_cluster.scheduler.address}\n    calculate_feature_matrix(\n        [property_feature],\n        entityset=es,\n        chunk_size=0.5,\n        verbose=True,\n        dask_kwargs=dkwargs,\n    )\n\n    captured = capsys.readouterr()\n    pattern = r\"Fewer chunks \\([0-9]+\\), than workers \\([0-9]+\\) consider reducing the chunk size\"\n    assert re.search(pattern, captured.out) is not None\n\n\ndef test_n_jobs():\n    try:\n        cpus = len(psutil.Process().cpu_affinity())\n    except AttributeError:  # pragma: no cover\n        cpus = psutil.cpu_count()\n\n    assert n_jobs_to_workers(1) == 1\n    assert n_jobs_to_workers(-1) == 
cpus\n    assert n_jobs_to_workers(cpus) == cpus\n    assert n_jobs_to_workers((cpus + 1) * -1) == 1\n    if cpus > 1:\n        assert n_jobs_to_workers(-2) == cpus - 1\n\n    error_text = \"Need at least one worker\"\n    with pytest.raises(AssertionError, match=error_text):\n        n_jobs_to_workers(0)\n\n\ndef test_parallel_cutoff_time_column_pass_through(es, dask_cluster):\n    times = (\n        [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)]\n        + [datetime(2011, 4, 9, 10, 31, i * 9) for i in range(4)]\n        + [datetime(2011, 4, 9, 10, 40, 0)]\n        + [datetime(2011, 4, 10, 10, 40, i) for i in range(2)]\n        + [datetime(2011, 4, 10, 10, 41, i * 3) for i in range(3)]\n        + [datetime(2011, 4, 10, 11, 10, i * 3) for i in range(2)]\n    )\n    labels = [False] * 3 + [True] * 2 + [False] * 9 + [True] + [False] * 2\n    cutoff_time = pd.DataFrame(\n        {\"time\": times, \"instance_id\": range(17), \"labels\": labels},\n    )\n    property_feature = IdentityFeature(es[\"log\"].ww[\"value\"]) > 10\n\n    dkwargs = {\"cluster\": dask_cluster.scheduler.address}\n    feature_matrix = calculate_feature_matrix(\n        [property_feature],\n        entityset=es,\n        cutoff_time=cutoff_time,\n        verbose=True,\n        dask_kwargs=dkwargs,\n        approximate=\"1 hour\",\n    )\n\n    assert (\n        feature_matrix[property_feature.get_name()] == feature_matrix[\"labels\"]\n    ).values.all()\n\n\ndef test_integer_time_index(int_es):\n    times = list(range(8, 18)) + list(range(19, 26))\n    labels = [False] * 3 + [True] * 2 + [False] * 9 + [True] + [False] * 2\n    cutoff_df = pd.DataFrame({\"time\": times, \"instance_id\": range(17)})\n    property_feature = IdentityFeature(int_es[\"log\"].ww[\"value\"]) > 10\n\n    feature_matrix = calculate_feature_matrix(\n        [property_feature],\n        int_es,\n        cutoff_time=cutoff_df,\n        cutoff_time_in_index=True,\n    )\n\n    time_level_vals = 
feature_matrix.index.get_level_values(1).values\n    sorted_df = cutoff_df.sort_values([\"time\", \"instance_id\"], kind=\"mergesort\")\n    assert (time_level_vals == sorted_df[\"time\"].values).all()\n    assert (feature_matrix[property_feature.get_name()] == labels).values.all()\n\n\ndef test_integer_time_index_single_cutoff_value(int_es):\n    labels = [False] * 3 + [True] * 2 + [False] * 4\n    property_feature = IdentityFeature(int_es[\"log\"].ww[\"value\"]) > 10\n\n    cutoff_times = [16, pd.Series([16])[0], 16.0, pd.Series([16.0])[0]]\n    for cutoff_time in cutoff_times:\n        feature_matrix = calculate_feature_matrix(\n            [property_feature],\n            int_es,\n            cutoff_time=cutoff_time,\n            cutoff_time_in_index=True,\n        )\n        time_level_vals = feature_matrix.index.get_level_values(1).values\n        assert (time_level_vals == [16] * 9).all()\n        assert (feature_matrix[property_feature.get_name()] == labels).values.all()\n\n\ndef test_integer_time_index_datetime_cutoffs(int_es):\n    times = [datetime.now()] * 17\n    cutoff_df = pd.DataFrame({\"time\": times, \"instance_id\": range(17)})\n    property_feature = IdentityFeature(int_es[\"log\"].ww[\"value\"]) > 10\n\n    error_text = (\n        \"cutoff_time times must be numeric: try casting via pd\\\\.to_numeric\\\\(\\\\)\"\n    )\n    with pytest.raises(TypeError, match=error_text):\n        calculate_feature_matrix(\n            [property_feature],\n            int_es,\n            cutoff_time=cutoff_df,\n            cutoff_time_in_index=True,\n        )\n\n\ndef test_integer_time_index_passes_extra_columns(int_es):\n    times = list(range(8, 18)) + list(range(19, 23)) + [25, 24, 23]\n    labels = [False] * 3 + [True] * 2 + [False] * 9 + [False] * 2 + [True]\n    instances = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 16, 15, 14]\n    cutoff_df = pd.DataFrame(\n        {\"time\": times, \"instance_id\": instances, \"labels\": labels},\n    )\n    
cutoff_df = cutoff_df[[\"time\", \"instance_id\", \"labels\"]]\n    property_feature = IdentityFeature(int_es[\"log\"].ww[\"value\"]) > 10\n\n    fm = calculate_feature_matrix(\n        [property_feature],\n        int_es,\n        cutoff_time=cutoff_df,\n        cutoff_time_in_index=True,\n    )\n    assert (fm[property_feature.get_name()] == fm[\"labels\"]).all()\n\n\ndef test_integer_time_index_mixed_cutoff(int_es):\n    times_dt = list(range(8, 17)) + [datetime(2011, 1, 1), 19, 20, 21, 22, 25, 24, 23]\n    labels = [False] * 3 + [True] * 2 + [False] * 9 + [False] * 2 + [True]\n    instances = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 16, 15, 14]\n    cutoff_df = pd.DataFrame(\n        {\"time\": times_dt, \"instance_id\": instances, \"labels\": labels},\n    )\n    cutoff_df = cutoff_df[[\"time\", \"instance_id\", \"labels\"]]\n    property_feature = IdentityFeature(int_es[\"log\"].ww[\"value\"]) > 10\n\n    error_text = \"cutoff_time times must be.*try casting via.*\"\n    with pytest.raises(TypeError, match=error_text):\n        calculate_feature_matrix([property_feature], int_es, cutoff_time=cutoff_df)\n\n    times_str = list(range(8, 17)) + [\"foobar\", 19, 20, 21, 22, 25, 24, 23]\n    cutoff_df[\"time\"] = times_str\n    with pytest.raises(TypeError, match=error_text):\n        calculate_feature_matrix([property_feature], int_es, cutoff_time=cutoff_df)\n\n    times_date_str = list(range(8, 17)) + [\"2018-04-02\", 19, 20, 21, 22, 25, 24, 23]\n    cutoff_df[\"time\"] = times_date_str\n    with pytest.raises(TypeError, match=error_text):\n        calculate_feature_matrix([property_feature], int_es, cutoff_time=cutoff_df)\n\n    times_int_str = list(range(8, 17)) + [\"17\", 19, 20, 21, 22, 25, 24, 23]\n    cutoff_df[\"time\"] = times_int_str\n    # a numeric string is not cast automatically, so this also raises a TypeError\n    with pytest.raises(TypeError, 
match=error_text):\n        calculate_feature_matrix([property_feature], int_es, cutoff_time=cutoff_df)\n\n\ndef test_datetime_index_mixed_cutoff(es):\n    times = list(\n        [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)]\n        + [datetime(2011, 4, 9, 10, 31, i * 9) for i in range(4)]\n        + [17]\n        + [datetime(2011, 4, 10, 10, 40, i) for i in range(2)]\n        + [datetime(2011, 4, 10, 10, 41, i * 3) for i in range(3)]\n        + [datetime(2011, 4, 10, 11, 10, i * 3) for i in range(2)],\n    )\n    labels = [False] * 3 + [True] * 2 + [False] * 9 + [False] * 2 + [True]\n    instances = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 16, 15, 14]\n    cutoff_df = pd.DataFrame(\n        {\"time\": times, \"instance_id\": instances, \"labels\": labels},\n    )\n    cutoff_df = cutoff_df[[\"time\", \"instance_id\", \"labels\"]]\n    property_feature = IdentityFeature(es[\"log\"].ww[\"value\"]) > 10\n\n    error_text = \"cutoff_time times must be.*try casting via.*\"\n    with pytest.raises(TypeError, match=error_text):\n        calculate_feature_matrix([property_feature], es, cutoff_time=cutoff_df)\n\n    times[9] = \"foobar\"\n    cutoff_df[\"time\"] = times\n    with pytest.raises(TypeError, match=error_text):\n        calculate_feature_matrix([property_feature], es, cutoff_time=cutoff_df)\n\n    times[9] = \"17\"\n    cutoff_df[\"time\"] = times\n    with pytest.raises(TypeError, match=error_text):\n        calculate_feature_matrix([property_feature], es, cutoff_time=cutoff_df)\n\n\ndef test_no_data_for_cutoff_time(mock_customer):\n    es = mock_customer\n    cutoff_times = pd.DataFrame(\n        {\"customer_id\": [4], \"time\": pd.Timestamp(\"2011-04-08 20:08:13\")},\n    )\n\n    trans_per_session = Feature(\n        es[\"transactions\"].ww[\"transaction_id\"],\n        parent_dataframe_name=\"sessions\",\n        primitive=Count,\n    )\n    trans_per_customer = Feature(\n        es[\"transactions\"].ww[\"transaction_id\"],\n        
parent_dataframe_name=\"customers\",\n        primitive=Count,\n    )\n    max_count = Feature(\n        trans_per_session,\n        parent_dataframe_name=\"customers\",\n        primitive=Max,\n    )\n    features = [trans_per_customer, max_count]\n\n    fm = calculate_feature_matrix(features, entityset=es, cutoff_time=cutoff_times)\n\n    # due to default values for each primitive\n    # count will be 0, but max will be NaN\n    answer = pd.DataFrame(\n        {\n            trans_per_customer.get_name(): pd.Series([0], dtype=\"Int64\"),\n            max_count.get_name(): pd.Series([np.nan], dtype=\"float\"),\n        },\n    )\n    for column in fm.columns:\n        pd.testing.assert_series_equal(\n            fm[column],\n            answer[column],\n            check_index=False,\n            check_names=False,\n        )\n\n\ndef test_instances_not_in_data(es):\n    last_instance = max(es[\"log\"].index.values)\n    instances = list(range(last_instance + 1, last_instance + 11))\n    identity_feature = IdentityFeature(es[\"log\"].ww[\"value\"])\n    property_feature = identity_feature > 10\n    agg_feat = AggregationFeature(\n        Feature(es[\"log\"].ww[\"value\"]),\n        parent_dataframe_name=\"sessions\",\n        primitive=Max,\n    )\n    direct_feature = DirectFeature(agg_feat, \"log\")\n    features = [identity_feature, property_feature, direct_feature]\n    fm = calculate_feature_matrix(features, entityset=es, instance_ids=instances)\n    assert all(fm.index.values == instances)\n    for column in fm.columns:\n        assert fm[column].isnull().all()\n\n    fm = calculate_feature_matrix(\n        features,\n        entityset=es,\n        instance_ids=instances,\n        approximate=\"730 days\",\n    )\n    assert all(fm.index.values == instances)\n    for column in fm.columns:\n        assert fm[column].isnull().all()\n\n\ndef test_some_instances_not_in_data(es):\n    a_time = datetime(2011, 4, 10, 10, 41, 9)  # only valid data\n    b_time = 
datetime(2011, 4, 10, 11, 10, 5)  # some missing data\n    c_time = datetime(2011, 4, 10, 12, 0, 0)  # all missing data\n\n    times = [a_time, b_time, a_time, a_time, b_time, b_time] + [c_time] * 4\n    cutoff_time = pd.DataFrame({\"instance_id\": list(range(12, 22)), \"time\": times})\n    identity_feature = IdentityFeature(es[\"log\"].ww[\"value\"])\n    property_feature = identity_feature > 10\n    agg_feat = AggregationFeature(\n        Feature(es[\"log\"].ww[\"value\"]),\n        parent_dataframe_name=\"sessions\",\n        primitive=Max,\n    )\n    direct_feature = DirectFeature(agg_feat, \"log\")\n    features = [identity_feature, property_feature, direct_feature]\n    fm = calculate_feature_matrix(features, entityset=es, cutoff_time=cutoff_time)\n    ifeat_answer = pd.Series([0, 7, 14, np.nan] + [np.nan] * 6)\n    prop_answer = pd.Series([0, 0, 1, pd.NA, 0] + [pd.NA] * 5, dtype=\"boolean\")\n    dfeat_answer = pd.Series([14, 14, 14, np.nan] + [np.nan] * 6)\n\n    assert all(fm.index.values == cutoff_time[\"instance_id\"].values)\n    for x, y in zip(fm.columns, [ifeat_answer, prop_answer, dfeat_answer]):\n        pd.testing.assert_series_equal(fm[x], y, check_index=False, check_names=False)\n\n    fm = calculate_feature_matrix(\n        features,\n        entityset=es,\n        cutoff_time=cutoff_time,\n        approximate=\"5 seconds\",\n    )\n\n    dfeat_answer[0] = 7  # approximate calculated before 14 appears\n    dfeat_answer[2] = 7  # approximate calculated before 14 appears\n    prop_answer[3] = False  # no_unapproximated_aggs code ignores cutoff time\n\n    assert all(fm.index.values == cutoff_time[\"instance_id\"].values)\n    for x, y in zip(fm.columns, [ifeat_answer, prop_answer, dfeat_answer]):\n        pd.testing.assert_series_equal(fm[x], y, check_index=False, check_names=False)\n\n\ndef test_missing_instances_with_categorical_index(es):\n    instance_ids = [\"coke zero\", \"car\", 3, \"taco clock\"]\n    features = dfs(\n        
entityset=es,\n        target_dataframe_name=\"products\",\n        features_only=True,\n    )\n\n    fm = calculate_feature_matrix(\n        entityset=es,\n        features=features,\n        instance_ids=instance_ids,\n    )\n    assert fm.index.values.to_list() == instance_ids\n    assert isinstance(fm.index, pd.CategoricalIndex)\n\n\ndef test_handle_chunk_size():\n    total_size = 100\n\n    # user provides no chunk size\n    assert _handle_chunk_size(None, total_size) is None\n\n    # user provides fractional size\n    assert _handle_chunk_size(0.1, total_size) == total_size * 0.1\n    assert _handle_chunk_size(0.001, total_size) == 1  # rounds up\n    assert _handle_chunk_size(0.345, total_size) == 35  # rounds up\n\n    # user provides absolute size\n    assert _handle_chunk_size(1, total_size) == 1\n    assert _handle_chunk_size(100, total_size) == 100\n    assert isinstance(_handle_chunk_size(100.0, total_size), int)\n\n    # test invalid cases\n    with pytest.raises(AssertionError, match=\"Chunk size must be greater than 0\"):\n        _handle_chunk_size(0, total_size)\n\n    with pytest.raises(AssertionError, match=\"Chunk size must be greater than 0\"):\n        _handle_chunk_size(-1, total_size)\n\n\ndef test_chunk_dataframe_groups():\n    df = pd.DataFrame({\"group\": [1, 1, 1, 1, 2, 2, 3]})\n\n    grouped = df.groupby(\"group\")\n    chunked_grouped = _chunk_dataframe_groups(grouped, 2)\n\n    # test group larger than chunk size gets split up\n    first = next(chunked_grouped)\n    assert first[0] == 1 and first[1].shape[0] == 2\n    second = next(chunked_grouped)\n    assert second[0] == 1 and second[1].shape[0] == 2\n\n    # test that equal to and less than chunk size stays together\n    third = next(chunked_grouped)\n    assert third[0] == 2 and third[1].shape[0] == 2\n    fourth = next(chunked_grouped)\n    assert fourth[0] == 3 and fourth[1].shape[0] == 1\n\n\ndef test_calls_progress_callback(mock_customer):\n    class MockProgressCallback:\n   
     def __init__(self):\n            self.progress_history = []\n            self.total_update = 0\n            self.total_progress_percent = 0\n\n        def __call__(self, update, progress_percent, time_elapsed):\n            self.total_update += update\n            self.total_progress_percent = progress_percent\n            self.progress_history.append(progress_percent)\n\n    mock_progress_callback = MockProgressCallback()\n\n    es = mock_customer\n\n    # make sure to calculate features that have different paths to same base feature\n    trans_per_session = Feature(\n        es[\"transactions\"].ww[\"transaction_id\"],\n        parent_dataframe_name=\"sessions\",\n        primitive=Count,\n    )\n    trans_per_customer = Feature(\n        es[\"transactions\"].ww[\"transaction_id\"],\n        parent_dataframe_name=\"customers\",\n        primitive=Count,\n    )\n    features = [trans_per_session, Feature(trans_per_customer, \"sessions\")]\n    calculate_feature_matrix(\n        features,\n        entityset=es,\n        progress_callback=mock_progress_callback,\n    )\n\n    # second to last entry is the last update from feature calculation\n    assert np.isclose(\n        mock_progress_callback.progress_history[-2],\n        FEATURE_CALCULATION_PERCENTAGE * 100,\n    )\n    assert np.isclose(mock_progress_callback.total_update, 100.0)\n    assert np.isclose(mock_progress_callback.total_progress_percent, 100.0)\n\n    # test with cutoff time dataframe\n    mock_progress_callback = MockProgressCallback()\n    cutoff_time = pd.DataFrame(\n        {\n            \"instance_id\": [1, 2, 3],\n            \"time\": [\n                pd.to_datetime(\"2014-01-01 01:00:00\"),\n                pd.to_datetime(\"2014-01-01 02:00:00\"),\n                pd.to_datetime(\"2014-01-01 03:00:00\"),\n            ],\n        },\n    )\n\n    calculate_feature_matrix(\n        features,\n        entityset=es,\n        cutoff_time=cutoff_time,\n        
progress_callback=mock_progress_callback,\n    )\n    assert np.isclose(\n        mock_progress_callback.progress_history[-2],\n        FEATURE_CALCULATION_PERCENTAGE * 100,\n    )\n    assert np.isclose(mock_progress_callback.total_update, 100.0)\n    assert np.isclose(mock_progress_callback.total_progress_percent, 100.0)\n\n\ndef test_calls_progress_callback_cluster(mock_customer, dask_cluster):\n    class MockProgressCallback:\n        def __init__(self):\n            self.progress_history = []\n            self.total_update = 0\n            self.total_progress_percent = 0\n\n        def __call__(self, update, progress_percent, time_elapsed):\n            self.total_update += update\n            self.total_progress_percent = progress_percent\n            self.progress_history.append(progress_percent)\n\n    mock_progress_callback = MockProgressCallback()\n\n    trans_per_session = Feature(\n        mock_customer[\"transactions\"].ww[\"transaction_id\"],\n        parent_dataframe_name=\"sessions\",\n        primitive=Count,\n    )\n    trans_per_customer = Feature(\n        mock_customer[\"transactions\"].ww[\"transaction_id\"],\n        parent_dataframe_name=\"customers\",\n        primitive=Count,\n    )\n    features = [trans_per_session, Feature(trans_per_customer, \"sessions\")]\n\n    dkwargs = {\"cluster\": dask_cluster.scheduler.address}\n    calculate_feature_matrix(\n        features,\n        entityset=mock_customer,\n        progress_callback=mock_progress_callback,\n        dask_kwargs=dkwargs,\n    )\n\n    assert np.isclose(mock_progress_callback.total_update, 100.0)\n    assert np.isclose(mock_progress_callback.total_progress_percent, 100.0)\n\n\ndef test_closes_tqdm(es):\n    class ErrorPrim(TransformPrimitive):\n        \"\"\"A primitive whose function raises an error\"\"\"\n\n        name = \"error_prim\"\n        input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n        return_type = \"Numeric\"\n\n        def get_function(self):\n    
        def error(s):\n                raise RuntimeError(\"This primitive has errored\")\n\n            return error\n\n    value = Feature(es[\"log\"].ww[\"value\"])\n    property_feature = value > 10\n    error_feature = Feature(value, primitive=ErrorPrim)\n\n    calculate_feature_matrix([property_feature], es, verbose=True)\n\n    assert len(tqdm._instances) == 0\n\n    match = \"This primitive has errored\"\n    with pytest.raises(RuntimeError, match=match):\n        calculate_feature_matrix([value, error_feature], es, verbose=True)\n    assert len(tqdm._instances) == 0\n\n\ndef test_approximate_with_single_cutoff_warns(es):\n    features = dfs(\n        entityset=es,\n        target_dataframe_name=\"customers\",\n        features_only=True,\n        ignore_dataframes=[\"cohorts\"],\n        agg_primitives=[\"sum\"],\n    )\n\n    match = (\n        \"Using approximate with a single cutoff_time value or no cutoff_time \"\n        \"provides no computational efficiency benefit\"\n    )\n    # test warning with single cutoff time\n    with pytest.warns(UserWarning, match=match):\n        calculate_feature_matrix(\n            features,\n            es,\n            cutoff_time=pd.to_datetime(\"2020-01-01\"),\n            approximate=\"1 day\",\n        )\n    # test warning with no cutoff time\n    with pytest.warns(UserWarning, match=match):\n        calculate_feature_matrix(features, es, approximate=\"1 day\")\n\n    # check proper handling of approximate\n    feature_matrix = calculate_feature_matrix(\n        features,\n        es,\n        cutoff_time=pd.to_datetime(\"2011-04-09 10:31:30\"),\n        approximate=\"1 minute\",\n    )\n\n    expected_values = [50, 50, 50]\n    assert (feature_matrix[\"régions.SUM(log.value)\"] == expected_values).values.all()\n\n\ndef test_calc_feature_matrix_with_cutoff_df_and_instance_ids(es):\n    times = list(\n        [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)]\n        + [datetime(2011, 4, 9, 10, 31, i * 9) 
for i in range(4)]\n        + [datetime(2011, 4, 9, 10, 40, 0)]\n        + [datetime(2011, 4, 10, 10, 40, i) for i in range(2)]\n        + [datetime(2011, 4, 10, 10, 41, i * 3) for i in range(3)]\n        + [datetime(2011, 4, 10, 11, 10, i * 3) for i in range(2)],\n    )\n    instances = range(17)\n    cutoff_time = pd.DataFrame({\"time\": times, es[\"log\"].ww.index: instances})\n    labels = [False] * 3 + [True] * 2 + [False] * 9 + [True] + [False] * 2\n\n    property_feature = Feature(es[\"log\"].ww[\"value\"]) > 10\n\n    match = \"Passing 'instance_ids' is valid only if 'cutoff_time' is a single value or None - ignoring\"\n    with pytest.warns(UserWarning, match=match):\n        feature_matrix = calculate_feature_matrix(\n            [property_feature],\n            es,\n            cutoff_time=cutoff_time,\n            instance_ids=[1, 3, 5],\n            verbose=True,\n        )\n\n    assert (feature_matrix[property_feature.get_name()] == labels).values.all()\n\n\ndef test_calculate_feature_matrix_returns_default_values(default_value_es):\n    sum_features = Feature(\n        default_value_es[\"transactions\"].ww[\"value\"],\n        parent_dataframe_name=\"sessions\",\n        primitive=Sum,\n    )\n    sessions_sum = Feature(sum_features, \"transactions\")\n\n    feature_matrix = calculate_feature_matrix(\n        features=[sessions_sum],\n        entityset=default_value_es,\n    )\n\n    expected_values = [2.0, 2.0, 1.0, 0.0]\n\n    assert (feature_matrix[sessions_sum.get_name()] == expected_values).values.all()\n\n\ndef test_dataframes_relationships(dataframes, relationships):\n    fm_1, features = dfs(\n        dataframes=dataframes,\n        relationships=relationships,\n        target_dataframe_name=\"transactions\",\n    )\n\n    fm_2 = calculate_feature_matrix(\n        features=features,\n        dataframes=dataframes,\n        relationships=relationships,\n    )\n\n    assert fm_1.equals(fm_2)\n\n\ndef test_no_dataframes(dataframes, 
relationships):\n    features = dfs(\n        dataframes=dataframes,\n        relationships=relationships,\n        target_dataframe_name=\"transactions\",\n        features_only=True,\n    )\n\n    msg = \"No dataframes or valid EntitySet provided\"\n    with pytest.raises(TypeError, match=msg):\n        calculate_feature_matrix(features=features, dataframes=None, relationships=None)\n\n\ndef test_no_relationships(dataframes):\n    fm_1, features = dfs(\n        dataframes=dataframes,\n        relationships=None,\n        target_dataframe_name=\"transactions\",\n    )\n\n    fm_2 = calculate_feature_matrix(\n        features=features,\n        dataframes=dataframes,\n        relationships=None,\n    )\n\n    assert fm_1.equals(fm_2)\n\n\ndef test_cfm_with_invalid_time_index(es):\n    features = dfs(entityset=es, target_dataframe_name=\"customers\", features_only=True)\n    es[\"customers\"].ww.set_types(logical_types={\"signup_date\": \"integer\"})\n    match = \"customers time index is numeric type \"\n    match += \"which differs from other entityset time indexes\"\n    with pytest.raises(TypeError, match=match):\n        calculate_feature_matrix(features=features, entityset=es)\n\n\ndef test_cfm_introduces_nan_values_in_direct_feats(es):\n    es[\"customers\"].ww.set_types(\n        logical_types={\"age\": \"Age\", \"engagement_level\": \"Integer\"},\n    )\n    age_feat = Feature(es[\"customers\"].ww[\"age\"])\n    engagement_feat = Feature(es[\"customers\"].ww[\"engagement_level\"])\n    loves_ice_cream_feat = Feature(es[\"customers\"].ww[\"loves_ice_cream\"])\n    features = [age_feat, engagement_feat, loves_ice_cream_feat]\n    fm = calculate_feature_matrix(\n        features=features,\n        entityset=es,\n        cutoff_time=pd.Timestamp(\"2010-04-08 04:00\"),\n        instance_ids=[1],\n    )\n\n    assert isinstance(es[\"customers\"].ww.logical_types[\"age\"], Age)\n    assert isinstance(es[\"customers\"].ww.logical_types[\"engagement_level\"], 
Integer)\n    assert isinstance(es[\"customers\"].ww.logical_types[\"loves_ice_cream\"], Boolean)\n\n    assert isinstance(fm.ww.logical_types[\"age\"], AgeNullable)\n    assert isinstance(fm.ww.logical_types[\"engagement_level\"], IntegerNullable)\n    assert isinstance(fm.ww.logical_types[\"loves_ice_cream\"], BooleanNullable)\n\n\ndef test_feature_origins_present_on_all_fm_cols(es):\n    class MultiCumSum(TransformPrimitive):\n        name = \"multi_cum_sum\"\n        input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n        return_type = ColumnSchema(semantic_tags={\"numeric\"})\n        number_output_features = 3\n\n        def get_function(self):\n            def multi_cum_sum(x):\n                return x.cumsum(), x.cummax(), x.cummin()\n\n            return multi_cum_sum\n\n    feature_matrix, _ = dfs(\n        entityset=es,\n        target_dataframe_name=\"log\",\n        trans_primitives=[MultiCumSum],\n    )\n\n    for col in feature_matrix.columns:\n        origin = feature_matrix.ww[col].ww.origin\n        assert origin in [\"base\", \"engineered\"]\n\n\ndef test_renamed_features_have_expected_column_names_in_feature_matrix(es):\n    class MultiCumulative(TransformPrimitive):\n        name = \"multi_cum_sum\"\n        input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n        return_type = ColumnSchema(semantic_tags={\"numeric\"})\n        number_output_features = 3\n\n        def get_function(self):\n            def multi_cum_sum(x):\n                return x.cumsum(), x.cummax(), x.cummin()\n\n            return multi_cum_sum\n\n    multi_output_trans_feat = Feature(\n        es[\"log\"].ww[\"value\"],\n        primitive=MultiCumulative,\n    )\n    groupby_trans_feat = GroupByTransformFeature(\n        es[\"log\"].ww[\"value\"],\n        primitive=MultiCumulative,\n        groupby=es[\"log\"].ww[\"product_id\"],\n    )\n    multi_output_agg_feat = Feature(\n        es[\"log\"].ww[\"product_id\"],\n        
parent_dataframe_name=\"customers\",\n        primitive=NMostCommon(n=2),\n    )\n    slice = FeatureOutputSlice(multi_output_trans_feat, 1)\n    stacked_feat = Feature(slice, primitive=Negate)\n\n    multi_output_trans_names = [\"cumulative_sum\", \"cumulative_max\", \"cumulative_min\"]\n    multi_output_trans_feat.set_feature_names(multi_output_trans_names)\n    groupby_trans_feat_names = [\"grouped_sum\", \"grouped_max\", \"grouped_min\"]\n    groupby_trans_feat.set_feature_names(groupby_trans_feat_names)\n    agg_names = [\"first_most_common\", \"second_most_common\"]\n    multi_output_agg_feat.set_feature_names(agg_names)\n\n    features = [\n        multi_output_trans_feat,\n        multi_output_agg_feat,\n        stacked_feat,\n        groupby_trans_feat,\n    ]\n    feature_matrix = calculate_feature_matrix(entityset=es, features=features)\n    expected_names = multi_output_trans_names + agg_names + groupby_trans_feat_names\n    for renamed_col in expected_names:\n        assert renamed_col in feature_matrix.columns\n\n    expected_stacked_name = \"-(cumulative_max)\"\n    assert expected_stacked_name in feature_matrix.columns\n"
  },
  {
    "path": "featuretools/tests/computational_backend/test_feature_set.py",
    "content": "from featuretools import (\n    AggregationFeature,\n    DirectFeature,\n    IdentityFeature,\n    TransformFeature,\n    primitives,\n)\nfrom featuretools.computational_backends.feature_set import FeatureSet\nfrom featuretools.entityset.relationship import RelationshipPath\nfrom featuretools.tests.testing_utils import backward_path\nfrom featuretools.utils import Trie\n\n\ndef test_feature_trie_without_needs_full_dataframe(diamond_es):\n    es = diamond_es\n    country_name = IdentityFeature(es[\"countries\"].ww[\"name\"])\n    direct_name = DirectFeature(country_name, \"regions\")\n    amount = IdentityFeature(es[\"transactions\"].ww[\"amount\"])\n\n    path_through_customers = backward_path(es, [\"regions\", \"customers\", \"transactions\"])\n    through_customers = AggregationFeature(\n        amount,\n        \"regions\",\n        primitive=primitives.Mean,\n        relationship_path=path_through_customers,\n    )\n    path_through_stores = backward_path(es, [\"regions\", \"stores\", \"transactions\"])\n    through_stores = AggregationFeature(\n        amount,\n        \"regions\",\n        primitive=primitives.Mean,\n        relationship_path=path_through_stores,\n    )\n    customers_to_transactions = backward_path(es, [\"customers\", \"transactions\"])\n    customers_mean = AggregationFeature(\n        amount,\n        \"customers\",\n        primitive=primitives.Mean,\n        relationship_path=customers_to_transactions,\n    )\n\n    negation = TransformFeature(customers_mean, primitives.Negate)\n    regions_to_customers = backward_path(es, [\"regions\", \"customers\"])\n    mean_of_mean = AggregationFeature(\n        negation,\n        \"regions\",\n        primitive=primitives.Mean,\n        relationship_path=regions_to_customers,\n    )\n\n    features = [direct_name, through_customers, through_stores, mean_of_mean]\n\n    feature_set = FeatureSet(features)\n    trie = feature_set.feature_trie\n\n    assert trie.value == (False, set(), 
{f.unique_name() for f in features})\n    assert trie.get_node(direct_name.relationship_path).value == (\n        False,\n        set(),\n        {country_name.unique_name()},\n    )\n    assert trie.get_node(regions_to_customers).value == (\n        False,\n        set(),\n        {negation.unique_name(), customers_mean.unique_name()},\n    )\n    regions_to_stores = backward_path(es, [\"regions\", \"stores\"])\n    assert trie.get_node(regions_to_stores).value == (False, set(), set())\n    assert trie.get_node(path_through_customers).value == (\n        False,\n        set(),\n        {amount.unique_name()},\n    )\n    assert trie.get_node(path_through_stores).value == (\n        False,\n        set(),\n        {amount.unique_name()},\n    )\n\n\ndef test_feature_trie_with_needs_full_dataframe(diamond_es):\n    es = diamond_es\n    amount = IdentityFeature(es[\"transactions\"].ww[\"amount\"])\n\n    path_through_customers = backward_path(\n        es,\n        [\"regions\", \"customers\", \"transactions\"],\n    )\n    agg = AggregationFeature(\n        amount,\n        \"regions\",\n        primitive=primitives.Mean,\n        relationship_path=path_through_customers,\n    )\n    trans_of_agg = TransformFeature(agg, primitives.CumSum)\n\n    path_through_stores = backward_path(es, [\"regions\", \"stores\", \"transactions\"])\n    trans = TransformFeature(amount, primitives.CumSum)\n    agg_of_trans = AggregationFeature(\n        trans,\n        \"regions\",\n        primitive=primitives.Mean,\n        relationship_path=path_through_stores,\n    )\n\n    features = [agg, trans_of_agg, agg_of_trans]\n    feature_set = FeatureSet(features)\n    trie = feature_set.feature_trie\n\n    assert trie.value == (\n        True,\n        {agg.unique_name(), trans_of_agg.unique_name()},\n        {agg_of_trans.unique_name()},\n    )\n    assert trie.get_node(path_through_customers).value == (\n        True,\n        {amount.unique_name()},\n        set(),\n    )\n    assert 
trie.get_node(path_through_customers[:1]).value == (True, set(), set())\n    assert trie.get_node(path_through_stores).value == (\n        True,\n        {amount.unique_name(), trans.unique_name()},\n        set(),\n    )\n    assert trie.get_node(path_through_stores[:1]).value == (False, set(), set())\n\n\ndef test_feature_trie_with_needs_full_dataframe_direct(es):\n    value = IdentityFeature(es[\"log\"].ww[\"value\"])\n    agg = AggregationFeature(value, \"sessions\", primitive=primitives.Mean)\n    agg_of_agg = AggregationFeature(agg, \"customers\", primitive=primitives.Sum)\n    direct = DirectFeature(agg_of_agg, \"sessions\")\n    trans = TransformFeature(direct, primitives.CumSum)\n\n    features = [trans, agg]\n    feature_set = FeatureSet(features)\n    trie = feature_set.feature_trie\n\n    assert trie.value == (\n        True,\n        {direct.unique_name(), trans.unique_name()},\n        {agg.unique_name()},\n    )\n\n    assert trie.get_node(agg.relationship_path).value == (\n        False,\n        set(),\n        {value.unique_name()},\n    )\n\n    parent_node = trie.get_node(direct.relationship_path)\n    assert parent_node.value == (True, {agg_of_agg.unique_name()}, set())\n\n    child_through_parent_node = parent_node.get_node(agg_of_agg.relationship_path)\n    assert child_through_parent_node.value == (True, {agg.unique_name()}, set())\n\n    assert child_through_parent_node.get_node(agg.relationship_path).value == (\n        True,\n        {value.unique_name()},\n        set(),\n    )\n\n\ndef test_feature_trie_ignores_approximate_features(es):\n    value = IdentityFeature(es[\"log\"].ww[\"value\"])\n    agg = AggregationFeature(value, \"sessions\", primitive=primitives.Mean)\n    agg_of_agg = AggregationFeature(agg, \"customers\", primitive=primitives.Sum)\n    direct = DirectFeature(agg_of_agg, \"sessions\")\n    features = [direct, agg]\n\n    approximate_feature_trie = Trie(default=list, path_constructor=RelationshipPath)\n    
approximate_feature_trie.get_node(direct.relationship_path).value = [agg_of_agg]\n    feature_set = FeatureSet(\n        features,\n        approximate_feature_trie=approximate_feature_trie,\n    )\n    trie = feature_set.feature_trie\n\n    # Since agg_of_agg is ignored it and its dependencies should not be in the\n    # trie.\n    sub_trie = trie.get_node(direct.relationship_path)\n    for _path, (_, _, features) in sub_trie:\n        assert not features\n\n    assert trie.value == (False, set(), {direct.unique_name(), agg.unique_name()})\n    assert trie.get_node(agg.relationship_path).value == (\n        False,\n        set(),\n        {value.unique_name()},\n    )\n"
  },
  {
    "path": "featuretools/tests/computational_backend/test_feature_set_calculator.py",
    "content": "from datetime import datetime\n\nimport numpy as np\nimport pandas as pd\nimport pytest\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Categorical, Datetime, Double, Integer\n\nfrom featuretools import (\n    AggregationFeature,\n    EntitySet,\n    Feature,\n    Timedelta,\n    calculate_feature_matrix,\n)\nfrom featuretools.computational_backends.feature_set import FeatureSet\nfrom featuretools.computational_backends.feature_set_calculator import (\n    FeatureSetCalculator,\n)\nfrom featuretools.entityset.relationship import RelationshipPath\nfrom featuretools.feature_base import DirectFeature, IdentityFeature\nfrom featuretools.primitives import (\n    And,\n    Count,\n    CumSum,\n    EqualScalar,\n    GreaterThanEqualToScalar,\n    GreaterThanScalar,\n    LessThanEqualToScalar,\n    LessThanScalar,\n    Mean,\n    Min,\n    Mode,\n    Negate,\n    NMostCommon,\n    NotEqualScalar,\n    NumTrue,\n    Sum,\n    TimeSinceLast,\n    Trend,\n)\nfrom featuretools.primitives.base import AggregationPrimitive\nfrom featuretools.primitives.standard.aggregation.num_unique import NumUnique\nfrom featuretools.tests.testing_utils import backward_path\nfrom featuretools.utils import Trie\n\n\ndef test_make_identity(es):\n    f = IdentityFeature(es[\"log\"].ww[\"datetime\"])\n\n    feature_set = FeatureSet([f])\n    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)\n    df = calculator.run(np.array([0]))\n\n    v = df[f.get_name()][0]\n    assert v == datetime(2011, 4, 9, 10, 30, 0)\n\n\ndef test_make_dfeat(es):\n    f = DirectFeature(\n        Feature(es[\"customers\"].ww[\"age\"]),\n        child_dataframe_name=\"sessions\",\n    )\n\n    feature_set = FeatureSet([f])\n    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)\n    df = calculator.run(np.array([0]))\n\n    v = df[f.get_name()][0]\n    assert v == 33\n\n\ndef 
test_make_agg_feat_of_identity_column(es):\n    agg_feat = Feature(\n        es[\"log\"].ww[\"value\"],\n        parent_dataframe_name=\"sessions\",\n        primitive=Sum,\n    )\n\n    feature_set = FeatureSet([agg_feat])\n    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)\n    df = calculator.run(np.array([0]))\n\n    v = df[agg_feat.get_name()][0]\n    assert v == 50\n\n\ndef test_full_dataframe_trans_of_agg(es):\n    agg_feat = Feature(\n        es[\"log\"].ww[\"value\"],\n        parent_dataframe_name=\"customers\",\n        primitive=Sum,\n    )\n    trans_feat = Feature(agg_feat, primitive=CumSum)\n\n    feature_set = FeatureSet([trans_feat])\n    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)\n    df = calculator.run(np.array([1]))\n\n    v = df[trans_feat.get_name()].values[0]\n    assert v == 82\n\n\ndef test_make_agg_feat_of_identity_index_column(es):\n    agg_feat = Feature(\n        es[\"log\"].ww[\"id\"],\n        parent_dataframe_name=\"sessions\",\n        primitive=Count,\n    )\n\n    feature_set = FeatureSet([agg_feat])\n    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)\n    df = calculator.run(np.array([0]))\n\n    v = df[agg_feat.get_name()][0]\n    assert v == 5\n\n\ndef test_make_agg_feat_where_count(es):\n    agg_feat = Feature(\n        es[\"log\"].ww[\"id\"],\n        parent_dataframe_name=\"sessions\",\n        where=IdentityFeature(es[\"log\"].ww[\"product_id\"]) == \"coke zero\",\n        primitive=Count,\n    )\n\n    feature_set = FeatureSet([agg_feat])\n    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)\n    df = calculator.run(np.array([0]))\n\n    v = df[agg_feat.get_name()][0]\n    assert v == 3\n\n\ndef test_make_agg_feat_using_prev_time(es):\n    agg_feat = Feature(\n        es[\"log\"].ww[\"id\"],\n        parent_dataframe_name=\"sessions\",\n        use_previous=Timedelta(10, \"s\"),\n        
primitive=Count,\n    )\n\n    feature_set = FeatureSet([agg_feat])\n    calculator = FeatureSetCalculator(\n        es,\n        time_last=datetime(2011, 4, 9, 10, 30, 10),\n        feature_set=feature_set,\n    )\n    df = calculator.run(np.array([0]))\n\n    v = df[agg_feat.get_name()][0]\n    assert v == 2\n\n    calculator = FeatureSetCalculator(\n        es,\n        time_last=datetime(2011, 4, 9, 10, 30, 30),\n        feature_set=feature_set,\n    )\n    df = calculator.run(np.array([0]))\n\n    v = df[agg_feat.get_name()][0]\n    assert v == 1\n\n\ndef test_make_agg_feat_using_prev_n_events(es):\n    agg_feat_1 = Feature(\n        es[\"log\"].ww[\"value\"],\n        parent_dataframe_name=\"sessions\",\n        use_previous=Timedelta(1, \"observations\"),\n        primitive=Min,\n    )\n\n    agg_feat_2 = Feature(\n        es[\"log\"].ww[\"value\"],\n        parent_dataframe_name=\"sessions\",\n        use_previous=Timedelta(3, \"observations\"),\n        primitive=Min,\n    )\n\n    assert (\n        agg_feat_1.get_name() != agg_feat_2.get_name()\n    ), \"Features should have different names based on use_previous\"\n\n    feature_set = FeatureSet([agg_feat_1, agg_feat_2])\n    calculator = FeatureSetCalculator(\n        es,\n        time_last=datetime(2011, 4, 9, 10, 30, 6),\n        feature_set=feature_set,\n    )\n    df = calculator.run(np.array([0]))\n\n    # time_last is included by default\n    v1 = df[agg_feat_1.get_name()][0]\n    v2 = df[agg_feat_2.get_name()][0]\n    assert v1 == 5\n    assert v2 == 0\n\n    calculator = FeatureSetCalculator(\n        es,\n        time_last=datetime(2011, 4, 9, 10, 30, 30),\n        feature_set=feature_set,\n    )\n    df = calculator.run(np.array([0]))\n\n    v1 = df[agg_feat_1.get_name()][0]\n    v2 = df[agg_feat_2.get_name()][0]\n    assert v1 == 20\n    assert v2 == 10\n\n\ndef test_make_agg_feat_multiple_dtypes(es):\n    compare_prod = IdentityFeature(es[\"log\"].ww[\"product_id\"]) == \"coke zero\"\n\n    
agg_feat = Feature(\n        es[\"log\"].ww[\"id\"],\n        parent_dataframe_name=\"sessions\",\n        where=compare_prod,\n        primitive=Count,\n    )\n\n    agg_feat2 = Feature(\n        es[\"log\"].ww[\"product_id\"],\n        parent_dataframe_name=\"sessions\",\n        where=compare_prod,\n        primitive=Mode,\n    )\n\n    feature_set = FeatureSet([agg_feat, agg_feat2])\n    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)\n    df = calculator.run(np.array([0]))\n\n    v = df[agg_feat.get_name()][0]\n    v2 = df[agg_feat2.get_name()][0]\n    assert v == 3\n    assert v2 == \"coke zero\"\n\n\ndef test_make_agg_feat_where_different_identity_feat(es):\n    feats = []\n    where_cmps = [\n        LessThanScalar,\n        GreaterThanScalar,\n        LessThanEqualToScalar,\n        GreaterThanEqualToScalar,\n        EqualScalar,\n        NotEqualScalar,\n    ]\n    for where_cmp in where_cmps:\n        feats.append(\n            Feature(\n                es[\"log\"].ww[\"id\"],\n                parent_dataframe_name=\"sessions\",\n                where=Feature(\n                    es[\"log\"].ww[\"value\"],\n                    primitive=where_cmp(10.0),\n                ),\n                primitive=Count,\n            ),\n        )\n\n    df = calculate_feature_matrix(\n        entityset=es,\n        features=feats,\n        instance_ids=[0, 1, 2, 3],\n    )\n\n    for i, where_cmp in enumerate(where_cmps):\n        name = feats[i].get_name()\n        instances = df[name]\n        v0, v1, v2, v3 = instances[0:4]\n        if where_cmp == LessThanScalar:\n            assert v0 == 2\n            assert v1 == 4\n            assert v2 == 1\n            assert v3 == 2\n        elif where_cmp == GreaterThanScalar:\n            assert v0 == 2\n            assert v1 == 0\n            assert v2 == 0\n            assert v3 == 0\n        elif where_cmp == LessThanEqualToScalar:\n            assert v0 == 3\n            assert v1 == 
4\n            assert v2 == 1\n            assert v3 == 2\n        elif where_cmp == GreaterThanEqualToScalar:\n            assert v0 == 3\n            assert v1 == 0\n            assert v2 == 0\n            assert v3 == 0\n        elif where_cmp == EqualScalar:\n            assert v0 == 1\n            assert v1 == 0\n            assert v2 == 0\n            assert v3 == 0\n        elif where_cmp == NotEqualScalar:\n            assert v0 == 4\n            assert v1 == 4\n            assert v2 == 1\n            assert v3 == 2\n\n\ndef test_make_agg_feat_of_grandchild_dataframe(es):\n    agg_feat = Feature(\n        es[\"log\"].ww[\"id\"],\n        parent_dataframe_name=\"customers\",\n        primitive=Count,\n    )\n\n    feature_set = FeatureSet([agg_feat])\n    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)\n    df = calculator.run(np.array([0]))\n    v = df[agg_feat.get_name()].values[0]\n    assert v == 10\n\n\ndef test_make_agg_feat_where_count_feat(es):\n    \"\"\"\n    Feature we're creating is:\n    Number of sessions for each customer where the\n    number of logs in the session is greater than 1\n    \"\"\"\n    log_count_feat = Feature(\n        es[\"log\"].ww[\"id\"],\n        parent_dataframe_name=\"sessions\",\n        primitive=Count,\n    )\n\n    feat = Feature(\n        es[\"sessions\"].ww[\"id\"],\n        parent_dataframe_name=\"customers\",\n        where=log_count_feat > 1,\n        primitive=Count,\n    )\n\n    feature_set = FeatureSet([feat])\n    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)\n    df = calculator.run(np.array([0, 1]))\n\n    name = feat.get_name()\n    instances = df[name]\n    v0, v1 = instances[0:2]\n    assert v0 == 2\n    assert v1 == 2\n\n\ndef test_make_compare_feat(es):\n    \"\"\"\n    Feature we're creating is:\n    Whether the number of logs in each session is greater than\n    the mean number of logs per session for that session's customer\n    \"\"\"\n    log_count_feat = Feature(\n 
       es[\"log\"].ww[\"id\"],\n        parent_dataframe_name=\"sessions\",\n        primitive=Count,\n    )\n\n    mean_agg_feat = Feature(\n        log_count_feat,\n        parent_dataframe_name=\"customers\",\n        primitive=Mean,\n    )\n\n    mean_feat = DirectFeature(mean_agg_feat, child_dataframe_name=\"sessions\")\n\n    feat = log_count_feat > mean_feat\n\n    feature_set = FeatureSet([feat])\n    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)\n    df = calculator.run(np.array([0, 1, 2]))\n\n    name = feat.get_name()\n    instances = df[name]\n    v0, v1, v2 = instances[0:3]\n    assert v0\n    assert v1\n    assert not v2\n\n\ndef test_make_agg_feat_where_count_and_device_type_feat(es):\n    \"\"\"\n    Feature we're creating is:\n    Number of sessions for each customer where the\n    number of logs in the session equals 1 and the device type is 1\n    \"\"\"\n    log_count_feat = Feature(\n        es[\"log\"].ww[\"id\"],\n        parent_dataframe_name=\"sessions\",\n        primitive=Count,\n    )\n\n    compare_count = log_count_feat == 1\n    compare_device_type = IdentityFeature(es[\"sessions\"].ww[\"device_type\"]) == 1\n    and_feat = Feature([compare_count, compare_device_type], primitive=And)\n    feat = Feature(\n        es[\"sessions\"].ww[\"id\"],\n        parent_dataframe_name=\"customers\",\n        where=and_feat,\n        primitive=Count,\n    )\n\n    feature_set = FeatureSet([feat])\n    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)\n    df = calculator.run(np.array([0]))\n\n    name = feat.get_name()\n    instances = df[name]\n    assert instances.values[0] == 1\n\n\ndef test_make_agg_feat_where_count_or_device_type_feat(es):\n    \"\"\"\n    Feature we're creating is:\n    Number of sessions for each customer where the\n    number of logs in the session is greater than 1 or the device type is 1\n    \"\"\"\n    log_count_feat = Feature(\n        es[\"log\"].ww[\"id\"],\n        
parent_dataframe_name=\"sessions\",\n        primitive=Count,\n    )\n\n    compare_count = log_count_feat > 1\n    compare_device_type = IdentityFeature(es[\"sessions\"].ww[\"device_type\"]) == 1\n    or_feat = compare_count.OR(compare_device_type)\n    feat = Feature(\n        es[\"sessions\"].ww[\"id\"],\n        parent_dataframe_name=\"customers\",\n        where=or_feat,\n        primitive=Count,\n    )\n\n    feature_set = FeatureSet([feat])\n    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)\n    df = calculator.run(np.array([0]))\n\n    name = feat.get_name()\n    instances = df[name]\n    assert instances.values[0] == 3\n\n\ndef test_make_agg_feat_of_agg_feat(es):\n    log_count_feat = Feature(\n        es[\"log\"].ww[\"id\"],\n        parent_dataframe_name=\"sessions\",\n        primitive=Count,\n    )\n\n    customer_sum_feat = Feature(\n        log_count_feat,\n        parent_dataframe_name=\"customers\",\n        primitive=Sum,\n    )\n\n    feature_set = FeatureSet([customer_sum_feat])\n    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)\n    df = calculator.run(np.array([0]))\n    v = df[customer_sum_feat.get_name()].values[0]\n    assert v == 10\n\n\n@pytest.fixture\ndef df():\n    return pd.DataFrame(\n        {\n            \"id\": [\"a\", \"b\", \"c\", \"d\", \"e\"],\n            \"e1\": [\"h\", \"h\", \"i\", \"i\", \"j\"],\n            \"e2\": [\"x\", \"x\", \"y\", \"y\", \"x\"],\n            \"e3\": [\"z\", \"z\", \"z\", \"z\", \"z\"],\n            \"val\": [1, 1, 1, 1, 1],\n        },\n    )\n\n\ndef test_make_3_stacked_agg_feats(df):\n    \"\"\"\n    Tests stacking 3 agg features.\n\n    The test specifically uses non numeric indices to test how ancestor columns are handled\n    as dataframes are merged together\n\n    \"\"\"\n    es = EntitySet()\n    ltypes = {\"e1\": Categorical, \"e2\": Categorical, \"e3\": Categorical, \"val\": Double}\n    es.add_dataframe(\n        
dataframe=df,\n        index=\"id\",\n        dataframe_name=\"e0\",\n        logical_types=ltypes,\n    )\n\n    es.normalize_dataframe(\n        base_dataframe_name=\"e0\",\n        new_dataframe_name=\"e1\",\n        index=\"e1\",\n        additional_columns=[\"e2\", \"e3\"],\n    )\n\n    es.normalize_dataframe(\n        base_dataframe_name=\"e1\",\n        new_dataframe_name=\"e2\",\n        index=\"e2\",\n        additional_columns=[\"e3\"],\n    )\n\n    es.normalize_dataframe(\n        base_dataframe_name=\"e2\",\n        new_dataframe_name=\"e3\",\n        index=\"e3\",\n    )\n\n    sum_1 = Feature(es[\"e0\"].ww[\"val\"], parent_dataframe_name=\"e1\", primitive=Sum)\n    sum_2 = Feature(sum_1, parent_dataframe_name=\"e2\", primitive=Sum)\n    sum_3 = Feature(sum_2, parent_dataframe_name=\"e3\", primitive=Sum)\n\n    feature_set = FeatureSet([sum_3])\n    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)\n    df = calculator.run(np.array([\"z\"]))\n    v = df[sum_3.get_name()][0]\n    assert v == 5\n\n\ndef test_make_dfeat_of_agg_feat_on_self(es):\n    \"\"\"\n    The graph looks like this:\n\n        R       R = Regions, a parent of customers\n        |\n        C       C = Customers, the dataframe we're trying to predict on\n        |\n       etc.\n\n    We're trying to calculate a DFeat from C to R on an agg_feat of R on C.\n    \"\"\"\n    customer_count_feat = Feature(\n        es[\"customers\"].ww[\"id\"],\n        parent_dataframe_name=\"régions\",\n        primitive=Count,\n    )\n\n    num_customers_feat = DirectFeature(\n        customer_count_feat,\n        child_dataframe_name=\"customers\",\n    )\n\n    feature_set = FeatureSet([num_customers_feat])\n    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)\n    df = calculator.run(np.array([0]))\n    v = df[num_customers_feat.get_name()].values[0]\n    assert v == 3\n\n\ndef test_make_dfeat_of_agg_feat_through_parent(es):\n    \"\"\"\n    
The graph looks like this:\n\n        R       C = Customers, the dataframe we're trying to predict on\n       / \\\\     R = Regions, a parent of customers\n      S   C     S = Stores, a child of regions\n          |\n         etc.\n\n    We're trying to calculate a DFeat from C to R on an agg_feat of R on S.\n    \"\"\"\n    store_id_feat = IdentityFeature(es[\"stores\"].ww[\"id\"])\n\n    store_count_feat = Feature(\n        store_id_feat,\n        parent_dataframe_name=\"régions\",\n        primitive=Count,\n    )\n\n    num_stores_feat = DirectFeature(store_count_feat, child_dataframe_name=\"customers\")\n\n    feature_set = FeatureSet([num_stores_feat])\n    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)\n    df = calculator.run(np.array([0]))\n    v = df[num_stores_feat.get_name()].values[0]\n    assert v == 3\n\n\ndef test_make_deep_agg_feat_of_dfeat_of_agg_feat(es):\n    \"\"\"\n    The graph looks like this (higher implies parent):\n\n          C     C = Customers, the dataframe we're trying to predict on\n          |     S = Sessions, a child of Customers\n      P   S     L = Log, a child of both Sessions and Products\n       \\\\ /     P = Products, a parent of Log which is not a descendant of customers\n        L\n\n    We're trying to calculate a DFeat from L to P on an agg_feat of P on L, and\n    then aggregate it with another agg_feat of C on L.\n    \"\"\"\n    log_count_feat = Feature(\n        es[\"log\"].ww[\"id\"],\n        parent_dataframe_name=\"products\",\n        primitive=Count,\n    )\n\n    product_purchases_feat = DirectFeature(log_count_feat, child_dataframe_name=\"log\")\n\n    purchase_popularity = Feature(\n        product_purchases_feat,\n        parent_dataframe_name=\"customers\",\n        primitive=Mean,\n    )\n\n    feature_set = FeatureSet([purchase_popularity])\n    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)\n    df = calculator.run(np.array([0]))\n    v = 
df[purchase_popularity.get_name()].values[0]\n    assert v == 38.0 / 10.0\n\n\ndef test_deep_agg_feat_chain(es):\n    \"\"\"\n    Agg feat of agg feat:\n        region.Mean(customer.Count(Log))\n    \"\"\"\n    customer_count_feat = Feature(\n        es[\"log\"].ww[\"id\"],\n        parent_dataframe_name=\"customers\",\n        primitive=Count,\n    )\n\n    region_avg_feat = Feature(\n        customer_count_feat,\n        parent_dataframe_name=\"régions\",\n        primitive=Mean,\n    )\n\n    feature_set = FeatureSet([region_avg_feat])\n    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)\n    df = calculator.run(np.array([\"United States\"]))\n\n    v = df[region_avg_feat.get_name()][0]\n    assert v == 17 / 3.0\n\n\ndef test_topn(es):\n    topn = Feature(\n        es[\"log\"].ww[\"product_id\"],\n        parent_dataframe_name=\"customers\",\n        primitive=NMostCommon(n=2),\n    )\n    feature_set = FeatureSet([topn])\n\n    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)\n    df = calculator.run(np.array([0, 1, 2]))\n    true_results = pd.DataFrame(\n        [\n            [\"toothpaste\", \"coke zero\"],\n            [\"coke zero\", \"Haribo sugar-free gummy bears\"],\n            [\"taco clock\", np.nan],\n        ],\n    )\n    assert all(name in df.columns for name in topn.get_feature_names())\n\n    for i in range(df.shape[0]):\n        true = true_results.loc[i]\n        actual = df.loc[i]\n        if i == 0:\n            # coke zero and toothpaste have the same number of occurrences\n            assert set(true.values) == set(actual.values)\n        else:\n            for i1, i2 in zip(true, actual):\n                assert (pd.isnull(i1) and pd.isnull(i2)) or (i1 == i2)\n\n\ndef test_trend(es):\n    trend = Feature(\n        [Feature(es[\"log\"].ww[\"value\"]), Feature(es[\"log\"].ww[\"datetime\"])],\n        parent_dataframe_name=\"customers\",\n        primitive=Trend,\n    )\n    feature_set = 
FeatureSet([trend])\n\n    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)\n    df = calculator.run(np.array([0, 1, 2]))\n\n    true_results = [-0.812730, 4.870378, np.nan]\n\n    np.testing.assert_almost_equal(\n        df[trend.get_name()].tolist(),\n        true_results,\n        decimal=5,\n    )\n\n\ndef test_direct_squared(es):\n    feature = IdentityFeature(es[\"log\"].ww[\"value\"])\n    squared = feature * feature\n    feature_set = FeatureSet([feature, squared])\n    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)\n    df = calculator.run(np.array([0, 1, 2]))\n    for i, row in df.iterrows():\n        assert (row[0] * row[0]) == row[1]\n\n\ndef test_agg_empty_child(es):\n    customer_count_feat = Feature(\n        es[\"log\"].ww[\"id\"],\n        parent_dataframe_name=\"customers\",\n        primitive=Count,\n    )\n    feature_set = FeatureSet([customer_count_feat])\n\n    # time last before the customer had any events, so child frame is empty\n    calculator = FeatureSetCalculator(\n        es,\n        time_last=datetime(2011, 4, 8),\n        feature_set=feature_set,\n    )\n    df = calculator.run(np.array([0]))\n\n    assert df[\"COUNT(log)\"].iloc[0] == 0\n\n\ndef test_diamond_entityset(diamond_es):\n    es = diamond_es\n\n    amount = IdentityFeature(es[\"transactions\"].ww[\"amount\"])\n    path = backward_path(es, [\"regions\", \"customers\", \"transactions\"])\n    through_customers = AggregationFeature(\n        amount,\n        \"regions\",\n        primitive=Sum,\n        relationship_path=path,\n    )\n    path = backward_path(es, [\"regions\", \"stores\", \"transactions\"])\n    through_stores = AggregationFeature(\n        amount,\n        \"regions\",\n        primitive=Sum,\n        relationship_path=path,\n    )\n\n    feature_set = FeatureSet([through_customers, through_stores])\n    calculator = FeatureSetCalculator(\n        es,\n        time_last=datetime(2011, 4, 8),\n   
     feature_set=feature_set,\n    )\n    df = calculator.run(np.array([0, 1, 2]))\n\n    assert (df[\"SUM(stores.transactions.amount)\"] == [94, 261, 128]).all()\n    assert (df[\"SUM(customers.transactions.amount)\"] == [72, 411, 0]).all()\n\n\ndef test_two_relationships_to_single_dataframe(games_es):\n    es = games_es\n    home_team, away_team = es.relationships\n    path = RelationshipPath([(False, home_team)])\n    mean_at_home = AggregationFeature(\n        Feature(es[\"games\"].ww[\"home_team_score\"]),\n        \"teams\",\n        relationship_path=path,\n        primitive=Mean,\n    )\n    path = RelationshipPath([(False, away_team)])\n    mean_at_away = AggregationFeature(\n        Feature(es[\"games\"].ww[\"away_team_score\"]),\n        \"teams\",\n        relationship_path=path,\n        primitive=Mean,\n    )\n    home_team_mean = DirectFeature(mean_at_home, \"games\", relationship=home_team)\n    away_team_mean = DirectFeature(mean_at_away, \"games\", relationship=away_team)\n\n    feature_set = FeatureSet([home_team_mean, away_team_mean])\n    calculator = FeatureSetCalculator(\n        es,\n        time_last=datetime(2011, 8, 28),\n        feature_set=feature_set,\n    )\n    df = calculator.run(np.array(range(3)))\n\n    assert (df[home_team_mean.get_name()] == [1.5, 1.5, 2.5]).all()\n    assert (df[away_team_mean.get_name()] == [1, 0.5, 2]).all()\n\n\n@pytest.fixture\ndef parent_child():\n    parent_df = pd.DataFrame({\"id\": [1]})\n    child_df = pd.DataFrame(\n        {\n            \"id\": [1, 2, 3],\n            \"parent_id\": [1, 1, 1],\n            \"time_index\": pd.date_range(start=\"1/1/2018\", periods=3),\n            \"value\": [10, 5, 2],\n            \"cat\": [\"a\", \"a\", \"b\"],\n        },\n    ).astype({\"cat\": \"category\"})\n    return (parent_df, child_df)\n\n\ndef test_empty_child_dataframe(parent_child):\n    parent_df, child_df = parent_child\n    child_ltypes = {\n        \"parent_id\": Integer,\n        \"time_index\": 
Datetime,\n        \"value\": Double,\n        \"cat\": Categorical,\n    }\n\n    es = EntitySet(id=\"blah\")\n    es.add_dataframe(dataframe_name=\"parent\", dataframe=parent_df, index=\"id\")\n    es.add_dataframe(\n        dataframe_name=\"child\",\n        dataframe=child_df,\n        index=\"id\",\n        time_index=\"time_index\",\n        logical_types=child_ltypes,\n    )\n    es.add_relationship(\"parent\", \"id\", \"child\", \"parent_id\")\n\n    # create regular agg\n    count = Feature(\n        es[\"child\"].ww[\"id\"],\n        parent_dataframe_name=\"parent\",\n        primitive=Count,\n    )\n\n    # create agg feature that requires multiple arguments\n    trend = Feature(\n        [Feature(es[\"child\"].ww[\"value\"]), Feature(es[\"child\"].ww[\"time_index\"])],\n        parent_dataframe_name=\"parent\",\n        primitive=Trend,\n    )\n\n    # create multi-output agg feature\n    n_most_common = Feature(\n        es[\"child\"].ww[\"cat\"],\n        parent_dataframe_name=\"parent\",\n        primitive=NMostCommon,\n    )\n\n    # create aggs with where\n    where = Feature(es[\"child\"].ww[\"value\"]) == 1\n    count_where = Feature(\n        es[\"child\"].ww[\"id\"],\n        parent_dataframe_name=\"parent\",\n        where=where,\n        primitive=Count,\n    )\n    trend_where = Feature(\n        [Feature(es[\"child\"].ww[\"value\"]), Feature(es[\"child\"].ww[\"time_index\"])],\n        parent_dataframe_name=\"parent\",\n        where=where,\n        primitive=Trend,\n    )\n    n_most_common_where = Feature(\n        es[\"child\"].ww[\"cat\"],\n        parent_dataframe_name=\"parent\",\n        where=where,\n        primitive=NMostCommon,\n    )\n\n    features = [\n        count,\n        count_where,\n        trend,\n        trend_where,\n        n_most_common,\n        n_most_common_where,\n    ]\n    data = {\n        count.get_name(): pd.Series([0], dtype=\"Int64\"),\n        count_where.get_name(): pd.Series([0], dtype=\"Int64\"),\n   
     trend.get_name(): pd.Series([np.nan], dtype=\"float\"),\n        trend_where.get_name(): pd.Series([np.nan], dtype=\"float\"),\n    }\n    for name in n_most_common.get_feature_names():\n        data[name] = pd.Series([np.nan], dtype=\"category\")\n    for name in n_most_common_where.get_feature_names():\n        data[name] = pd.Series([np.nan], dtype=\"category\")\n\n    answer = pd.DataFrame(data)\n\n    # cutoff time before all rows\n    fm = calculate_feature_matrix(\n        entityset=es,\n        features=features,\n        cutoff_time=pd.Timestamp(\"12/31/2017\"),\n    )\n\n    for column in data.keys():\n        pd.testing.assert_series_equal(\n            fm[column],\n            answer[column],\n            check_names=False,\n            check_index=False,\n        )\n\n    # cutoff time after all rows, but where clause filters all rows\n    data = {\n        count_where.get_name(): pd.Series([0], dtype=\"Int64\"),\n        trend_where.get_name(): pd.Series([np.nan], dtype=\"float\"),\n    }\n    for name in n_most_common_where.get_feature_names():\n        data[name] = pd.Series([np.nan], dtype=\"category\")\n    answer = pd.DataFrame(data)\n\n    for column in data.keys():\n        pd.testing.assert_series_equal(\n            fm[column],\n            answer[column],\n            check_names=False,\n            check_index=False,\n        )\n\n\ndef test_with_features_built_from_es_metadata(es):\n    metadata = es.metadata\n\n    agg_feat = Feature(\n        metadata[\"log\"].ww[\"id\"],\n        parent_dataframe_name=\"customers\",\n        primitive=Count,\n    )\n\n    feature_set = FeatureSet([agg_feat])\n    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)\n    df = calculator.run(np.array([0]))\n    v = df[agg_feat.get_name()].values[0]\n    assert v == 10\n\n\ndef test_handles_primitive_function_name_uniqueness(es):\n    class SumTimesN(AggregationPrimitive):\n        name = \"sum_times_n\"\n        input_types 
= [ColumnSchema(semantic_tags={\"numeric\"})]\n        return_type = ColumnSchema(semantic_tags={\"numeric\"})\n\n        def __init__(self, n):\n            self.n = n\n\n        def get_function(self):\n            def my_function(values):\n                return values.sum() * self.n\n\n            return my_function\n\n    # works as expected\n    f1 = Feature(\n        es[\"log\"].ww[\"value\"],\n        parent_dataframe_name=\"customers\",\n        primitive=SumTimesN(n=1),\n    )\n    fm = calculate_feature_matrix(features=[f1], entityset=es)\n\n    value_sum = pd.Series([56, 26, 0])\n    assert all(fm[f1.get_name()].sort_index() == value_sum)\n\n    # works as expected\n    f2 = Feature(\n        es[\"log\"].ww[\"value\"],\n        parent_dataframe_name=\"customers\",\n        primitive=SumTimesN(n=2),\n    )\n    fm = calculate_feature_matrix(features=[f2], entityset=es)\n\n    double_value_sum = pd.Series([112, 52, 0])\n    assert all(fm[f2.get_name()].sort_index() == double_value_sum)\n\n    # same primitive, same column, different args\n    fm = calculate_feature_matrix(features=[f1, f2], entityset=es)\n\n    assert all(fm[f1.get_name()].sort_index() == value_sum)\n    assert all(fm[f2.get_name()].sort_index() == double_value_sum)\n\n    # different primitives, same function returned by get_function,\n    # different base features\n    f3 = Feature(\n        es[\"log\"].ww[\"value\"],\n        parent_dataframe_name=\"customers\",\n        primitive=Sum,\n    )\n    f4 = Feature(\n        es[\"log\"].ww[\"purchased\"],\n        parent_dataframe_name=\"customers\",\n        primitive=NumTrue,\n    )\n    fm = calculate_feature_matrix(features=[f3, f4], entityset=es)\n\n    purchased_sum = pd.Series([10, 1, 1])\n    assert all(fm[f3.get_name()].sort_index() == value_sum)\n    assert all(fm[f4.get_name()].sort_index() == purchased_sum)\n\n    # different primitives, same function returned by get_function,\n    # same base feature\n    class 
Sum1(AggregationPrimitive):\n        \"\"\"Sums elements of a numeric or boolean feature.\"\"\"\n\n        name = \"sum1\"\n        input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n        return_type = ColumnSchema(semantic_tags={\"numeric\"})\n        stack_on_self = False\n        stack_on_exclude = [Count]\n        default_value = 0\n\n        def get_function(self):\n            return np.sum\n\n    class Sum2(AggregationPrimitive):\n        \"\"\"Sums elements of a numeric or boolean feature.\"\"\"\n\n        name = \"sum2\"\n        input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n        return_type = ColumnSchema(semantic_tags={\"numeric\"})\n        stack_on_self = False\n        stack_on_exclude = [Count]\n        default_value = 0\n\n        def get_function(self):\n            return np.sum\n\n    class Sum3(AggregationPrimitive):\n        \"\"\"Sums elements of a numeric or boolean feature.\"\"\"\n\n        name = \"sum3\"\n        input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n        return_type = ColumnSchema(semantic_tags={\"numeric\"})\n        stack_on_self = False\n        stack_on_exclude = [Count]\n        default_value = 0\n\n        def get_function(self):\n            return np.sum\n\n    f5 = Feature(\n        es[\"log\"].ww[\"value\"],\n        parent_dataframe_name=\"customers\",\n        primitive=Sum1,\n    )\n    f6 = Feature(\n        es[\"log\"].ww[\"value\"],\n        parent_dataframe_name=\"customers\",\n        primitive=Sum2,\n    )\n    f7 = Feature(\n        es[\"log\"].ww[\"value\"],\n        parent_dataframe_name=\"customers\",\n        primitive=Sum3,\n    )\n    fm = calculate_feature_matrix(features=[f5, f6, f7], entityset=es)\n    assert all(fm[f5.get_name()].sort_index() == value_sum)\n    assert all(fm[f6.get_name()].sort_index() == value_sum)\n    assert all(fm[f7.get_name()].sort_index() == value_sum)\n\n\ndef test_returns_order_of_instance_ids(es):\n    feature_set = 
FeatureSet([Feature(es[\"customers\"].ww[\"age\"])])\n    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)\n\n    instance_ids = [0, 1, 2]\n    assert list(es[\"customers\"][\"id\"]) != instance_ids\n\n    df = calculator.run(np.array(instance_ids))\n\n    assert list(df.index) == instance_ids\n\n\ndef test_calls_progress_callback(es):\n    # call with all feature types. make sure progress callback calls sum to 1\n    identity = Feature(es[\"customers\"].ww[\"age\"])\n    direct = Feature(es[\"cohorts\"].ww[\"cohort_name\"], \"customers\")\n    agg = Feature(\n        es[\"sessions\"].ww[\"id\"],\n        parent_dataframe_name=\"customers\",\n        primitive=Count,\n    )\n    agg_apply = Feature(\n        es[\"log\"].ww[\"datetime\"],\n        parent_dataframe_name=\"customers\",\n        primitive=TimeSinceLast,\n    )  # this feature is handled differently than simple features\n    trans = Feature(agg, primitive=Negate)\n    trans_full = Feature(agg, primitive=CumSum)\n    groupby_trans = Feature(\n        agg,\n        primitive=CumSum,\n        groupby=Feature(es[\"customers\"].ww[\"cohort\"]),\n    )\n\n    all_features = [\n        identity,\n        direct,\n        agg,\n        agg_apply,\n        trans,\n        trans_full,\n        groupby_trans,\n    ]\n\n    feature_set = FeatureSet(all_features)\n    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)\n\n    class MockProgressCallback:\n        def __init__(self):\n            self.total = 0\n\n        def __call__(self, update):\n            self.total += update\n\n    mock_progress_callback = MockProgressCallback()\n\n    instance_ids = [0, 1, 2]\n    calculator.run(np.array(instance_ids), mock_progress_callback)\n\n    assert np.isclose(mock_progress_callback.total, 1)\n\n    # testing again with a time_last with no data\n    feature_set = FeatureSet(all_features)\n    calculator = FeatureSetCalculator(\n        es,\n        
time_last=pd.Timestamp(\"1950\"),\n        feature_set=feature_set,\n    )\n\n    mock_progress_callback = MockProgressCallback()\n    calculator.run(np.array(instance_ids), mock_progress_callback)\n\n    assert np.isclose(mock_progress_callback.total, 1)\n\n\n# precalculated_features is only used with approximate\ndef test_precalculated_features(es):\n    error_msg = (\n        \"This primitive should never be used because the features are precalculated\"\n    )\n\n    class ErrorPrim(AggregationPrimitive):\n        \"\"\"A primitive whose function raises an error.\"\"\"\n\n        name = \"error_prim\"\n        input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n        return_type = ColumnSchema(semantic_tags={\"numeric\"})\n\n        def get_function(self):\n            def error(s):\n                raise RuntimeError(error_msg)\n\n            return error\n\n    value = Feature(es[\"log\"].ww[\"value\"])\n    agg = Feature(value, parent_dataframe_name=\"sessions\", primitive=ErrorPrim)\n    agg2 = Feature(agg, parent_dataframe_name=\"customers\", primitive=ErrorPrim)\n    direct = Feature(agg2, dataframe_name=\"sessions\")\n\n    # Set up a FeatureSet which knows which features are precalculated.\n    precalculated_feature_trie = Trie(default=set, path_constructor=RelationshipPath)\n    precalculated_feature_trie.get_node(direct.relationship_path).value.add(\n        agg2.unique_name(),\n    )\n    feature_set = FeatureSet(\n        [direct],\n        approximate_feature_trie=precalculated_feature_trie,\n    )\n\n    # Fake precalculated data.\n    values = [0, 1, 2]\n    parent_fm = pd.DataFrame({agg2.get_name(): values})\n    precalculated_fm_trie = Trie(path_constructor=RelationshipPath)\n    precalculated_fm_trie.get_node(direct.relationship_path).value = parent_fm\n\n    calculator = FeatureSetCalculator(\n        es,\n        feature_set=feature_set,\n        precalculated_features=precalculated_fm_trie,\n    )\n\n    instance_ids = [0, 2, 3, 
5]\n    fm = calculator.run(np.array(instance_ids))\n\n    assert list(fm[direct.get_name()]) == [values[0], values[0], values[1], values[2]]\n\n    # Calculating without precalculated features should error.\n    with pytest.raises(RuntimeError, match=error_msg):\n        FeatureSetCalculator(es, feature_set=FeatureSet([direct])).run(instance_ids)\n\n\ndef test_nunique_nested_with_agg_bug(es):\n    \"\"\"Pandas 2.2.0 has a bug where pd.Series.nunique produces columns with\n    the category dtype instead of int64 dtype, causing an error when we attempt\n    another aggregation\"\"\"\n    num_unique_feature = AggregationFeature(\n        Feature(es[\"log\"].ww[\"priority_level\"]),\n        \"sessions\",\n        primitive=NumUnique,\n    )\n\n    mean_nunique_feature = AggregationFeature(\n        num_unique_feature,\n        \"customers\",\n        primitive=Mean,\n    )\n    feature_set = FeatureSet([mean_nunique_feature])\n    calculator = FeatureSetCalculator(es, time_last=None, feature_set=feature_set)\n    df = calculator.run(np.array([0]))\n\n    assert df.iloc[0, 0].round(4) == 1.6667\n"
  },
  {
    "path": "featuretools/tests/computational_backend/test_utils.py",
    "content": "import numpy as np\n\nfrom featuretools import dfs\nfrom featuretools.computational_backends import replace_inf_values\nfrom featuretools.primitives import DivideByFeature, DivideNumericScalar\n\n\ndef test_replace_inf_values(divide_by_zero_es):\n    div_by_scalar = DivideNumericScalar(value=0)\n    div_by_feature = DivideByFeature(value=1)\n    div_by_feature_neg = DivideByFeature(value=-1)\n    for primitive in [\n        \"divide_numeric\",\n        div_by_scalar,\n        div_by_feature,\n        div_by_feature_neg,\n    ]:\n        fm, _ = dfs(\n            entityset=divide_by_zero_es,\n            target_dataframe_name=\"zero\",\n            trans_primitives=[primitive],\n            max_depth=1,\n        )\n        assert np.inf in fm.values or -np.inf in fm.values\n        replaced_fm = replace_inf_values(fm)\n        assert np.inf not in replaced_fm.values\n        assert -np.inf not in replaced_fm.values\n\n        custom_value_fm = replace_inf_values(fm, replacement_value=\"custom_val\")\n        assert np.inf not in custom_value_fm.values\n        assert -np.inf not in replaced_fm.values\n        assert \"custom_val\" in custom_value_fm.values\n\n\ndef test_replace_inf_values_specify_cols(divide_by_zero_es):\n    div_by_scalar = DivideNumericScalar(value=0)\n    fm, _ = dfs(\n        entityset=divide_by_zero_es,\n        target_dataframe_name=\"zero\",\n        trans_primitives=[div_by_scalar],\n        max_depth=1,\n    )\n\n    assert np.inf in fm[\"col1 / 0\"].values\n    replaced_fm = replace_inf_values(fm, columns=[\"col1 / 0\"])\n    assert np.inf not in replaced_fm[\"col1 / 0\"].values\n    assert np.inf in replaced_fm[\"col2 / 0\"].values\n"
  },
  {
    "path": "featuretools/tests/config_tests/__init__.py",
    "content": ""
  },
  {
    "path": "featuretools/tests/config_tests/test_config.py",
    "content": "from featuretools import config\n\n\ndef test_get_default_config_does_not_change():\n    old_config = config.get_all()\n\n    key = \"primitive_data_folder\"\n    value = \"This is an example string\"\n    config.set({key: value})\n    config.set_to_default()\n\n    assert config.get(key) != value\n\n    config.set(old_config)\n\n\ndef test_set_and_get_config():\n    key = \"primitive_data_folder\"\n    old_value = config.get(key)\n    value = \"This is an example string\"\n\n    config.set({key: value})\n    assert config.get(key) == value\n\n    config.set({key: old_value})\n\n\ndef test_get_all():\n    assert config.get_all() == config._data\n"
  },
  {
    "path": "featuretools/tests/conftest.py",
    "content": "import contextlib\nimport copy\nimport os\n\nimport composeml as cp\nimport numpy as np\nimport pandas as pd\nimport pytest\nfrom packaging.version import parse\nfrom woodwork.column_schema import ColumnSchema\n\nfrom featuretools import EntitySet, demo\nfrom featuretools.primitives import AggregationPrimitive, TransformPrimitive\nfrom featuretools.tests.testing_utils import make_ecommerce_entityset\n\n\n@pytest.fixture()\ndef dask_cluster():\n    distributed = pytest.importorskip(\n        \"distributed\",\n        reason=\"Dask not installed, skipping\",\n    )\n    if distributed:\n        with distributed.LocalCluster() as cluster:\n            yield cluster\n\n\n@pytest.fixture()\ndef three_worker_dask_cluster():\n    distributed = pytest.importorskip(\n        \"distributed\",\n        reason=\"Dask not installed, skipping\",\n    )\n    if distributed:\n        with distributed.LocalCluster(n_workers=3) as cluster:\n            yield cluster\n\n\n@pytest.fixture(scope=\"session\")\ndef make_es():\n    return make_ecommerce_entityset()\n\n\n@pytest.fixture(scope=\"session\")\ndef make_int_es():\n    return make_ecommerce_entityset(with_integer_time_index=True)\n\n\n@pytest.fixture\ndef es(make_es):\n    return copy.deepcopy(make_es)\n\n\n@pytest.fixture\ndef int_es(make_int_es):\n    return copy.deepcopy(make_int_es)\n\n\n@pytest.fixture\ndef latlong_df():\n    df = pd.DataFrame({\"idx\": [0, 1, 2], \"latLong\": [pd.NA, (1, 2), (pd.NA, pd.NA)]})\n    return df\n\n\n@pytest.fixture\ndef diamond_es():\n    countries_df = pd.DataFrame({\"id\": range(2), \"name\": [\"US\", \"Canada\"]})\n    regions_df = pd.DataFrame(\n        {\n            \"id\": range(3),\n            \"country_id\": [0, 0, 1],\n            \"name\": [\"Northeast\", \"South\", \"Quebec\"],\n        },\n    ).astype({\"name\": \"category\"})\n    stores_df = pd.DataFrame(\n        {\n            \"id\": range(5),\n            \"region_id\": [0, 1, 2, 2, 1],\n            
\"square_ft\": [2000, 3000, 1500, 2500, 2700],\n        },\n    )\n    customers_df = pd.DataFrame(\n        {\n            \"id\": range(5),\n            \"region_id\": [1, 0, 0, 1, 1],\n            \"name\": [\"A\", \"B\", \"C\", \"D\", \"E\"],\n        },\n    )\n    transactions_df = pd.DataFrame(\n        {\n            \"id\": range(8),\n            \"store_id\": [4, 4, 2, 3, 4, 0, 1, 1],\n            \"customer_id\": [3, 0, 2, 4, 3, 3, 2, 3],\n            \"amount\": [100, 40, 45, 83, 13, 94, 27, 81],\n        },\n    )\n\n    dataframes = {\n        \"countries\": (countries_df, \"id\"),\n        \"regions\": (regions_df, \"id\"),\n        \"stores\": (stores_df, \"id\"),\n        \"customers\": (customers_df, \"id\"),\n        \"transactions\": (transactions_df, \"id\"),\n    }\n    relationships = [\n        (\"countries\", \"id\", \"regions\", \"country_id\"),\n        (\"regions\", \"id\", \"stores\", \"region_id\"),\n        (\"regions\", \"id\", \"customers\", \"region_id\"),\n        (\"stores\", \"id\", \"transactions\", \"store_id\"),\n        (\"customers\", \"id\", \"transactions\", \"customer_id\"),\n    ]\n    return EntitySet(\n        id=\"ecommerce_diamond\",\n        dataframes=dataframes,\n        relationships=relationships,\n    )\n\n\n@pytest.fixture\ndef default_value_es():\n    transactions = pd.DataFrame(\n        {\"id\": [1, 2, 3, 4], \"session_id\": [\"a\", \"a\", \"b\", \"c\"], \"value\": [1, 1, 1, 1]},\n    )\n\n    sessions = pd.DataFrame({\"id\": [\"a\", \"b\"]})\n\n    es = EntitySet()\n    es.add_dataframe(dataframe_name=\"transactions\", dataframe=transactions, index=\"id\")\n    es.add_dataframe(dataframe_name=\"sessions\", dataframe=sessions, index=\"id\")\n\n    es.add_relationship(\"sessions\", \"id\", \"transactions\", \"session_id\")\n    return es\n\n\n@pytest.fixture\ndef home_games_es():\n    teams = pd.DataFrame({\"id\": range(3), \"name\": [\"Breakers\", \"Spirit\", \"Thorns\"]})\n    games = pd.DataFrame(\n      
  {\n            \"id\": range(5),\n            \"home_team_id\": [2, 2, 1, 0, 1],\n            \"away_team_id\": [1, 0, 2, 1, 0],\n            \"home_team_score\": [3, 0, 1, 0, 4],\n            \"away_team_score\": [2, 1, 2, 0, 0],\n        },\n    )\n    dataframes = {\"teams\": (teams, \"id\"), \"games\": (games, \"id\")}\n    relationships = [(\"teams\", \"id\", \"games\", \"home_team_id\")]\n    return EntitySet(dataframes=dataframes, relationships=relationships)\n\n\n@pytest.fixture\ndef games_es(home_games_es):\n    return home_games_es.add_relationship(\"teams\", \"id\", \"games\", \"away_team_id\")\n\n\n@pytest.fixture\ndef mock_customer():\n    return demo.load_mock_customer(return_entityset=True, random_seed=0)\n\n\n@pytest.fixture\ndef lt(es):\n    def label_func(df):\n        return df[\"value\"].sum() > 10\n\n    kwargs = {\n        \"time_index\": \"datetime\",\n        \"labeling_function\": label_func,\n        \"window_size\": \"1m\",\n    }\n    if parse(cp.__version__) >= parse(\"0.10.0\"):\n        kwargs[\"target_dataframe_index\"] = \"id\"\n    else:\n        kwargs[\"target_dataframe_name\"] = \"id\"  # pragma: no cover\n\n    lm = cp.LabelMaker(**kwargs)\n\n    df = es[\"log\"]\n    labels = lm.search(df, num_examples_per_instance=-1)\n    labels = labels.rename(columns={\"cutoff_time\": \"time\"})\n    return labels\n\n\n@pytest.fixture\ndef dataframes():\n    cards_df = pd.DataFrame({\"id\": [1, 2, 3, 4, 5]})\n    transactions_df = pd.DataFrame(\n        {\n            \"id\": [1, 2, 3, 4, 5, 6],\n            \"card_id\": [1, 2, 1, 3, 4, 5],\n            \"transaction_time\": [10, 12, 13, 20, 21, 20],\n            \"fraud\": [True, False, False, False, True, True],\n        },\n    )\n    dataframes = {\n        \"cards\": (cards_df, \"id\"),\n        \"transactions\": (transactions_df, \"id\", \"transaction_time\"),\n    }\n    return dataframes\n\n\n@pytest.fixture\ndef relationships():\n    return [(\"cards\", \"id\", \"transactions\", 
\"card_id\")]\n\n\n@pytest.fixture\ndef transform_es():\n    # Create dataframe\n    df = pd.DataFrame(\n        {\n            \"a\": [14, 12, 10],\n            \"b\": [False, False, True],\n            \"b1\": [True, True, False],\n            \"b12\": [4, 5, 6],\n            \"P\": [10, 15, 12],\n        },\n    )\n    es = EntitySet(id=\"test\")\n    # Add dataframe to entityset\n    es.add_dataframe(\n        dataframe_name=\"first\",\n        dataframe=df,\n        index=\"index\",\n        make_index=True,\n    )\n\n    return es\n\n\n@pytest.fixture\ndef divide_by_zero_es():\n    df = pd.DataFrame(\n        {\n            \"id\": [0, 1, 2, 3],\n            \"col1\": [1, 0, -3, 4],\n            \"col2\": [0, 0, 0, 4],\n        },\n    )\n    return EntitySet(\"data\", {\"zero\": (df, \"id\", None)})\n\n\n@pytest.fixture\ndef window_series():\n    return pd.Series(\n        range(20),\n        index=pd.date_range(start=\"2020-01-01\", end=\"2020-01-20\"),\n    )\n\n\n@pytest.fixture\ndef window_date_range():\n    return pd.date_range(start=\"2022-11-1\", end=\"2022-11-5\", periods=30)\n\n\n@pytest.fixture\ndef rolling_outlier_series():\n    return pd.Series(\n        [0] * 4 + [10] + [0] * 4 + [10] + [0] * 5,\n        index=pd.date_range(start=\"2020-01-01\", end=\"2020-01-15\", periods=15),\n    )\n\n\n@pytest.fixture\ndef postal_code_dataframe():\n    df = pd.DataFrame(\n        {\n            \"string_dtype\": pd.Series([\"90210\", \"60018\", \"10010\", \"92304-4201\"]),\n            \"int_dtype\": pd.Series([10000, 20000, 30000]).astype(\"category\"),\n            \"has_nulls\": pd.Series([np.nan, 20000, 30000]).astype(\"category\"),\n        },\n    )\n    df.ww.init(\n        logical_types={\n            \"string_dtype\": \"PostalCode\",\n            \"int_dtype\": \"PostalCode\",\n            \"has_nulls\": \"PostalCode\",\n        },\n    )\n    return df\n\n\ndef create_test_credentials(test_path):\n    with open(test_path, \"w+\") as f:\n        
f.write(\"[test]\\n\")\n        f.write(\"aws_access_key_id=AKIAIOSFODNN7EXAMPLE\\n\")\n        f.write(\"aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY\\n\")\n\n\ndef create_test_config(test_path_config):\n    with open(test_path_config, \"w+\") as f:\n        f.write(\"[profile test]\\n\")\n        f.write(\"region=us-east-2\\n\")\n        f.write(\"output=text\\n\")\n\n\n@pytest.fixture\ndef setup_test_profile(monkeypatch, tmp_path):\n    cache = tmp_path.joinpath(\".cache\")\n    cache.mkdir()\n    test_path = str(cache.joinpath(\"test_credentials\"))\n    test_path_config = str(cache.joinpath(\"test_config\"))\n    monkeypatch.setenv(\"AWS_SHARED_CREDENTIALS_FILE\", test_path)\n    monkeypatch.setenv(\"AWS_CONFIG_FILE\", test_path_config)\n    monkeypatch.delenv(\"AWS_ACCESS_KEY_ID\", raising=False)\n    monkeypatch.delenv(\"AWS_SECRET_ACCESS_KEY\", raising=False)\n    monkeypatch.setenv(\"AWS_PROFILE\", \"test\")\n\n    with contextlib.suppress(OSError):\n        os.remove(test_path)\n        os.remove(test_path_config)  # pragma: no cover\n\n    create_test_credentials(test_path)\n    create_test_config(test_path_config)\n    yield\n    os.remove(test_path)\n    os.remove(test_path_config)\n\n\n@pytest.fixture\ndef test_aggregation_primitive():\n    class TestAgg(AggregationPrimitive):\n        name = \"test\"\n        input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n        return_type = ColumnSchema(semantic_tags={\"numeric\"})\n        stack_on = []\n\n    return TestAgg\n\n\n@pytest.fixture\ndef test_transform_primitive():\n    class TestTransform(TransformPrimitive):\n        name = \"test\"\n        input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n        return_type = ColumnSchema(semantic_tags={\"numeric\"})\n        stack_on = []\n\n    return TestTransform\n\n\n@pytest.fixture\ndef strings_that_have_triggered_errors_before():\n    return [\n        \"    \",\n        '\"This Borderlands game here\"\" is the 
perfect conclusion to the \"\"Borderlands 3\"\" line, which focuses on the fans \"\"favorite character and gives the players the opportunity to close for a long time some very important questions about\\'s character and the memorable scenery with which the players interact.',\n    ]\n"
  },
  {
    "path": "featuretools/tests/demo_tests/__init__.py",
    "content": ""
  },
  {
    "path": "featuretools/tests/demo_tests/test_demo_data.py",
    "content": "import urllib.request\n\nimport pandas as pd\nimport pytest\n\nfrom featuretools import EntitySet\nfrom featuretools.demo import load_flight, load_mock_customer, load_retail, load_weather\n\n\n@pytest.fixture(autouse=True)\ndef set_testing_headers():\n    opener = urllib.request.build_opener()\n    opener.addheaders = [(\"Testing\", \"True\")]\n    urllib.request.install_opener(opener)\n\n\ndef test_load_retail_diff():\n    nrows = 10\n    es_first = load_retail(nrows=nrows)\n    assert isinstance(es_first, EntitySet)\n    assert es_first[\"order_products\"].shape[0] == nrows\n    nrows_second = 11\n    es_second = load_retail(nrows=nrows_second)\n    assert es_second[\"order_products\"].shape[0] == nrows_second\n\n\ndef test_mock_customer():\n    n_customers = 4\n    n_products = 3\n    n_sessions = 30\n    n_transactions = 400\n    es = load_mock_customer(\n        n_customers=n_customers,\n        n_products=n_products,\n        n_sessions=n_sessions,\n        n_transactions=n_transactions,\n        random_seed=0,\n        return_entityset=True,\n    )\n    assert isinstance(es, EntitySet)\n    df_names = [df.ww.name for df in es.dataframes]\n    expected_names = [\"transactions\", \"products\", \"sessions\", \"customers\"]\n    assert set(expected_names) == set(df_names)\n    assert len(es[\"customers\"]) == 4\n    assert len(es[\"products\"]) == 3\n    assert len(es[\"sessions\"]) == 30\n    assert len(es[\"transactions\"]) == 400\n\n\ndef test_load_flight():\n    es = load_flight(\n        month_filter=[1],\n        categorical_filter={\"origin_city\": [\"Charlotte, NC\"]},\n        return_single_table=False,\n        nrows=1000,\n    )\n    assert isinstance(es, EntitySet)\n    dataframe_names = [\"airports\", \"flights\", \"trip_logs\", \"airlines\"]\n    realvals = [(11, 3), (13, 9), (103, 21), (1, 1)]\n    for i, name in enumerate(dataframe_names):\n        assert es[name].shape == realvals[i]\n\n\ndef test_weather():\n    es = 
load_weather()\n    assert isinstance(es, EntitySet)\n    dataframe_names = [\"temperatures\"]\n    realvals = [(3650, 3)]\n    for i, name in enumerate(dataframe_names):\n        assert es[name].shape == realvals[i]\n    es = load_weather(return_single_table=True)\n    assert isinstance(es, pd.DataFrame)\n"
  },
  {
    "path": "featuretools/tests/entityset_tests/__init__.py",
    "content": ""
  },
  {
    "path": "featuretools/tests/entityset_tests/test_es.py",
    "content": "import copy\nimport logging\nimport pickle\nimport re\nfrom datetime import datetime\nfrom unittest.mock import patch\n\nimport numpy as np\nimport pandas as pd\nimport pytest\nfrom woodwork.logical_types import (\n    URL,\n    Boolean,\n    Categorical,\n    CountryCode,\n    Datetime,\n    Double,\n    EmailAddress,\n    Integer,\n    LatLong,\n    NaturalLanguage,\n    Ordinal,\n    PostalCode,\n    SubRegionCode,\n)\n\nfrom featuretools import Relationship\nfrom featuretools.demo import load_retail\nfrom featuretools.entityset import EntitySet\nfrom featuretools.entityset.entityset import LTI_COLUMN_NAME, WW_SCHEMA_KEY\nfrom featuretools.tests.testing_utils import get_df_tags\n\n\ndef test_normalize_time_index_as_additional_column(es):\n    error_text = \"Not moving signup_date as it is the base time index column. Perhaps, move the column to the copy_columns.\"\n    with pytest.raises(ValueError, match=error_text):\n        assert \"signup_date\" in es[\"customers\"].columns\n        es.normalize_dataframe(\n            base_dataframe_name=\"customers\",\n            new_dataframe_name=\"cancellations\",\n            index=\"cancel_reason\",\n            make_time_index=\"signup_date\",\n            additional_columns=[\"signup_date\"],\n            copy_columns=[],\n        )\n\n\ndef test_normalize_time_index_as_copy_column(es):\n    assert \"signup_date\" in es[\"customers\"].columns\n    es.normalize_dataframe(\n        base_dataframe_name=\"customers\",\n        new_dataframe_name=\"cancellations\",\n        index=\"cancel_reason\",\n        make_time_index=\"signup_date\",\n        additional_columns=[],\n        copy_columns=[\"signup_date\"],\n    )\n\n    assert \"signup_date\" in es[\"customers\"].columns\n    assert es[\"customers\"].ww.time_index == \"signup_date\"\n    assert \"signup_date\" in es[\"cancellations\"].columns\n    assert es[\"cancellations\"].ww.time_index == \"signup_date\"\n\n\ndef 
test_normalize_time_index_as_copy_column_new_time_index(es):\n    assert \"signup_date\" in es[\"customers\"].columns\n    es.normalize_dataframe(\n        base_dataframe_name=\"customers\",\n        new_dataframe_name=\"cancellations\",\n        index=\"cancel_reason\",\n        make_time_index=True,\n        additional_columns=[],\n        copy_columns=[\"signup_date\"],\n    )\n\n    assert \"signup_date\" in es[\"customers\"].columns\n    assert es[\"customers\"].ww.time_index == \"signup_date\"\n    assert \"first_customers_time\" in es[\"cancellations\"].columns\n    assert \"signup_date\" not in es[\"cancellations\"].columns\n    assert es[\"cancellations\"].ww.time_index == \"first_customers_time\"\n\n\ndef test_normalize_time_index_as_copy_column_no_time_index(es):\n    assert \"signup_date\" in es[\"customers\"].columns\n    es.normalize_dataframe(\n        base_dataframe_name=\"customers\",\n        new_dataframe_name=\"cancellations\",\n        index=\"cancel_reason\",\n        make_time_index=False,\n        additional_columns=[],\n        copy_columns=[\"signup_date\"],\n    )\n\n    assert \"signup_date\" in es[\"customers\"].columns\n    assert es[\"customers\"].ww.time_index == \"signup_date\"\n    assert \"signup_date\" in es[\"cancellations\"].columns\n    assert es[\"cancellations\"].ww.time_index is None\n\n\ndef test_cannot_re_add_relationships_that_already_exists(es):\n    warn_text = \"Not adding duplicate relationship: \" + str(es.relationships[0])\n    before_len = len(es.relationships)\n    rel = es.relationships[0]\n    with pytest.warns(UserWarning, match=warn_text):\n        es.add_relationship(relationship=rel)\n    with pytest.warns(UserWarning, match=warn_text):\n        es.add_relationship(\n            rel._parent_dataframe_name,\n            rel._parent_column_name,\n            rel._child_dataframe_name,\n            rel._child_column_name,\n        )\n    after_len = len(es.relationships)\n    assert before_len == 
after_len\n\n\ndef test_add_relationships_convert_type(es):\n    for r in es.relationships:\n        parent_df = es[r.parent_dataframe.ww.name]\n        child_df = es[r.child_dataframe.ww.name]\n        assert parent_df.ww.index == r._parent_column_name\n        assert \"foreign_key\" in r.child_column.ww.semantic_tags\n        assert str(parent_df[r._parent_column_name].dtype) == str(\n            child_df[r._child_column_name].dtype,\n        )\n\n\ndef test_add_relationship_diff_param_logical_types(es):\n    ordinal_1 = Ordinal(order=[0, 1, 2, 3, 4, 5, 6])\n    ordinal_2 = Ordinal(order=[0, 1, 2, 3, 4, 5])\n    es[\"sessions\"].ww.set_types(logical_types={\"id\": ordinal_1})\n    log_2_df = es[\"log\"].copy()\n    log_logical_types = {\n        \"id\": Integer,\n        \"session_id\": ordinal_2,\n        \"product_id\": Categorical(),\n        \"datetime\": Datetime,\n        \"value\": Double,\n        \"value_2\": Double,\n        \"latlong\": LatLong,\n        \"latlong2\": LatLong,\n        \"zipcode\": PostalCode,\n        \"countrycode\": CountryCode,\n        \"subregioncode\": SubRegionCode,\n        \"value_many_nans\": Double,\n        \"priority_level\": Ordinal(order=[0, 1, 2]),\n        \"purchased\": Boolean,\n        \"comments\": NaturalLanguage,\n        \"url\": URL,\n        \"email_address\": EmailAddress,\n    }\n    log_semantic_tags = {\"session_id\": \"foreign_key\", \"product_id\": \"foreign_key\"}\n    assert set(log_logical_types) == set(log_2_df.columns)\n    es.add_dataframe(\n        dataframe_name=\"log2\",\n        dataframe=log_2_df,\n        index=\"id\",\n        logical_types=log_logical_types,\n        semantic_tags=log_semantic_tags,\n        time_index=\"datetime\",\n    )\n    assert \"log2\" in es.dataframe_dict\n    assert es[\"log2\"].ww.schema is not None\n    assert isinstance(es[\"log2\"].ww.logical_types[\"session_id\"], Ordinal)\n    assert isinstance(es[\"sessions\"].ww.logical_types[\"id\"], Ordinal)\n    assert 
(\n        es[\"sessions\"].ww.logical_types[\"id\"]\n        != es[\"log2\"].ww.logical_types[\"session_id\"]\n    )\n\n    warning_text = \"Changing child logical type to match parent.\"\n    with pytest.warns(UserWarning, match=warning_text):\n        es.add_relationship(\"sessions\", \"id\", \"log2\", \"session_id\")\n    assert isinstance(es[\"log2\"].ww.logical_types[\"product_id\"], Categorical)\n    assert isinstance(es[\"products\"].ww.logical_types[\"id\"], Categorical)\n\n\ndef test_add_relationship_different_logical_types_same_dtype(es):\n    log_2_df = es[\"log\"].copy()\n    log_logical_types = {\n        \"id\": Integer,\n        \"session_id\": Integer,\n        \"product_id\": CountryCode,\n        \"datetime\": Datetime,\n        \"value\": Double,\n        \"value_2\": Double,\n        \"latlong\": LatLong,\n        \"latlong2\": LatLong,\n        \"zipcode\": PostalCode,\n        \"countrycode\": CountryCode,\n        \"subregioncode\": SubRegionCode,\n        \"value_many_nans\": Double,\n        \"priority_level\": Ordinal(order=[0, 1, 2]),\n        \"purchased\": Boolean,\n        \"comments\": NaturalLanguage,\n        \"url\": URL,\n        \"email_address\": EmailAddress,\n    }\n    log_semantic_tags = {\"session_id\": \"foreign_key\", \"product_id\": \"foreign_key\"}\n    assert set(log_logical_types) == set(log_2_df.columns)\n    es.add_dataframe(\n        dataframe_name=\"log2\",\n        dataframe=log_2_df,\n        index=\"id\",\n        logical_types=log_logical_types,\n        semantic_tags=log_semantic_tags,\n        time_index=\"datetime\",\n    )\n    assert \"log2\" in es.dataframe_dict\n    assert es[\"log2\"].ww.schema is not None\n    assert isinstance(es[\"log2\"].ww.logical_types[\"product_id\"], CountryCode)\n    assert isinstance(es[\"products\"].ww.logical_types[\"id\"], Categorical)\n\n    warning_text = \"Logical type CountryCode for child column product_id does not match parent column id logical type Categorical. 
Changing child logical type to match parent.\"\n    with pytest.warns(UserWarning, match=warning_text):\n        es.add_relationship(\"products\", \"id\", \"log2\", \"product_id\")\n    assert isinstance(es[\"log2\"].ww.logical_types[\"product_id\"], Categorical)\n    assert isinstance(es[\"products\"].ww.logical_types[\"id\"], Categorical)\n    assert \"foreign_key\" in es[\"log2\"].ww.semantic_tags[\"product_id\"]\n\n\ndef test_add_relationship_different_compatible_dtypes(es):\n    log_2_df = es[\"log\"].copy()\n    log_logical_types = {\n        \"id\": Integer,\n        \"session_id\": Datetime,\n        \"product_id\": Categorical,\n        \"datetime\": Datetime,\n        \"value\": Double,\n        \"value_2\": Double,\n        \"latlong\": LatLong,\n        \"latlong2\": LatLong,\n        \"zipcode\": PostalCode,\n        \"countrycode\": CountryCode,\n        \"subregioncode\": SubRegionCode,\n        \"value_many_nans\": Double,\n        \"priority_level\": Ordinal(order=[0, 1, 2]),\n        \"purchased\": Boolean,\n        \"comments\": NaturalLanguage,\n        \"url\": URL,\n        \"email_address\": EmailAddress,\n    }\n    log_semantic_tags = {\"session_id\": \"foreign_key\", \"product_id\": \"foreign_key\"}\n    assert set(log_logical_types) == set(log_2_df.columns)\n    es.add_dataframe(\n        dataframe_name=\"log2\",\n        dataframe=log_2_df,\n        index=\"id\",\n        logical_types=log_logical_types,\n        semantic_tags=log_semantic_tags,\n        time_index=\"datetime\",\n    )\n    assert \"log2\" in es.dataframe_dict\n    assert es[\"log2\"].ww.schema is not None\n    assert isinstance(es[\"log2\"].ww.logical_types[\"session_id\"], Datetime)\n    assert isinstance(es[\"customers\"].ww.logical_types[\"id\"], Integer)\n\n    warning_text = \"Logical type Datetime for child column session_id does not match parent column id logical type Integer. 
Changing child logical type to match parent.\"\n    with pytest.warns(UserWarning, match=warning_text):\n        es.add_relationship(\"customers\", \"id\", \"log2\", \"session_id\")\n    assert isinstance(es[\"log2\"].ww.logical_types[\"session_id\"], Integer)\n    assert isinstance(es[\"customers\"].ww.logical_types[\"id\"], Integer)\n\n\ndef test_add_relationship_errors_child_v_index(es):\n    new_df = es[\"log\"].ww.copy()\n    new_df.ww._schema.name = \"log2\"\n    es.add_dataframe(dataframe=new_df)\n\n    to_match = \"Unable to add relationship because child column 'id' in 'log2' is also its index\"\n    with pytest.raises(ValueError, match=to_match):\n        es.add_relationship(\"log\", \"id\", \"log2\", \"id\")\n\n\ndef test_add_relationship_empty_child_convert_dtype(es):\n    relationship = Relationship(es, \"sessions\", \"id\", \"log\", \"session_id\")\n    empty_log_df = pd.DataFrame(columns=es[\"log\"].columns)\n\n    es.add_dataframe(empty_log_df, \"log\")\n\n    assert len(es[\"log\"]) == 0\n    # session_id will be Unknown logical type with dtype string\n    assert es[\"log\"][\"session_id\"].dtype == \"string\"\n\n    es.relationships.remove(relationship)\n    assert relationship not in es.relationships\n\n    es.add_relationship(relationship=relationship)\n    assert es[\"log\"][\"session_id\"].dtype == \"int64\"\n\n\ndef test_add_relationship_with_relationship_object(es):\n    relationship = Relationship(es, \"sessions\", \"id\", \"log\", \"session_id\")\n    es.add_relationship(relationship=relationship)\n    assert relationship in es.relationships\n\n\ndef test_add_relationships_with_relationship_object(es):\n    relationships = [Relationship(es, \"sessions\", \"id\", \"log\", \"session_id\")]\n    es.add_relationships(relationships)\n    assert relationships[0] in es.relationships\n\n\ndef test_add_relationship_error(es):\n    relationship = Relationship(es, \"sessions\", \"id\", \"log\", \"session_id\")\n    error_message = (\n        \"Cannot 
specify dataframe and column name values and also supply a Relationship\"\n    )\n    with pytest.raises(ValueError, match=error_message):\n        es.add_relationship(parent_dataframe_name=\"sessions\", relationship=relationship)\n\n\ndef test_query_by_values_returns_rows_in_given_order():\n    data = pd.DataFrame(\n        {\n            \"id\": [1, 2, 3, 4, 5],\n            \"value\": [\"a\", \"c\", \"b\", \"a\", \"a\"],\n            \"time\": [1000, 2000, 3000, 4000, 5000],\n        },\n    )\n\n    es = EntitySet()\n    es = es.add_dataframe(\n        dataframe=data,\n        dataframe_name=\"test\",\n        index=\"id\",\n        time_index=\"time\",\n        logical_types={\"value\": \"Categorical\"},\n    )\n    query = es.query_by_values(\"test\", [\"b\", \"a\"], column_name=\"value\")\n    assert np.array_equal(query[\"id\"], [1, 3, 4, 5])\n\n\ndef test_query_by_values_secondary_time_index(es):\n    end = np.datetime64(datetime(2011, 10, 1))\n    all_instances = [0, 1, 2]\n    result = es.query_by_values(\"customers\", all_instances, time_last=end)\n\n    for col in [\"cancel_date\", \"cancel_reason\"]:\n        nulls = result.loc[all_instances][col].isnull() == [False, True, True]\n        assert nulls.all(), \"Some instance has data it shouldn't for column %s\" % col\n\n\ndef test_query_by_id(es):\n    df = es.query_by_values(\"log\", instance_vals=[0])\n    assert df[\"id\"].values[0] == 0\n\n\ndef test_query_by_single_value(es):\n    df = es.query_by_values(\"log\", instance_vals=0)\n    assert df[\"id\"].values[0] == 0\n\n\ndef test_query_by_df(es):\n    instance_df = pd.DataFrame({\"id\": [1, 3], \"vals\": [0, 1]})\n    df = es.query_by_values(\"log\", instance_vals=instance_df)\n\n    assert np.array_equal(df[\"id\"], [1, 3])\n\n\ndef test_query_by_id_with_time(es):\n    df = es.query_by_values(\n        dataframe_name=\"log\",\n        instance_vals=[0, 1, 2, 3, 4],\n        time_last=datetime(2011, 4, 9, 10, 30, 2 * 6),\n    )\n\n    assert 
list(df[\"id\"].values) == [0, 1, 2]\n\n\ndef test_query_by_column_with_time(es):\n    df = es.query_by_values(\n        dataframe_name=\"log\",\n        instance_vals=[0, 1, 2],\n        column_name=\"session_id\",\n        time_last=datetime(2011, 4, 9, 10, 50, 0),\n    )\n\n    true_values = [i * 5 for i in range(5)] + [i * 1 for i in range(4)] + [0]\n\n    assert list(df[\"id\"].values) == list(range(10))\n    assert list(df[\"value\"].values) == true_values\n\n\ndef test_query_by_column_with_no_lti_and_training_window(es):\n    match = (\n        \"Using training_window but last_time_index is not set for dataframe customers\"\n    )\n    with pytest.warns(UserWarning, match=match):\n        df = es.query_by_values(\n            dataframe_name=\"customers\",\n            instance_vals=[0, 1, 2],\n            column_name=\"cohort\",\n            time_last=datetime(2011, 4, 11),\n            training_window=\"3d\",\n        )\n\n    assert list(df[\"id\"].values) == [1]\n    assert list(df[\"age\"].values) == [25]\n\n\ndef test_query_by_column_with_lti_and_training_window(es):\n    es.add_last_time_indexes()\n    df = es.query_by_values(\n        dataframe_name=\"customers\",\n        instance_vals=[0, 1, 2],\n        column_name=\"cohort\",\n        time_last=datetime(2011, 4, 11),\n        training_window=\"3d\",\n    )\n    df = df.reset_index(drop=True).sort_values(\"id\")\n    assert list(df[\"id\"].values) == [0, 1, 2]\n    assert list(df[\"age\"].values) == [33, 25, 56]\n\n\ndef test_query_by_indexed_column(es):\n    df = es.query_by_values(\n        dataframe_name=\"log\",\n        instance_vals=[\"taco clock\"],\n        column_name=\"product_id\",\n    )\n    df = df.reset_index(drop=True).sort_values(\"id\")\n    assert list(df[\"id\"].values) == [15, 16]\n\n\n@pytest.fixture\ndef df():\n    return pd.DataFrame({\"id\": [0, 1, 2], \"category\": [\"a\", \"b\", \"c\"]})\n\n\ndef test_check_columns_and_dataframe(df):\n    # matches\n    logical_types = 
{\"id\": Integer, \"category\": Categorical}\n    es = EntitySet(id=\"test\")\n    es.add_dataframe(\n        df,\n        dataframe_name=\"test_dataframe\",\n        index=\"id\",\n        logical_types=logical_types,\n    )\n    assert isinstance(\n        es.dataframe_dict[\"test_dataframe\"].ww.logical_types[\"category\"],\n        Categorical,\n    )\n    assert es.dataframe_dict[\"test_dataframe\"].ww.semantic_tags[\"category\"] == {\n        \"category\",\n    }\n\n\ndef test_make_index_any_location(df):\n    logical_types = {\"id\": Integer, \"category\": Categorical}\n\n    es = EntitySet(id=\"test\")\n    es.add_dataframe(\n        dataframe_name=\"test_dataframe\",\n        index=\"id1\",\n        make_index=True,\n        logical_types=logical_types,\n        dataframe=df,\n    )\n    assert es.dataframe_dict[\"test_dataframe\"].columns[0] == \"id1\"\n    assert es.dataframe_dict[\"test_dataframe\"].ww.index == \"id1\"\n\n\ndef test_replace_dataframe_and_create_index(es):\n    df = pd.DataFrame({\"ints\": [3, 4, 5], \"category\": [\"a\", \"b\", \"a\"]})\n    final_df = df.copy()\n    final_df[\"id\"] = [0, 1, 2]\n    needs_idx_df = df.copy()\n\n    logical_types = {\"ints\": Integer, \"category\": Categorical}\n    es.add_dataframe(\n        dataframe=df,\n        dataframe_name=\"test_df\",\n        index=\"id\",\n        make_index=True,\n        logical_types=logical_types,\n    )\n\n    assert es[\"test_df\"].ww.index == \"id\"\n\n    # DataFrame that needs the index column added\n    assert \"id\" not in needs_idx_df.columns\n    es.replace_dataframe(\"test_df\", needs_idx_df)\n\n    assert es[\"test_df\"].ww.index == \"id\"\n    df = es[\"test_df\"].sort_values(by=\"id\")\n    assert all(df[\"id\"] == final_df[\"id\"])\n    assert all(df[\"ints\"] == final_df[\"ints\"])\n\n\ndef test_replace_dataframe_created_index_present(es):\n    df = pd.DataFrame({\"ints\": [3, 4, 5], \"category\": [\"a\", \"b\", \"a\"]})\n\n    logical_types = {\"ints\": 
Integer, \"category\": Categorical}\n    es.add_dataframe(\n        dataframe=df,\n        dataframe_name=\"test_df\",\n        index=\"id\",\n        make_index=True,\n        logical_types=logical_types,\n    )\n\n    # DataFrame that already has the index column\n    has_idx_df = es[\"test_df\"].replace({0: 100})\n    has_idx_df.set_index(\"id\", drop=False, inplace=True)\n\n    assert \"id\" in has_idx_df.columns\n\n    es.replace_dataframe(\"test_df\", has_idx_df)\n    assert es[\"test_df\"].ww.index == \"id\"\n    df = es[\"test_df\"].sort_values(by=\"ints\")\n    assert all(df[\"id\"] == [100, 1, 2])\n\n\ndef test_index_any_location(df):\n    logical_types = {\"id\": Integer, \"category\": Categorical}\n\n    es = EntitySet(id=\"test\")\n    es.add_dataframe(\n        dataframe_name=\"test_dataframe\",\n        index=\"category\",\n        logical_types=logical_types,\n        dataframe=df,\n    )\n    assert es.dataframe_dict[\"test_dataframe\"].columns[1] == \"category\"\n    assert es.dataframe_dict[\"test_dataframe\"].ww.index == \"category\"\n\n\ndef test_extra_column_type(df):\n    # more columns\n    logical_types = {\"id\": Integer, \"category\": Categorical, \"category2\": Categorical}\n\n    error_text = re.escape(\n        \"logical_types contains columns that are not present in dataframe: ['category2']\",\n    )\n    with pytest.raises(LookupError, match=error_text):\n        es = EntitySet(id=\"test\")\n        es.add_dataframe(\n            dataframe_name=\"test_dataframe\",\n            index=\"id\",\n            logical_types=logical_types,\n            dataframe=df,\n        )\n\n\ndef test_add_parent_not_index_column(es):\n    error_text = \"Parent column 'language' is not the index of dataframe régions\"\n    with pytest.raises(AttributeError, match=error_text):\n        es.add_relationship(\"régions\", \"language\", \"customers\", \"région_id\")\n\n\n@pytest.fixture\ndef df2():\n    return pd.DataFrame({\"category\": [1, 2, 3], 
\"category2\": [\"1\", \"2\", \"3\"]})\n\n\ndef test_none_index(df2):\n    es = EntitySet(id=\"test\")\n\n    copy_df = df2.copy()\n    copy_df.ww.init(name=\"test_dataframe\")\n    error_msg = \"Cannot add Woodwork DataFrame to EntitySet without index\"\n    with pytest.raises(ValueError, match=error_msg):\n        es.add_dataframe(dataframe=copy_df)\n\n    warn_text = (\n        \"Using first column as index. To change this, specify the index parameter\"\n    )\n    with pytest.warns(UserWarning, match=warn_text):\n        es.add_dataframe(\n            dataframe_name=\"test_dataframe\",\n            logical_types={\"category\": \"Categorical\"},\n            dataframe=df2,\n        )\n    assert es[\"test_dataframe\"].ww.index == \"category\"\n    assert es[\"test_dataframe\"].ww.semantic_tags[\"category\"] == {\"index\"}\n    assert isinstance(es[\"test_dataframe\"].ww.logical_types[\"category\"], Categorical)\n\n\n@pytest.fixture\ndef df3():\n    return pd.DataFrame({\"category\": [1, 2, 3]})\n\n\ndef test_unknown_index(df3):\n    warn_text = \"index id not found in dataframe, creating new integer column\"\n    es = EntitySet(id=\"test\")\n    with pytest.warns(UserWarning, match=warn_text):\n        es.add_dataframe(\n            dataframe_name=\"test_dataframe\",\n            dataframe=df3,\n            index=\"id\",\n            logical_types={\"category\": \"Categorical\"},\n        )\n    assert es[\"test_dataframe\"].ww.index == \"id\"\n    assert list(es[\"test_dataframe\"][\"id\"]) == list(\n        range(3),\n    )\n\n\ndef test_doesnt_remake_index(df):\n    logical_types = {\"id\": \"Integer\", \"category\": \"Categorical\"}\n    error_text = \"Cannot make index: column with name id already present\"\n    with pytest.raises(RuntimeError, match=error_text):\n        es = EntitySet(id=\"test\")\n        es.add_dataframe(\n            dataframe_name=\"test_dataframe\",\n            index=\"id\",\n            make_index=True,\n            dataframe=df,\n 
           logical_types=logical_types,\n        )\n\n\ndef test_bad_time_index_column(df3):\n    logical_types = {\"category\": \"Categorical\"}\n    error_text = \"Specified time index column `time` not found in dataframe\"\n    with pytest.raises(LookupError, match=error_text):\n        es = EntitySet(id=\"test\")\n        es.add_dataframe(\n            dataframe_name=\"test_dataframe\",\n            dataframe=df3,\n            index=\"category\",\n            time_index=\"time\",\n            logical_types=logical_types,\n        )\n\n\n@pytest.fixture\ndef df4():\n    df = pd.DataFrame(\n        {\n            \"id\": [0, 1, 2],\n            \"category\": [\"a\", \"b\", \"a\"],\n            \"category_int\": [1, 2, 3],\n            \"ints\": [\"1\", \"2\", \"3\"],\n            \"floats\": [\"1\", \"2\", \"3.0\"],\n        },\n    )\n    df[\"category_int\"] = df[\"category_int\"].astype(\"category\")\n    return df\n\n\ndef test_converts_dtype_on_init(df4):\n    logical_types = {\"id\": Integer, \"ints\": Integer, \"floats\": Double}\n    es = EntitySet(id=\"test\")\n    df4.ww.init(name=\"test_dataframe\", index=\"id\", logical_types=logical_types)\n    es.add_dataframe(dataframe=df4)\n\n    df = es[\"test_dataframe\"]\n    assert df[\"ints\"].dtype.name == \"int64\"\n    assert df[\"floats\"].dtype.name == \"float64\"\n\n    # this is inferred from the pandas dtype\n    df = es[\"test_dataframe\"]\n    assert isinstance(df.ww.logical_types[\"category_int\"], Categorical)\n\n\ndef test_converts_dtype_after_init(df4):\n    category_dtype = \"category\"\n\n    df4[\"category\"] = df4[\"category\"].astype(category_dtype)\n\n    es = EntitySet(id=\"test\")\n    es.add_dataframe(\n        dataframe_name=\"test_dataframe\",\n        index=\"id\",\n        dataframe=df4,\n        logical_types=None,\n    )\n    df = es[\"test_dataframe\"]\n\n    df.ww.set_types(logical_types={\"ints\": \"Integer\"})\n    assert isinstance(df.ww.logical_types[\"ints\"], Integer)\n    assert 
df[\"ints\"].dtype == \"int64\"\n\n    df.ww.set_types(logical_types={\"ints\": \"Categorical\"})\n    assert isinstance(df.ww.logical_types[\"ints\"], Categorical)\n    assert df[\"ints\"].dtype == category_dtype\n\n    df.ww.set_types(logical_types={\"ints\": Ordinal(order=[1, 2, 3])})\n    assert df.ww.logical_types[\"ints\"] == Ordinal(order=[1, 2, 3])\n    assert df[\"ints\"].dtype == category_dtype\n\n    df.ww.set_types(logical_types={\"ints\": \"NaturalLanguage\"})\n    assert isinstance(df.ww.logical_types[\"ints\"], NaturalLanguage)\n    assert df[\"ints\"].dtype == \"string\"\n\n\n@pytest.fixture\ndef datetime1():\n    times = pd.date_range(\"1/1/2011\", periods=3, freq=\"H\")\n    time_strs = times.strftime(\"%Y-%m-%d\")\n    return pd.DataFrame({\"id\": [0, 1, 2], \"time\": time_strs})\n\n\ndef test_converts_datetime(datetime1):\n    # string converts to datetime correctly\n    # This test fails without defining logical types.\n    # Entityset infers time column should be numeric type\n    logical_types = {\"id\": Integer, \"time\": Datetime}\n\n    es = EntitySet(id=\"test\")\n    es.add_dataframe(\n        dataframe_name=\"test_dataframe\",\n        index=\"id\",\n        time_index=\"time\",\n        logical_types=logical_types,\n        dataframe=datetime1,\n    )\n    pd_col = es[\"test_dataframe\"][\"time\"]\n    assert isinstance(es[\"test_dataframe\"].ww.logical_types[\"time\"], Datetime)\n    assert type(pd_col[0]) == pd.Timestamp\n\n\n@pytest.fixture\ndef datetime2():\n    datetime_format = \"%d-%m-%Y\"\n    actual = pd.Timestamp(\"Jan 2, 2011\")\n    time_strs = [actual.strftime(datetime_format)] * 3\n    return pd.DataFrame(\n        {\"id\": [0, 1, 2], \"time_format\": time_strs, \"time_no_format\": time_strs},\n    )\n\n\ndef test_handles_datetime_format(datetime2):\n    # check if we load according to the format string\n    # pass in an ambiguous date\n    datetime_format = \"%d-%m-%Y\"\n    actual = pd.Timestamp(\"Jan 2, 2011\")\n\n    
logical_types = {\n        \"id\": Integer,\n        \"time_format\": (Datetime(datetime_format=datetime_format)),\n        \"time_no_format\": Datetime,\n    }\n\n    es = EntitySet(id=\"test\")\n    es.add_dataframe(\n        dataframe_name=\"test_dataframe\",\n        index=\"id\",\n        logical_types=logical_types,\n        dataframe=datetime2,\n    )\n\n    col_format = es[\"test_dataframe\"][\"time_format\"]\n    col_no_format = es[\"test_dataframe\"][\"time_no_format\"]\n    # without formatting pandas gets it wrong\n    assert (col_no_format != actual).all()\n\n    # with formatting we correctly get jan2\n    assert (col_format == actual).all()\n\n\ndef test_handles_datetime_mismatch():\n    # can't convert arbitrary strings\n    df = pd.DataFrame({\"id\": [0, 1, 2], \"time\": [\"a\", \"b\", \"tomorrow\"]})\n    logical_types = {\"id\": Integer, \"time\": Datetime}\n\n    error_text = \"Time index column must contain datetime or numeric values\"\n    with pytest.raises(TypeError, match=error_text):\n        es = EntitySet(id=\"test\")\n        es.add_dataframe(\n            df,\n            dataframe_name=\"test_dataframe\",\n            index=\"id\",\n            time_index=\"time\",\n            logical_types=logical_types,\n        )\n\n\ndef test_dataframe_init(es):\n    df = pd.DataFrame(\n        {\n            \"id\": [\"0\", \"1\", \"2\"],\n            \"time\": [datetime(2011, 4, 9, 10, 31, 3 * i) for i in range(3)],\n            \"category\": [\"a\", \"b\", \"a\"],\n            \"number\": [4, 5, 6],\n        },\n    )\n    logical_types = {\"id\": Categorical, \"time\": Datetime}\n    es.add_dataframe(\n        df.copy(),\n        dataframe_name=\"test_dataframe\",\n        index=\"id\",\n        time_index=\"time\",\n        logical_types=logical_types,\n    )\n    df_shape = df.shape\n\n    es_df_shape = es[\"test_dataframe\"].shape\n    assert es_df_shape == df_shape\n    assert es[\"test_dataframe\"].ww.index == \"id\"\n    assert 
es[\"test_dataframe\"].ww.time_index == \"time\"\n    assert set([v for v in es[\"test_dataframe\"].ww.columns]) == set(df.columns)\n\n    assert es[\"test_dataframe\"][\"time\"].dtype == df[\"time\"].dtype\n    assert set(es[\"test_dataframe\"][\"id\"]) == set(df[\"id\"])\n\n\n@pytest.fixture\ndef bad_df():\n    return pd.DataFrame({\"a\": [1, 2, 3], \"b\": [4, 5, 6], 3: [\"a\", \"b\", \"c\"]})\n\n\ndef test_nonstr_column_names(bad_df):\n    es = EntitySet(id=\"Failure\")\n    error_text = r\"All column names must be strings \\(Columns \\[3\\] are not strings\\)\"\n    with pytest.raises(ValueError, match=error_text):\n        es.add_dataframe(dataframe_name=\"str_cols\", dataframe=bad_df, index=\"a\")\n\n    bad_df.ww.init()\n    with pytest.raises(ValueError, match=error_text):\n        es.add_dataframe(dataframe_name=\"str_cols\", dataframe=bad_df)\n\n\ndef test_sort_time_id():\n    transactions_df = pd.DataFrame(\n        {\n            \"id\": [1, 2, 3, 4, 5, 6],\n            \"transaction_time\": pd.date_range(start=\"10:00\", periods=6, freq=\"10s\")[\n                ::-1\n            ],\n        },\n    )\n\n    es = EntitySet(\n        \"test\",\n        dataframes={\"t\": (transactions_df.copy(), \"id\", \"transaction_time\")},\n    )\n    assert es[\"t\"] is not transactions_df\n    times = list(es[\"t\"].transaction_time)\n    assert times == sorted(list(transactions_df.transaction_time))\n\n\ndef test_already_sorted_parameter():\n    transactions_df = pd.DataFrame(\n        {\n            \"id\": [1, 2, 3, 4, 5, 6],\n            \"transaction_time\": [\n                datetime(2014, 4, 6),\n                datetime(2012, 4, 8),\n                datetime(2012, 4, 8),\n                datetime(2013, 4, 8),\n                datetime(2015, 4, 8),\n                datetime(2016, 4, 9),\n            ],\n        },\n    )\n\n    es = EntitySet(id=\"test\")\n    es.add_dataframe(\n        transactions_df.copy(),\n        dataframe_name=\"t\",\n        
index=\"id\",\n        time_index=\"transaction_time\",\n        already_sorted=True,\n    )\n\n    assert es[\"t\"] is not transactions_df\n    times = list(es[\"t\"].transaction_time)\n    assert times == list(transactions_df.transaction_time)\n\n\ndef test_concat_not_inplace(es):\n    first_es = copy.deepcopy(es)\n    for df in first_es.dataframes:\n        new_df = df.loc[[], :]\n        first_es.replace_dataframe(df.ww.name, new_df)\n\n    second_es = copy.deepcopy(es)\n\n    # set the data description\n    first_es.metadata\n\n    new_es = first_es.concat(second_es)\n\n    assert new_es == es\n    assert new_es._data_description is None\n    assert first_es._data_description is not None\n\n\ndef test_concat_inplace(es):\n    first_es = copy.deepcopy(es)\n    second_es = copy.deepcopy(es)\n    for df in first_es.dataframes:\n        new_df = df.loc[[], :]\n        first_es.replace_dataframe(df.ww.name, new_df)\n\n    # set the data description\n    es.metadata\n\n    es.concat(first_es, inplace=True)\n\n    assert second_es == es\n    assert es._data_description is None\n\n\ndef test_concat_with_lti(es):\n    first_es = copy.deepcopy(es)\n    for df in first_es.dataframes:\n        new_df = df.loc[[], :]\n        first_es.replace_dataframe(df.ww.name, new_df)\n\n    second_es = copy.deepcopy(es)\n\n    first_es.add_last_time_indexes()\n    second_es.add_last_time_indexes()\n    es.add_last_time_indexes()\n\n    new_es = first_es.concat(second_es)\n\n    assert new_es == es\n\n    first_es[\"stores\"].ww.pop(LTI_COLUMN_NAME)\n    first_es[\"stores\"].ww.metadata.pop(\"last_time_index\")\n    second_es[\"stores\"].ww.pop(LTI_COLUMN_NAME)\n    second_es[\"stores\"].ww.metadata.pop(\"last_time_index\")\n\n    assert not first_es.__eq__(es, deep=False)\n    assert not second_es.__eq__(es, deep=False)\n    assert LTI_COLUMN_NAME not in first_es[\"stores\"]\n    assert LTI_COLUMN_NAME not in second_es[\"stores\"]\n\n    new_es = first_es.concat(second_es)\n\n    
assert new_es.__eq__(es, deep=True)\n    # stores will get last time index re-added because it has children that will get lti calculated\n    assert LTI_COLUMN_NAME in new_es[\"stores\"]\n\n\ndef test_concat_errors(es):\n    # entitysets are not equal\n    copy_es = copy.deepcopy(es)\n    copy_es[\"customers\"].ww.pop(\"phone_number\")\n\n    error = (\n        \"Entitysets must have the same dataframes, relationships\" \", and column names\"\n    )\n    with pytest.raises(ValueError, match=error):\n        es.concat(copy_es)\n\n\ndef test_concat_sort_index_with_time_index(es):\n    # only pandas dataframes sort on the index and time index\n    es1 = copy.deepcopy(es)\n    es1.replace_dataframe(\n        dataframe_name=\"customers\",\n        df=es[\"customers\"].loc[[0, 1], :],\n        already_sorted=True,\n    )\n    es2 = copy.deepcopy(es)\n    es2.replace_dataframe(\n        dataframe_name=\"customers\",\n        df=es[\"customers\"].loc[[2], :],\n        already_sorted=True,\n    )\n\n    combined_es_order_1 = es1.concat(es2)\n    combined_es_order_2 = es2.concat(es1)\n\n    assert list(combined_es_order_1[\"customers\"].index) == [2, 0, 1]\n    assert list(combined_es_order_2[\"customers\"].index) == [2, 0, 1]\n    assert combined_es_order_1.__eq__(es, deep=True)\n    assert combined_es_order_2.__eq__(es, deep=True)\n    assert combined_es_order_2.__eq__(combined_es_order_1, deep=True)\n\n\ndef test_concat_sort_index_without_time_index(es):\n    # Sorting is only performed on DataFrames with time indices\n    es1 = copy.deepcopy(es)\n    es1.replace_dataframe(\n        dataframe_name=\"products\",\n        df=es[\"products\"].iloc[[0, 1, 2], :],\n        already_sorted=True,\n    )\n    es2 = copy.deepcopy(es)\n    es2.replace_dataframe(\n        dataframe_name=\"products\",\n        df=es[\"products\"].iloc[[3, 4, 5], :],\n        already_sorted=True,\n    )\n\n    combined_es_order_1 = es1.concat(es2)\n    combined_es_order_2 = es2.concat(es1)\n\n    # 
order matters when we don't sort\n    assert list(combined_es_order_1[\"products\"].index) == [\n        \"Haribo sugar-free gummy bears\",\n        \"car\",\n        \"toothpaste\",\n        \"brown bag\",\n        \"coke zero\",\n        \"taco clock\",\n    ]\n    assert list(combined_es_order_2[\"products\"].index) == [\n        \"brown bag\",\n        \"coke zero\",\n        \"taco clock\",\n        \"Haribo sugar-free gummy bears\",\n        \"car\",\n        \"toothpaste\",\n    ]\n    assert combined_es_order_1.__eq__(es, deep=True)\n    assert not combined_es_order_2.__eq__(es, deep=True)\n    assert combined_es_order_2.__eq__(es, deep=False)\n    assert not combined_es_order_2.__eq__(combined_es_order_1, deep=True)\n\n\ndef test_concat_with_make_index(es):\n    df = pd.DataFrame({\"id\": [0, 1, 2], \"category\": [\"a\", \"b\", \"a\"]})\n    logical_types = {\"id\": Categorical, \"category\": Categorical}\n    es.add_dataframe(\n        dataframe=df,\n        dataframe_name=\"test_df\",\n        index=\"id1\",\n        make_index=True,\n        logical_types=logical_types,\n    )\n\n    es_1 = copy.deepcopy(es)\n    es_2 = copy.deepcopy(es)\n\n    assert es.__eq__(es_1, deep=True)\n    assert es.__eq__(es_2, deep=True)\n\n    # map of what rows to take from es_1 and es_2 for each dataframe\n    emap = {\n        \"log\": [list(range(10)) + [14, 15, 16], list(range(10, 14)) + [15, 16]],\n        \"sessions\": [[0, 1, 2], [1, 3, 4, 5]],\n        \"customers\": [[0, 2], [1, 2]],\n        \"test_df\": [[0, 1], [0, 2]],\n    }\n\n    for i, _es in enumerate([es_1, es_2]):\n        for df_name, rows in emap.items():\n            df = _es[df_name]\n            _es.replace_dataframe(dataframe_name=df_name, df=df.loc[rows[i]])\n\n    assert es.__eq__(es_1, deep=False)\n    assert es.__eq__(es_2, deep=False)\n    assert not es.__eq__(es_1, deep=True)\n    assert not es.__eq__(es_2, deep=True)\n\n    old_es_1 = copy.deepcopy(es_1)\n    old_es_2 = 
copy.deepcopy(es_2)\n    es_3 = es_1.concat(es_2)\n\n    assert old_es_1.__eq__(es_1, deep=True)\n    assert old_es_2.__eq__(es_2, deep=True)\n\n    assert es_3.__eq__(es, deep=True)\n\n\n@pytest.fixture\ndef transactions_df():\n    return pd.DataFrame(\n        {\n            \"id\": [1, 2, 3, 4, 5, 6],\n            \"card_id\": [1, 2, 1, 3, 4, 5],\n            \"transaction_time\": [10, 12, 13, 20, 21, 20],\n            \"fraud\": [True, False, False, False, True, True],\n        },\n    )\n\n\ndef test_set_time_type_on_init(transactions_df):\n    # create cards dataframe\n    cards_df = pd.DataFrame({\"id\": [1, 2, 3, 4, 5]})\n    cards_logical_types = None\n    transactions_logical_types = None\n    dataframes = {\n        \"cards\": (cards_df, \"id\", None, cards_logical_types),\n        \"transactions\": (\n            transactions_df,\n            \"id\",\n            \"transaction_time\",\n            transactions_logical_types,\n        ),\n    }\n    relationships = [(\"cards\", \"id\", \"transactions\", \"card_id\")]\n    es = EntitySet(\"fraud\", dataframes, relationships)\n    # assert time_type is set\n    assert es.time_type == \"numeric\"\n\n\ndef test_sets_time_when_adding_dataframe(transactions_df):\n    accounts_df = pd.DataFrame(\n        {\n            \"id\": [3, 4, 5],\n            \"signup_date\": [\n                datetime(2002, 5, 1),\n                datetime(2006, 3, 20),\n                datetime(2011, 11, 11),\n            ],\n        },\n    )\n    accounts_df_string = pd.DataFrame(\n        {\"id\": [3, 4, 5], \"signup_date\": [\"element\", \"exporting\", \"editable\"]},\n    )\n    accounts_logical_types = None\n    transactions_logical_types = None\n\n    # create empty entityset\n    es = EntitySet(\"fraud\")\n    # assert it's not set\n    assert getattr(es, \"time_type\", None) is None\n    # add dataframe\n    es.add_dataframe(\n        transactions_df,\n        dataframe_name=\"transactions\",\n        index=\"id\",\n        
time_index=\"transaction_time\",\n        logical_types=transactions_logical_types,\n    )\n    # assert time_type is set\n    assert es.time_type == \"numeric\"\n    # add another dataframe\n    es.normalize_dataframe(\"transactions\", \"cards\", \"card_id\", make_time_index=True)\n    # assert time_type unchanged\n    assert es.time_type == \"numeric\"\n    # add wrong time type dataframe\n    error_text = \"accounts time index is Datetime type which differs from other entityset time indexes\"\n    with pytest.raises(TypeError, match=error_text):\n        es.add_dataframe(\n            accounts_df,\n            dataframe_name=\"accounts\",\n            index=\"id\",\n            time_index=\"signup_date\",\n            logical_types=accounts_logical_types,\n        )\n\n    error_text = \"Time index column must contain datetime or numeric values\"\n    with pytest.raises(TypeError, match=error_text):\n        es.add_dataframe(\n            accounts_df_string,\n            dataframe_name=\"accounts\",\n            index=\"id\",\n            time_index=\"signup_date\",\n        )\n\n\ndef test_secondary_time_index_no_primary_time_index(es):\n    es[\"products\"].ww.set_types(logical_types={\"rating\": \"Datetime\"})\n    assert es[\"products\"].ww.time_index is None\n\n    error = (\n        \"Cannot set secondary time index on a DataFrame that has no primary time index.\"\n    )\n    with pytest.raises(ValueError, match=error):\n        es.set_secondary_time_index(\"products\", {\"rating\": [\"url\"]})\n\n    assert \"secondary_time_index\" not in es[\"products\"].ww.metadata\n    assert es[\"products\"].ww.time_index is None\n\n\ndef test_set_non_valid_time_index_type(es):\n    error_text = \"Time index column must be a Datetime or numeric column.\"\n    with pytest.raises(TypeError, match=error_text):\n        es[\"log\"].ww.set_time_index(\"purchased\")\n\n\ndef test_checks_time_type_setting_secondary_time_index(es):\n    # entityset is timestamp time type\n    
assert es.time_type == Datetime\n    # add secondary index that is timestamp type\n    new_2nd_ti = {\n        \"upgrade_date\": [\"upgrade_date\", \"favorite_quote\"],\n        \"cancel_date\": [\"cancel_date\", \"cancel_reason\"],\n    }\n    es.set_secondary_time_index(\"customers\", new_2nd_ti)\n    assert es.time_type == Datetime\n    # add secondary index that is numeric type\n    new_2nd_ti = {\"age\": [\"age\", \"loves_ice_cream\"]}\n\n    error_text = \"customers time index is numeric type which differs from other entityset time indexes\"\n    with pytest.raises(TypeError, match=error_text):\n        es.set_secondary_time_index(\"customers\", new_2nd_ti)\n    # add secondary index that is non-time type\n    new_2nd_ti = {\"favorite_quote\": [\"favorite_quote\", \"loves_ice_cream\"]}\n\n    error_text = \"customers time index not recognized as numeric or datetime\"\n    with pytest.raises(TypeError, match=error_text):\n        es.set_secondary_time_index(\"customers\", new_2nd_ti)\n    # add mismatched pair of secondary time indexes\n    new_2nd_ti = {\n        \"upgrade_date\": [\"upgrade_date\", \"favorite_quote\"],\n        \"age\": [\"age\", \"loves_ice_cream\"],\n    }\n\n    error_text = \"customers time index is numeric type which differs from other entityset time indexes\"\n    with pytest.raises(TypeError, match=error_text):\n        es.set_secondary_time_index(\"customers\", new_2nd_ti)\n\n    # create entityset with numeric time type\n    cards_df = pd.DataFrame({\"id\": [1, 2, 3, 4, 5]})\n    transactions_df = pd.DataFrame(\n        {\n            \"id\": [1, 2, 3, 4, 5, 6],\n            \"card_id\": [1, 2, 1, 3, 4, 5],\n            \"transaction_time\": [10, 12, 13, 20, 21, 20],\n            \"fraud_decision_time\": [11, 14, 15, 21, 22, 21],\n            \"transaction_city\": [\"City A\"] * 6,\n            \"transaction_date\": [datetime(1989, 2, i) for i in range(1, 7)],\n            \"fraud\": [True, False, False, False, True, True],\n        
},\n    )\n    dataframes = {\n        \"cards\": (cards_df, \"id\"),\n        \"transactions\": (transactions_df, \"id\", \"transaction_time\"),\n    }\n    relationships = [(\"cards\", \"id\", \"transactions\", \"card_id\")]\n    card_es = EntitySet(\"fraud\", dataframes, relationships)\n    assert card_es.time_type == \"numeric\"\n    # add secondary index that is numeric time type\n    new_2nd_ti = {\"fraud_decision_time\": [\"fraud_decision_time\", \"fraud\"]}\n    card_es.set_secondary_time_index(\"transactions\", new_2nd_ti)\n    assert card_es.time_type == \"numeric\"\n    # add secondary index that is timestamp type\n    new_2nd_ti = {\"transaction_date\": [\"transaction_date\", \"fraud\"]}\n\n    error_text = \"transactions time index is Datetime type which differs from other entityset time indexes\"\n    with pytest.raises(TypeError, match=error_text):\n        card_es.set_secondary_time_index(\"transactions\", new_2nd_ti)\n    # add secondary index that is non-time type\n    new_2nd_ti = {\"transaction_city\": [\"transaction_city\", \"fraud\"]}\n\n    error_text = \"transactions time index not recognized as numeric or datetime\"\n    with pytest.raises(TypeError, match=error_text):\n        card_es.set_secondary_time_index(\"transactions\", new_2nd_ti)\n    # add mixed secondary time indexes\n    new_2nd_ti = {\n        \"transaction_city\": [\"transaction_city\", \"fraud\"],\n        \"fraud_decision_time\": [\"fraud_decision_time\", \"fraud\"],\n    }\n    with pytest.raises(TypeError, match=error_text):\n        card_es.set_secondary_time_index(\"transactions\", new_2nd_ti)\n\n    # add bool secondary time index\n    error_text = \"transactions time index not recognized as numeric or datetime\"\n    with pytest.raises(TypeError, match=error_text):\n        card_es.set_secondary_time_index(\"transactions\", {\"fraud\": [\"fraud\"]})\n\n\ndef test_normalize_dataframe(es):\n    error_text = \"'additional_columns' must be a list, but received type.*\"\n  
  with pytest.raises(TypeError, match=error_text):\n        es.normalize_dataframe(\n            \"sessions\",\n            \"device_types\",\n            \"device_type\",\n            additional_columns=\"log\",\n        )\n\n    error_text = \"'copy_columns' must be a list, but received type.*\"\n    with pytest.raises(TypeError, match=error_text):\n        es.normalize_dataframe(\n            \"sessions\",\n            \"device_types\",\n            \"device_type\",\n            copy_columns=\"log\",\n        )\n\n    es.normalize_dataframe(\n        \"sessions\",\n        \"device_types\",\n        \"device_type\",\n        additional_columns=[\"device_name\"],\n        make_time_index=False,\n    )\n\n    assert len(es.get_forward_relationships(\"sessions\")) == 2\n    assert (\n        es.get_forward_relationships(\"sessions\")[1].parent_dataframe.ww.name\n        == \"device_types\"\n    )\n    assert \"device_name\" in es[\"device_types\"].columns\n    assert \"device_name\" not in es[\"sessions\"].columns\n    assert \"device_type\" in es[\"device_types\"].columns\n\n\ndef test_normalize_dataframe_add_index_as_column(es):\n    error_text = \"Not adding device_type as both index and column in additional_columns\"\n    with pytest.raises(ValueError, match=error_text):\n        es.normalize_dataframe(\n            \"sessions\",\n            \"device_types\",\n            \"device_type\",\n            additional_columns=[\"device_name\", \"device_type\"],\n            make_time_index=False,\n        )\n\n    error_text = \"Not adding device_type as both index and column in copy_columns\"\n    with pytest.raises(ValueError, match=error_text):\n        es.normalize_dataframe(\n            \"sessions\",\n            \"device_types\",\n            \"device_type\",\n            copy_columns=[\"device_name\", \"device_type\"],\n            make_time_index=False,\n        )\n\n\ndef test_normalize_dataframe_new_time_index_in_base_dataframe_error_check(es):\n    
error_text = \"'make_time_index' must be a column in the base dataframe\"\n    with pytest.raises(ValueError, match=error_text):\n        es.normalize_dataframe(\n            base_dataframe_name=\"customers\",\n            new_dataframe_name=\"cancellations\",\n            index=\"cancel_reason\",\n            make_time_index=\"non-existent\",\n        )\n\n\ndef test_normalize_dataframe_new_time_index_in_column_list_error_check(es):\n    error_text = (\n        \"'make_time_index' must be specified in 'additional_columns' or 'copy_columns'\"\n    )\n    with pytest.raises(ValueError, match=error_text):\n        es.normalize_dataframe(\n            base_dataframe_name=\"customers\",\n            new_dataframe_name=\"cancellations\",\n            index=\"cancel_reason\",\n            make_time_index=\"cancel_date\",\n        )\n\n\ndef test_normalize_dataframe_new_time_index_copy_success_check(es):\n    es.normalize_dataframe(\n        base_dataframe_name=\"customers\",\n        new_dataframe_name=\"cancellations\",\n        index=\"cancel_reason\",\n        make_time_index=\"cancel_date\",\n        additional_columns=[],\n        copy_columns=[\"cancel_date\"],\n    )\n\n\ndef test_normalize_dataframe_new_time_index_additional_success_check(es):\n    es.normalize_dataframe(\n        base_dataframe_name=\"customers\",\n        new_dataframe_name=\"cancellations\",\n        index=\"cancel_reason\",\n        make_time_index=\"cancel_date\",\n        additional_columns=[\"cancel_date\"],\n        copy_columns=[],\n    )\n\n\n@pytest.fixture\ndef normalize_es():\n    df = pd.DataFrame(\n        {\n            \"id\": [0, 1, 2, 3],\n            \"A\": [5, 4, 2, 3],\n            \"time\": [\n                datetime(2020, 6, 3),\n                (datetime(2020, 3, 12)),\n                datetime(2020, 5, 1),\n                datetime(2020, 4, 22),\n            ],\n        },\n    )\n    es = EntitySet(\"es\")\n    return es.add_dataframe(dataframe_name=\"data\", 
dataframe=df, index=\"id\")\n\n\ndef test_normalize_time_index_from_none(normalize_es):\n    assert normalize_es[\"data\"].ww.time_index is None\n\n    normalize_es.normalize_dataframe(\n        base_dataframe_name=\"data\",\n        new_dataframe_name=\"normalized\",\n        index=\"A\",\n        make_time_index=\"time\",\n        copy_columns=[\"time\"],\n    )\n    assert normalize_es[\"normalized\"].ww.time_index == \"time\"\n    df = normalize_es[\"normalized\"]\n\n    assert df[\"time\"].is_monotonic_increasing\n\n\ndef test_raise_error_if_duplicate_additional_columns_passed(es):\n    error_text = (\n        \"'additional_columns' contains duplicate columns. All columns must be unique.\"\n    )\n    with pytest.raises(ValueError, match=error_text):\n        es.normalize_dataframe(\n            \"sessions\",\n            \"device_types\",\n            \"device_type\",\n            additional_columns=[\"device_name\", \"device_name\"],\n        )\n\n\ndef test_raise_error_if_duplicate_copy_columns_passed(es):\n    error_text = (\n        \"'copy_columns' contains duplicate columns. 
All columns must be unique.\"\n    )\n    with pytest.raises(ValueError, match=error_text):\n        es.normalize_dataframe(\n            \"sessions\",\n            \"device_types\",\n            \"device_type\",\n            copy_columns=[\"device_name\", \"device_name\"],\n        )\n\n\ndef test_normalize_dataframe_copies_logical_types(es):\n    es[\"log\"].ww.set_types(\n        logical_types={\n            \"value\": Ordinal(\n                order=[0.0, 1.0, 2.0, 3.0, 5.0, 7.0, 10.0, 14.0, 15.0, 20.0],\n            ),\n        },\n    )\n\n    assert isinstance(es[\"log\"].ww.logical_types[\"value\"], Ordinal)\n    assert len(es[\"log\"].ww.logical_types[\"value\"].order) == 10\n    assert isinstance(es[\"log\"].ww.logical_types[\"priority_level\"], Ordinal)\n    assert len(es[\"log\"].ww.logical_types[\"priority_level\"].order) == 3\n    es.normalize_dataframe(\n        \"log\",\n        \"values_2\",\n        \"value_2\",\n        additional_columns=[\"priority_level\"],\n        copy_columns=[\"value\"],\n        make_time_index=False,\n    )\n\n    assert len(es.get_forward_relationships(\"log\")) == 3\n    assert es.get_forward_relationships(\"log\")[2].parent_dataframe.ww.name == \"values_2\"\n    assert \"priority_level\" in es[\"values_2\"].columns\n    assert \"value\" in es[\"values_2\"].columns\n    assert \"priority_level\" not in es[\"log\"].columns\n    assert \"value\" in es[\"log\"].columns\n    assert \"value_2\" in es[\"values_2\"].columns\n    assert isinstance(es[\"values_2\"].ww.logical_types[\"priority_level\"], Ordinal)\n    assert len(es[\"values_2\"].ww.logical_types[\"priority_level\"].order) == 3\n    assert isinstance(es[\"values_2\"].ww.logical_types[\"value\"], Ordinal)\n    assert len(es[\"values_2\"].ww.logical_types[\"value\"].order) == 10\n\n\ndef test_make_time_index_keeps_original_sorting():\n    trips = {\n        \"trip_id\": [999 - i for i in range(1000)],\n        \"flight_time\": [datetime(1997, 4, 1) for i in 
range(1000)],\n        \"flight_id\": [1 for i in range(350)] + [2 for i in range(650)],\n    }\n    order = [i for i in range(1000)]\n    df = pd.DataFrame.from_dict(trips)\n    es = EntitySet(\"flights\")\n    es.add_dataframe(\n        dataframe=df,\n        dataframe_name=\"trips\",\n        index=\"trip_id\",\n        time_index=\"flight_time\",\n    )\n    assert (es[\"trips\"][\"trip_id\"] == order).all()\n    es.normalize_dataframe(\n        base_dataframe_name=\"trips\",\n        new_dataframe_name=\"flights\",\n        index=\"flight_id\",\n        make_time_index=True,\n    )\n    assert (es[\"trips\"][\"trip_id\"] == order).all()\n\n\ndef test_normalize_dataframe_new_time_index(es):\n    new_time_index = \"value_time\"\n    es.normalize_dataframe(\n        \"log\",\n        \"values\",\n        \"value\",\n        make_time_index=True,\n        new_dataframe_time_index=new_time_index,\n    )\n\n    assert es[\"values\"].ww.time_index == new_time_index\n    assert new_time_index in es[\"values\"].columns\n    assert len(es[\"values\"].columns) == 2\n    df = es[\"values\"]\n    assert df[new_time_index].is_monotonic_increasing\n\n\ndef test_normalize_dataframe_same_index(es):\n    transactions_df = pd.DataFrame(\n        {\n            \"id\": [1, 2, 3],\n            \"transaction_time\": pd.date_range(start=\"10:00\", periods=3, freq=\"10s\"),\n            \"first_df_time\": [1, 2, 3],\n        },\n    )\n    es = EntitySet(\"example\")\n    es.add_dataframe(\n        dataframe_name=\"df\",\n        index=\"id\",\n        time_index=\"transaction_time\",\n        dataframe=transactions_df,\n    )\n\n    error_text = \"'index' must be different from the index column of the base dataframe\"\n    with pytest.raises(ValueError, match=error_text):\n        es.normalize_dataframe(\n            base_dataframe_name=\"df\",\n            new_dataframe_name=\"new_dataframe\",\n            index=\"id\",\n            make_time_index=True,\n        )\n\n\ndef 
test_secondary_time_index(es):\n    es.normalize_dataframe(\n        \"log\",\n        \"values\",\n        \"value\",\n        make_time_index=True,\n        make_secondary_time_index={\"datetime\": [\"comments\"]},\n        new_dataframe_time_index=\"value_time\",\n        new_dataframe_secondary_time_index=\"second_ti\",\n    )\n\n    assert isinstance(es[\"values\"].ww.logical_types[\"second_ti\"], Datetime)\n    assert es[\"values\"].ww.semantic_tags[\"second_ti\"] == set()\n    assert es[\"values\"].ww.metadata[\"secondary_time_index\"] == {\n        \"second_ti\": [\"comments\", \"second_ti\"],\n    }\n\n\ndef test_sizeof(es):\n    es.add_last_time_indexes()\n    total_size = 0\n    for df in es.dataframes:\n        total_size += df.__sizeof__()\n\n    assert es.__sizeof__() == total_size\n\n\ndef test_construct_without_id():\n    assert EntitySet().id is None\n\n\ndef test_repr_without_id():\n    match = \"Entityset: None\\n  DataFrames:\\n  Relationships:\\n    No relationships\"\n    assert repr(EntitySet()) == match\n\n\ndef test_getitem_without_id():\n    error_text = \"DataFrame test does not exist in entity set\"\n    with pytest.raises(KeyError, match=error_text):\n        EntitySet()[\"test\"]\n\n\ndef test_metadata_without_id():\n    es = EntitySet()\n    assert es.metadata.id is None\n\n\n@pytest.fixture\ndef datetime3():\n    return pd.DataFrame({\"id\": [0, 1, 2], \"ints\": [\"1\", \"2\", \"1\"]})\n\n\ndef test_datetime64_conversion(datetime3):\n    df = datetime3\n    df[\"time\"] = pd.Timestamp.now()\n    df[\"time\"] = df[\"time\"].dt.tz_localize(\"UTC\")\n\n    es = EntitySet(id=\"test\")\n    es.add_dataframe(\n        dataframe_name=\"test_dataframe\",\n        index=\"id\",\n        dataframe=df,\n        logical_types=None,\n    )\n    es[\"test_dataframe\"].ww.set_time_index(\"time\")\n    assert es[\"test_dataframe\"].ww.time_index == \"time\"\n\n\n@pytest.fixture\ndef index_df():\n    return pd.DataFrame(\n        {\n            
\"id\": [1, 2, 3, 4, 5, 6],\n            \"transaction_time\": pd.date_range(start=\"10:00\", periods=6, freq=\"10s\"),\n            \"first_dataframe_time\": [1, 2, 3, 5, 6, 6],\n        },\n    )\n\n\ndef test_same_index_values(index_df):\n    es = EntitySet(\"example\")\n\n    error_text = (\n        '\"id\" is already set as the index. An index cannot also be the time index.'\n    )\n    with pytest.raises(ValueError, match=error_text):\n        es.add_dataframe(\n            dataframe_name=\"dataframe\",\n            index=\"id\",\n            time_index=\"id\",\n            dataframe=index_df,\n            logical_types=None,\n        )\n\n    es.add_dataframe(\n        dataframe_name=\"dataframe\",\n        index=\"id\",\n        time_index=\"transaction_time\",\n        dataframe=index_df,\n        logical_types=None,\n    )\n\n    error_text = \"time_index and index cannot be the same value, first_dataframe_time\"\n    with pytest.raises(ValueError, match=error_text):\n        es.normalize_dataframe(\n            base_dataframe_name=\"dataframe\",\n            new_dataframe_name=\"new_dataframe\",\n            index=\"first_dataframe_time\",\n            make_time_index=True,\n        )\n\n\ndef test_use_time_index(index_df):\n    bad_ltypes = {\"transaction_time\": Datetime}\n    bad_semantic_tags = {\"transaction_time\": \"time_index\"}\n    logical_types = None\n\n    es = EntitySet()\n\n    error_text = re.escape(\n        \"Cannot add 'time_index' tag directly for column transaction_time. 
To set a column as the time index, use DataFrame.ww.set_time_index() instead.\",\n    )\n    with pytest.raises(ValueError, match=error_text):\n        es.add_dataframe(\n            dataframe_name=\"dataframe\",\n            index=\"id\",\n            logical_types=bad_ltypes,\n            semantic_tags=bad_semantic_tags,\n            dataframe=index_df,\n        )\n\n    es.add_dataframe(\n        dataframe_name=\"dataframe\",\n        index=\"id\",\n        time_index=\"transaction_time\",\n        logical_types=logical_types,\n        dataframe=index_df,\n    )\n\n\ndef test_normalize_with_datetime_time_index(es):\n    es.normalize_dataframe(\n        base_dataframe_name=\"customers\",\n        new_dataframe_name=\"cancel_reason\",\n        index=\"cancel_reason\",\n        make_time_index=False,\n        copy_columns=[\"signup_date\", \"upgrade_date\"],\n    )\n\n    assert isinstance(es[\"cancel_reason\"].ww.logical_types[\"signup_date\"], Datetime)\n    assert isinstance(es[\"cancel_reason\"].ww.logical_types[\"upgrade_date\"], Datetime)\n\n\ndef test_normalize_with_numeric_time_index(int_es):\n    int_es.normalize_dataframe(\n        base_dataframe_name=\"customers\",\n        new_dataframe_name=\"cancel_reason\",\n        index=\"cancel_reason\",\n        make_time_index=False,\n        copy_columns=[\"signup_date\", \"upgrade_date\"],\n    )\n\n    assert int_es[\"cancel_reason\"].ww.semantic_tags[\"signup_date\"] == {\"numeric\"}\n\n\ndef test_normalize_with_invalid_time_index(es):\n    error_text = \"Time index column must contain datetime or numeric values\"\n    with pytest.raises(TypeError, match=error_text):\n        es.normalize_dataframe(\n            base_dataframe_name=\"customers\",\n            new_dataframe_name=\"cancel_reason\",\n            index=\"cancel_reason\",\n            copy_columns=[\"upgrade_date\", \"favorite_quote\"],\n            make_time_index=\"favorite_quote\",\n        )\n\n\ndef test_entityset_init():\n    cards_df = 
pd.DataFrame({\"id\": [1, 2, 3, 4, 5]})\n    transactions_df = pd.DataFrame(\n        {\n            \"id\": [1, 2, 3, 4, 5, 6],\n            \"card_id\": [1, 2, 1, 3, 4, 5],\n            \"transaction_time\": [10, 12, 13, 20, 21, 20],\n            \"upgrade_date\": [51, 23, 45, 12, 22, 53],\n            \"fraud\": [True, False, False, False, True, True],\n        },\n    )\n    logical_types = {\"fraud\": \"boolean\", \"card_id\": \"integer\"}\n    dataframes = {\n        \"cards\": (cards_df.copy(), \"id\", None, {\"id\": \"Integer\"}),\n        \"transactions\": (\n            transactions_df.copy(),\n            \"id\",\n            \"transaction_time\",\n            logical_types,\n            None,\n            False,\n        ),\n    }\n    relationships = [(\"cards\", \"id\", \"transactions\", \"card_id\")]\n    es = EntitySet(id=\"fraud_data\", dataframes=dataframes, relationships=relationships)\n    assert es[\"transactions\"].ww.index == \"id\"\n    assert es[\"transactions\"].ww.time_index == \"transaction_time\"\n    es_copy = EntitySet(id=\"fraud_data\")\n    es_copy.add_dataframe(dataframe_name=\"cards\", dataframe=cards_df.copy(), index=\"id\")\n    es_copy.add_dataframe(\n        dataframe_name=\"transactions\",\n        dataframe=transactions_df.copy(),\n        index=\"id\",\n        logical_types=logical_types,\n        make_index=False,\n        time_index=\"transaction_time\",\n    )\n    es_copy.add_relationship(\"cards\", \"id\", \"transactions\", \"card_id\")\n\n    assert es[\"cards\"].ww == es_copy[\"cards\"].ww\n    assert es[\"transactions\"].ww == es_copy[\"transactions\"].ww\n\n\ndef test_add_interesting_values_specified_vals(es):\n    product_vals = [\"coke zero\", \"taco clock\"]\n    country_vals = [\"AL\", \"US\"]\n    interesting_values = {\n        \"product_id\": product_vals,\n        \"countrycode\": country_vals,\n    }\n    es.add_interesting_values(dataframe_name=\"log\", values=interesting_values)\n\n    assert 
es[\"log\"].ww[\"product_id\"].ww.metadata[\"interesting_values\"] == product_vals\n    assert es[\"log\"].ww[\"countrycode\"].ww.metadata[\"interesting_values\"] == country_vals\n\n\ndef test_add_interesting_values_vals_specified_without_dataframe_name(es):\n    interesting_values = {\n        \"countrycode\": [\"AL\", \"US\"],\n    }\n    error_msg = \"dataframe_name must be specified if values are provided\"\n    with pytest.raises(ValueError, match=error_msg):\n        es.add_interesting_values(values=interesting_values)\n\n\ndef test_add_interesting_values_single_dataframe(es):\n    es.add_interesting_values(dataframe_name=\"log\")\n\n    expected_vals = {\n        \"zipcode\": [\"02116\", \"02116-3899\", \"12345-6789\", \"1234567890\", \"0\"],\n        \"countrycode\": [\"US\", \"AL\", \"ALB\", \"USA\"],\n        \"subregioncode\": [\"US-AZ\", \"US-MT\", \"ZM-06\", \"UG-219\"],\n        \"priority_level\": [0, 1, 2],\n    }\n\n    for col in es[\"log\"].columns:\n        if col in expected_vals:\n            assert (\n                es[\"log\"].ww.columns[col].metadata.get(\"interesting_values\")\n                == expected_vals[col]\n            )\n        else:\n            assert es[\"log\"].ww.columns[col].metadata.get(\"interesting_values\") is None\n\n\ndef test_add_interesting_values_multiple_dataframes(es):\n    es.add_interesting_values()\n    expected_cols_with_vals = {\n        \"régions\": {\"language\"},\n        \"stores\": {},\n        \"products\": {\"department\"},\n        \"customers\": {\"cancel_reason\", \"engagement_level\"},\n        \"sessions\": {\"device_type\", \"device_name\"},\n        \"log\": {\"zipcode\", \"countrycode\", \"subregioncode\", \"priority_level\"},\n        \"cohorts\": {\"cohort_name\"},\n    }\n    for df_id, df in es.dataframe_dict.items():\n        expected_cols = expected_cols_with_vals[df_id]\n        for col in df.columns:\n            if col in expected_cols:\n                assert 
df.ww.columns[col].metadata.get(\"interesting_values\") is not None\n            else:\n                assert df.ww.columns[col].metadata.get(\"interesting_values\") is None\n\n\ndef test_add_interesting_values_verbose_output(caplog):\n    es = load_retail(nrows=200)\n    es[\"order_products\"].ww.set_types({\"quantity\": \"Categorical\"})\n    es[\"orders\"].ww.set_types({\"country\": \"Categorical\"})\n    logger = logging.getLogger(\"featuretools\")\n    logger.propagate = True\n    logger_es = logging.getLogger(\"featuretools.entityset\")\n    logger_es.propagate = True\n    es.add_interesting_values(verbose=True, max_values=10)\n    logger.propagate = False\n    logger_es.propagate = False\n    assert (\n        \"Column country: Marking United Kingdom as an interesting value\" in caplog.text\n    )\n    assert \"Column quantity: Marking 6 as an interesting value\" in caplog.text\n\n\ndef test_entityset_equality(es):\n    first_es = EntitySet()\n    second_es = EntitySet()\n    assert first_es == second_es\n\n    first_es.add_dataframe(\n        dataframe_name=\"customers\",\n        dataframe=es[\"customers\"].copy(),\n        index=\"id\",\n        time_index=\"signup_date\",\n        logical_types=es[\"customers\"].ww.logical_types,\n        semantic_tags=get_df_tags(es[\"customers\"]),\n    )\n    assert first_es != second_es\n\n    second_es.add_dataframe(\n        dataframe_name=\"sessions\",\n        dataframe=es[\"sessions\"].copy(),\n        index=\"id\",\n        logical_types=es[\"sessions\"].ww.logical_types,\n        semantic_tags=get_df_tags(es[\"sessions\"]),\n    )\n    assert first_es != second_es\n\n    first_es.add_dataframe(\n        dataframe_name=\"sessions\",\n        dataframe=es[\"sessions\"].copy(),\n        index=\"id\",\n        logical_types=es[\"sessions\"].ww.logical_types,\n        semantic_tags=get_df_tags(es[\"sessions\"]),\n    )\n    second_es.add_dataframe(\n        dataframe_name=\"customers\",\n        
dataframe=es[\"customers\"].copy(),\n        index=\"id\",\n        time_index=\"signup_date\",\n        logical_types=es[\"customers\"].ww.logical_types,\n        semantic_tags=get_df_tags(es[\"customers\"]),\n    )\n    assert first_es == second_es\n\n    first_es.add_relationship(\"customers\", \"id\", \"sessions\", \"customer_id\")\n    assert first_es != second_es\n    assert second_es != first_es\n\n    second_es.add_relationship(\"customers\", \"id\", \"sessions\", \"customer_id\")\n    assert first_es == second_es\n\n\ndef test_entityset_dataframe_dict_and_relationship_equality(es):\n    first_es = EntitySet()\n    second_es = EntitySet()\n\n    first_es.add_dataframe(\n        dataframe_name=\"sessions\",\n        dataframe=es[\"sessions\"].copy(),\n        index=\"id\",\n        logical_types=es[\"sessions\"].ww.logical_types,\n        semantic_tags=get_df_tags(es[\"sessions\"]),\n    )\n\n    # Tests if two entity sets are not equal if they have a different\n    # number of dataframes attached.\n    # first_es has 1 dataframe, second_es has 0 dataframes attached.\n    assert first_es != second_es\n\n    second_es.add_dataframe(\n        dataframe_name=\"customers\",\n        dataframe=es[\"customers\"].copy(),\n        index=\"id\",\n        logical_types=es[\"customers\"].ww.logical_types,\n        semantic_tags=get_df_tags(es[\"customers\"]),\n    )\n\n    # Tests if two entity sets are not equal if they have\n    # different dataframes attached.\n    # first_es has the sessions dataframe attached,\n    # second_es has the customers dataframe attached.\n    assert first_es != second_es\n\n    first_es.add_dataframe(\n        dataframe_name=\"customers\",\n        dataframe=es[\"customers\"].copy(),\n        index=\"id\",\n        logical_types=es[\"customers\"].ww.logical_types,\n        semantic_tags=get_df_tags(es[\"customers\"]),\n    )\n    first_es.add_dataframe(\n        dataframe_name=\"stores\",\n        dataframe=es[\"stores\"].copy(),\n     
   index=\"id\",\n        logical_types=es[\"stores\"].ww.logical_types,\n        semantic_tags=get_df_tags(es[\"stores\"]),\n    )\n    first_es.add_dataframe(\n        dataframe_name=\"régions\",\n        dataframe=es[\"régions\"].copy(),\n        index=\"id\",\n        logical_types=es[\"régions\"].ww.logical_types,\n        semantic_tags=get_df_tags(es[\"régions\"]),\n    )\n\n    second_es.add_dataframe(\n        dataframe_name=\"sessions\",\n        dataframe=es[\"sessions\"].copy(),\n        index=\"id\",\n        logical_types=es[\"sessions\"].ww.logical_types,\n        semantic_tags=get_df_tags(es[\"sessions\"]),\n    )\n    second_es.add_dataframe(\n        dataframe_name=\"stores\",\n        dataframe=es[\"stores\"].copy(),\n        index=\"id\",\n        logical_types=es[\"stores\"].ww.logical_types,\n        semantic_tags=get_df_tags(es[\"stores\"]),\n    )\n    second_es.add_dataframe(\n        dataframe_name=\"régions\",\n        dataframe=es[\"régions\"].copy(),\n        index=\"id\",\n        logical_types=es[\"régions\"].ww.logical_types,\n        semantic_tags=get_df_tags(es[\"régions\"]),\n    )\n\n    # Now the two entity sets should be equal,\n    # since they have the same dataframes.\n    assert first_es == second_es\n\n    first_es.add_relationship(\"customers\", \"id\", \"sessions\", \"customer_id\")\n    second_es.add_relationship(\"régions\", \"id\", \"stores\", \"région_id\")\n\n    # Test if two entity sets are not equal\n    # if they have different relationships.\n    assert first_es != second_es\n\n\ndef test_entityset_id_equality():\n    first_es = EntitySet(id=\"first\")\n    first_es_copy = EntitySet(id=\"first\")\n    second_es = EntitySet(id=\"second\")\n\n    assert first_es != second_es\n    assert first_es == first_es_copy\n\n\ndef test_entityset_time_type_equality():\n    first_es = EntitySet()\n    second_es = EntitySet()\n    assert first_es == second_es\n\n    first_es.time_type = \"numeric\"\n    assert first_es != 
second_es\n\n    second_es.time_type = Datetime\n    assert first_es != second_es\n\n    second_es.time_type = \"numeric\"\n    assert first_es == second_es\n\n\ndef test_entityset_deep_equality(es):\n    first_es = EntitySet()\n    second_es = EntitySet()\n\n    first_es.add_dataframe(\n        dataframe_name=\"customers\",\n        dataframe=es[\"customers\"].copy(),\n        index=\"id\",\n        time_index=\"signup_date\",\n        logical_types=es[\"customers\"].ww.logical_types,\n        semantic_tags=get_df_tags(es[\"customers\"]),\n    )\n    first_es.add_dataframe(\n        dataframe_name=\"sessions\",\n        dataframe=es[\"sessions\"].copy(),\n        index=\"id\",\n        logical_types=es[\"sessions\"].ww.logical_types,\n        semantic_tags=get_df_tags(es[\"sessions\"]),\n    )\n\n    second_es.add_dataframe(\n        dataframe_name=\"sessions\",\n        dataframe=es[\"sessions\"].copy(),\n        index=\"id\",\n        logical_types=es[\"sessions\"].ww.logical_types,\n        semantic_tags=get_df_tags(es[\"sessions\"]),\n    )\n    second_es.add_dataframe(\n        dataframe_name=\"customers\",\n        dataframe=es[\"customers\"].copy(),\n        index=\"id\",\n        time_index=\"signup_date\",\n        logical_types=es[\"customers\"].ww.logical_types,\n        semantic_tags=get_df_tags(es[\"customers\"]),\n    )\n\n    assert first_es.__eq__(second_es, deep=False)\n    assert first_es.__eq__(second_es, deep=True)\n\n    # Woodwork metadata only gets included in deep equality check\n    first_es[\"sessions\"].ww.metadata[\"created_by\"] = \"user0\"\n\n    assert first_es.__eq__(second_es, deep=False)\n    assert not first_es.__eq__(second_es, deep=True)\n\n    second_es[\"sessions\"].ww.metadata[\"created_by\"] = \"user0\"\n\n    assert first_es.__eq__(second_es, deep=False)\n    assert first_es.__eq__(second_es, deep=True)\n\n    updated_df = first_es[\"customers\"].loc[[2, 0], :]\n    first_es.replace_dataframe(\"customers\", updated_df)\n\n 
   assert first_es.__eq__(second_es, deep=False)\n    assert not first_es.__eq__(second_es, deep=True)\n\n\ndef test_deepcopy_entityset(make_es):\n    # Uses make_es since the es fixture uses deepcopy\n    copied_es = copy.deepcopy(make_es)\n\n    assert copied_es == make_es\n    assert copied_es is not make_es\n\n    for df_name in make_es.dataframe_dict.keys():\n        original_df = make_es[df_name]\n        new_df = copied_es[df_name]\n\n        assert new_df.ww.schema == original_df.ww.schema\n        assert new_df.ww._schema is not original_df.ww._schema\n\n        pd.testing.assert_frame_equal(new_df, original_df)\n        assert new_df is not original_df\n\n\ndef test_deepcopy_entityset_woodwork_changes(es):\n    copied_es = copy.deepcopy(es)\n\n    assert copied_es == es\n    assert copied_es is not es\n\n    copied_es[\"products\"].ww.add_semantic_tags({\"id\": \"new_tag\"})\n\n    assert copied_es[\"products\"].ww.semantic_tags[\"id\"] == {\"index\", \"new_tag\"}\n    assert es[\"products\"].ww.semantic_tags[\"id\"] == {\"index\"}\n    assert copied_es != es\n\n\ndef test_deepcopy_entityset_featuretools_changes(es):\n    copied_es = copy.deepcopy(es)\n\n    assert copied_es == es\n    assert copied_es is not es\n\n    copied_es.set_secondary_time_index(\n        \"customers\",\n        {\"upgrade_date\": [\"engagement_level\"]},\n    )\n    assert copied_es[\"customers\"].ww.metadata[\"secondary_time_index\"] == {\n        \"upgrade_date\": [\"engagement_level\", \"upgrade_date\"],\n    }\n    assert es[\"customers\"].ww.metadata[\"secondary_time_index\"] == {\n        \"cancel_date\": [\"cancel_reason\", \"cancel_date\"],\n    }\n\n\ndef test_es__getstate__key_unique(es):\n    assert not hasattr(es, WW_SCHEMA_KEY)\n\n\ndef test_es_pickling(es):\n    pkl = pickle.dumps(es)\n    unpickled = pickle.loads(pkl)\n\n    assert es.__eq__(unpickled, deep=True)\n    assert not hasattr(unpickled, WW_SCHEMA_KEY)\n\n\ndef test_empty_es_pickling():\n    es = 
EntitySet(id=\"empty\")\n    pkl = pickle.dumps(es)\n    unpickled = pickle.loads(pkl)\n\n    assert es.__eq__(unpickled, deep=True)\n\n\n@patch(\"featuretools.entityset.entityset.EntitySet.add_dataframe\")\ndef test_setitem(add_dataframe):\n    es = EntitySet()\n    df = pd.DataFrame()\n    es[\"new_df\"] = df\n    assert add_dataframe.called\n    add_dataframe.assert_called_with(dataframe=df, dataframe_name=\"new_df\")\n\n\ndef test_latlong_nan_normalization(latlong_df):\n    latlong_df.ww.init(\n        name=\"latLong\",\n        index=\"idx\",\n        logical_types={\"latLong\": \"LatLong\"},\n    )\n\n    dataframes = {\"latLong\": (latlong_df,)}\n\n    relationships = []\n\n    es = EntitySet(\"latlong-test\", dataframes, relationships)\n\n    normalized_df = es[\"latLong\"]\n\n    expected_df = pd.DataFrame(\n        {\"idx\": [0, 1, 2], \"latLong\": [(np.nan, np.nan), (1, 2), (np.nan, np.nan)]},\n    )\n\n    pd.testing.assert_frame_equal(normalized_df, expected_df)\n\n\ndef test_latlong_nan_normalization_add_dataframe(latlong_df):\n    latlong_df.ww.init(\n        name=\"latLong\",\n        index=\"idx\",\n        logical_types={\"latLong\": \"LatLong\"},\n    )\n\n    es = EntitySet(\"latlong-test\")\n\n    es.add_dataframe(latlong_df)\n\n    normalized_df = es[\"latLong\"]\n\n    expected_df = pd.DataFrame(\n        {\"idx\": [0, 1, 2], \"latLong\": [(np.nan, np.nan), (1, 2), (np.nan, np.nan)]},\n    )\n\n    pd.testing.assert_frame_equal(normalized_df, expected_df)\n"
  },
  {
    "path": "featuretools/tests/entityset_tests/test_es_metadata.py",
    "content": "import pandas as pd\nimport pytest\n\nfrom featuretools import EntitySet\nfrom featuretools.tests.testing_utils import backward_path, forward_path\n\n\ndef test_cannot_re_add_relationships_that_already_exists(es):\n    before_len = len(es.relationships)\n    es.add_relationship(relationship=es.relationships[0])\n    after_len = len(es.relationships)\n    assert before_len == after_len\n\n\ndef test_add_relationships_convert_type(es):\n    for r in es.relationships:\n        assert r.parent_dataframe.ww.index == r._parent_column_name\n        assert \"foreign_key\" in r.child_column.ww.semantic_tags\n        assert r.child_column.ww.logical_type == r.parent_column.ww.logical_type\n\n\ndef test_get_forward_dataframes(es):\n    dataframes = es.get_forward_dataframes(\"log\")\n    path_to_sessions = forward_path(es, [\"log\", \"sessions\"])\n    path_to_products = forward_path(es, [\"log\", \"products\"])\n    assert list(dataframes) == [\n        (\"sessions\", path_to_sessions),\n        (\"products\", path_to_products),\n    ]\n\n\ndef test_get_backward_dataframes(es):\n    dataframes = es.get_backward_dataframes(\"customers\")\n    path_to_sessions = backward_path(es, [\"customers\", \"sessions\"])\n    assert list(dataframes) == [(\"sessions\", path_to_sessions)]\n\n\ndef test_get_forward_dataframes_deep(es):\n    dataframes = es.get_forward_dataframes(\"log\", deep=True)\n    path_to_sessions = forward_path(es, [\"log\", \"sessions\"])\n    path_to_products = forward_path(es, [\"log\", \"products\"])\n    path_to_customers = forward_path(es, [\"log\", \"sessions\", \"customers\"])\n    path_to_regions = forward_path(es, [\"log\", \"sessions\", \"customers\", \"régions\"])\n    path_to_cohorts = forward_path(es, [\"log\", \"sessions\", \"customers\", \"cohorts\"])\n    assert list(dataframes) == [\n        (\"sessions\", path_to_sessions),\n        (\"customers\", path_to_customers),\n        (\"cohorts\", path_to_cohorts),\n        (\"régions\", 
path_to_regions),\n        (\"products\", path_to_products),\n    ]\n\n\ndef test_get_backward_dataframes_deep(es):\n    dataframes = es.get_backward_dataframes(\"customers\", deep=True)\n    path_to_log = backward_path(es, [\"customers\", \"sessions\", \"log\"])\n    path_to_sessions = backward_path(es, [\"customers\", \"sessions\"])\n    assert list(dataframes) == [(\"sessions\", path_to_sessions), (\"log\", path_to_log)]\n\n\ndef test_get_forward_relationships(es):\n    relationships = es.get_forward_relationships(\"log\")\n    assert len(relationships) == 2\n    assert relationships[0]._parent_dataframe_name == \"sessions\"\n    assert relationships[0]._child_dataframe_name == \"log\"\n    assert relationships[1]._parent_dataframe_name == \"products\"\n    assert relationships[1]._child_dataframe_name == \"log\"\n\n    relationships = es.get_forward_relationships(\"sessions\")\n    assert len(relationships) == 1\n    assert relationships[0]._parent_dataframe_name == \"customers\"\n    assert relationships[0]._child_dataframe_name == \"sessions\"\n\n\ndef test_get_backward_relationships(es):\n    relationships = es.get_backward_relationships(\"sessions\")\n    assert len(relationships) == 1\n    assert relationships[0]._parent_dataframe_name == \"sessions\"\n    assert relationships[0]._child_dataframe_name == \"log\"\n\n    relationships = es.get_backward_relationships(\"customers\")\n    assert len(relationships) == 1\n    assert relationships[0]._parent_dataframe_name == \"customers\"\n    assert relationships[0]._child_dataframe_name == \"sessions\"\n\n\ndef test_find_forward_paths(es):\n    paths = list(es.find_forward_paths(\"log\", \"customers\"))\n    assert len(paths) == 1\n\n    path = paths[0]\n\n    assert len(path) == 2\n    assert path[0]._child_dataframe_name == \"log\"\n    assert path[0]._parent_dataframe_name == \"sessions\"\n    assert path[1]._child_dataframe_name == \"sessions\"\n    assert path[1]._parent_dataframe_name == 
\"customers\"\n\n\ndef test_find_forward_paths_multiple_paths(diamond_es):\n    paths = list(diamond_es.find_forward_paths(\"transactions\", \"regions\"))\n    assert len(paths) == 2\n\n    path1, path2 = paths\n\n    r1, r2 = path1\n    assert r1._child_dataframe_name == \"transactions\"\n    assert r1._parent_dataframe_name == \"stores\"\n    assert r2._child_dataframe_name == \"stores\"\n    assert r2._parent_dataframe_name == \"regions\"\n\n    r1, r2 = path2\n    assert r1._child_dataframe_name == \"transactions\"\n    assert r1._parent_dataframe_name == \"customers\"\n    assert r2._child_dataframe_name == \"customers\"\n    assert r2._parent_dataframe_name == \"regions\"\n\n\ndef test_find_forward_paths_multiple_relationships(games_es):\n    paths = list(games_es.find_forward_paths(\"games\", \"teams\"))\n    assert len(paths) == 2\n\n    path1, path2 = paths\n    assert len(path1) == 1\n    assert len(path2) == 1\n    r1 = path1[0]\n    r2 = path2[0]\n\n    assert r1._child_dataframe_name == \"games\"\n    assert r2._child_dataframe_name == \"games\"\n    assert r1._parent_dataframe_name == \"teams\"\n    assert r2._parent_dataframe_name == \"teams\"\n\n    assert r1._child_column_name == \"home_team_id\"\n    assert r2._child_column_name == \"away_team_id\"\n    assert r1._parent_column_name == \"id\"\n    assert r2._parent_column_name == \"id\"\n\n\n@pytest.fixture\ndef employee_df():\n    return pd.DataFrame({\"id\": [0], \"manager_id\": [0]})\n\n\ndef test_find_forward_paths_ignores_loops(employee_df):\n    dataframes = {\"employees\": (employee_df, \"id\")}\n    relationships = [(\"employees\", \"id\", \"employees\", \"manager_id\")]\n    es = EntitySet(dataframes=dataframes, relationships=relationships)\n\n    paths = list(es.find_forward_paths(\"employees\", \"employees\"))\n    assert len(paths) == 1\n    assert paths[0] == []\n\n\ndef test_find_backward_paths(es):\n    paths = list(es.find_backward_paths(\"customers\", \"log\"))\n    assert 
len(paths) == 1\n\n    path = paths[0]\n\n    assert len(path) == 2\n    assert path[0]._child_dataframe_name == \"sessions\"\n    assert path[0]._parent_dataframe_name == \"customers\"\n    assert path[1]._child_dataframe_name == \"log\"\n    assert path[1]._parent_dataframe_name == \"sessions\"\n\n\ndef test_find_backward_paths_multiple_paths(diamond_es):\n    paths = list(diamond_es.find_backward_paths(\"regions\", \"transactions\"))\n    assert len(paths) == 2\n\n    path1, path2 = paths\n\n    r1, r2 = path1\n    assert r1._child_dataframe_name == \"stores\"\n    assert r1._parent_dataframe_name == \"regions\"\n    assert r2._child_dataframe_name == \"transactions\"\n    assert r2._parent_dataframe_name == \"stores\"\n\n    r1, r2 = path2\n    assert r1._child_dataframe_name == \"customers\"\n    assert r1._parent_dataframe_name == \"regions\"\n    assert r2._child_dataframe_name == \"transactions\"\n    assert r2._parent_dataframe_name == \"customers\"\n\n\ndef test_find_backward_paths_multiple_relationships(games_es):\n    paths = list(games_es.find_backward_paths(\"teams\", \"games\"))\n    assert len(paths) == 2\n\n    path1, path2 = paths\n    assert len(path1) == 1\n    assert len(path2) == 1\n    r1 = path1[0]\n    r2 = path2[0]\n\n    assert r1._child_dataframe_name == \"games\"\n    assert r2._child_dataframe_name == \"games\"\n    assert r1._parent_dataframe_name == \"teams\"\n    assert r2._parent_dataframe_name == \"teams\"\n\n    assert r1._child_column_name == \"home_team_id\"\n    assert r2._child_column_name == \"away_team_id\"\n    assert r1._parent_column_name == \"id\"\n    assert r2._parent_column_name == \"id\"\n\n\ndef test_has_unique_path(diamond_es):\n    assert diamond_es.has_unique_forward_path(\"customers\", \"regions\")\n    assert not diamond_es.has_unique_forward_path(\"transactions\", \"regions\")\n\n\ndef test_raise_key_error_missing_dataframe(es):\n    error_text = \"DataFrame testing does not exist in ecommerce\"\n    with 
pytest.raises(KeyError, match=error_text):\n        es[\"testing\"]\n\n    es_without_id = EntitySet()\n    error_text = \"DataFrame testing does not exist in entity set\"\n    with pytest.raises(KeyError, match=error_text):\n        es_without_id[\"testing\"]\n\n\ndef test_add_parent_not_index_column(es):\n    error_text = \"Parent column 'language' is not the index of dataframe régions\"\n    with pytest.raises(AttributeError, match=error_text):\n        es.add_relationship(\"régions\", \"language\", \"customers\", \"région_id\")\n"
  },
  {
    "path": "featuretools/tests/entityset_tests/test_last_time_index.py",
    "content": "from datetime import datetime\n\nimport pandas as pd\nimport pytest\nfrom woodwork.logical_types import Categorical, Datetime, Integer\n\nfrom featuretools.entityset.entityset import LTI_COLUMN_NAME\n\n\n@pytest.fixture\ndef values_es(es):\n    es.normalize_dataframe(\n        \"log\",\n        \"values\",\n        \"value\",\n        make_time_index=True,\n        new_dataframe_time_index=\"value_time\",\n    )\n    return es\n\n\n@pytest.fixture\ndef true_values_lti():\n    true_values_lti = pd.Series(\n        [\n            datetime(2011, 4, 10, 10, 41, 0),\n            datetime(2011, 4, 9, 10, 31, 9),\n            datetime(2011, 4, 9, 10, 31, 18),\n            datetime(2011, 4, 9, 10, 31, 27),\n            datetime(2011, 4, 10, 10, 40, 1),\n            datetime(2011, 4, 10, 10, 41, 3),\n            datetime(2011, 4, 9, 10, 30, 12),\n            datetime(2011, 4, 10, 10, 41, 6),\n            datetime(2011, 4, 9, 10, 30, 18),\n            datetime(2011, 4, 9, 10, 30, 24),\n            datetime(2011, 4, 10, 11, 10, 3),\n        ],\n    )\n    return true_values_lti\n\n\n@pytest.fixture\ndef true_sessions_lti():\n    sessions_lti = pd.Series(\n        [\n            datetime(2011, 4, 9, 10, 30, 24),\n            datetime(2011, 4, 9, 10, 31, 27),\n            datetime(2011, 4, 9, 10, 40, 0),\n            datetime(2011, 4, 10, 10, 40, 1),\n            datetime(2011, 4, 10, 10, 41, 6),\n            datetime(2011, 4, 10, 11, 10, 3),\n        ],\n    )\n    return sessions_lti\n\n\n@pytest.fixture\ndef wishlist_df():\n    wishlist_df = pd.DataFrame(\n        {\n            \"session_id\": [0, 1, 2, 2, 3, 4, 5],\n            \"datetime\": [\n                datetime(2011, 4, 9, 10, 30, 15),\n                datetime(2011, 4, 9, 10, 31, 30),\n                datetime(2011, 4, 9, 10, 30, 30),\n                datetime(2011, 4, 9, 10, 35, 30),\n                datetime(2011, 4, 10, 10, 41, 0),\n                datetime(2011, 4, 10, 10, 39, 59),\n            
    datetime(2011, 4, 10, 11, 10, 2),\n            ],\n            \"product_id\": [\n                \"coke zero\",\n                \"taco clock\",\n                \"coke zero\",\n                \"car\",\n                \"toothpaste\",\n                \"brown bag\",\n                \"coke zero\",\n            ],\n        },\n    )\n    return wishlist_df\n\n\n@pytest.fixture\ndef extra_session_df(es):\n    row_values = {\"customer_id\": 2, \"device_name\": \"PC\", \"device_type\": 0, \"id\": 6}\n    row = pd.DataFrame(row_values, index=pd.Index([6], name=\"id\"))\n    df = es[\"sessions\"]\n    df = pd.concat([df, row]).sort_index()\n    return df\n\n\nclass TestLastTimeIndex(object):\n    def test_leaf(self, es):\n        es.add_last_time_indexes()\n        log = es[\"log\"]\n        lti_name = log.ww.metadata.get(\"last_time_index\")\n\n        assert lti_name == LTI_COLUMN_NAME\n        assert len(log[lti_name]) == 17\n\n        log_df = log\n\n        for v1, v2 in zip(log_df[lti_name], log_df[\"datetime\"]):\n            assert (pd.isnull(v1) and pd.isnull(v2)) or v1 == v2\n\n    def test_leaf_no_time_index(self, es):\n        es.add_last_time_indexes()\n        stores = es[\"stores\"]\n        true_lti = pd.Series([None for x in range(6)], dtype=\"datetime64[ns]\")\n\n        assert len(true_lti) == len(stores[LTI_COLUMN_NAME])\n\n        stores_lti = stores[LTI_COLUMN_NAME]\n\n        for v1, v2 in zip(stores_lti, true_lti):\n            assert (pd.isnull(v1) and pd.isnull(v2)) or v1 == v2\n\n    # TODO: possible issue with either normalize_dataframe or add_last_time_indexes\n    def test_parent(self, values_es, true_values_lti):\n        # test dataframe with time index and all instances in child dataframe\n        values_es.add_last_time_indexes()\n        values = values_es[\"values\"]\n        lti_name = values.ww.metadata.get(\"last_time_index\")\n        assert len(values[lti_name]) == 10\n        sorted_lti = values[lti_name].sort_index()\n     
   for v1, v2 in zip(sorted_lti, true_values_lti):\n            assert (pd.isnull(v1) and pd.isnull(v2)) or v1 == v2\n\n    def test_parent_some_missing(self, values_es, true_values_lti):\n        # test dataframe with time index and not all instances have children\n        values = values_es[\"values\"]\n\n        # add extra value instance with no children\n        row_values = {\n            \"value\": [21.0],\n            \"value_time\": [pd.Timestamp(\"2011-04-10 11:10:02\")],\n        }\n        # make sure index doesn't have same name as column to suppress pandas warning\n        row = pd.DataFrame(row_values, index=pd.Index([21]))\n        df = pd.concat([values, row])\n        df = df.sort_values(by=\"value\")\n        df.index.name = None\n\n        values_es.replace_dataframe(dataframe_name=\"values\", df=df)\n        values_es.add_last_time_indexes()\n        # lti value should default to instance's time index\n        true_values_lti[10] = pd.Timestamp(\"2011-04-10 11:10:02\")\n        true_values_lti[11] = pd.Timestamp(\"2011-04-10 11:10:03\")\n\n        values = values_es[\"values\"]\n        lti_name = values.ww.metadata.get(\"last_time_index\")\n        assert len(values[lti_name]) == 11\n        sorted_lti = values[lti_name].sort_index()\n        for v1, v2 in zip(sorted_lti, true_values_lti):\n            assert (pd.isnull(v1) and pd.isnull(v2)) or v1 == v2\n\n    def test_parent_no_time_index(self, es, true_sessions_lti):\n        # test dataframe without time index and all instances have children\n        es.add_last_time_indexes()\n        sessions = es[\"sessions\"]\n        lti_name = sessions.ww.metadata.get(\"last_time_index\")\n        assert len(sessions[lti_name]) == 6\n        sorted_lti = sessions[lti_name].sort_index()\n        for v1, v2 in zip(sorted_lti, true_sessions_lti):\n            assert (pd.isnull(v1) and pd.isnull(v2)) or v1 == v2\n\n    def test_parent_no_time_index_missing(\n        self,\n        es,\n        
extra_session_df,\n        true_sessions_lti,\n    ):\n        # test dataframe without time index and not all instances have children\n\n        # add session instance with no associated log instances\n        es.replace_dataframe(dataframe_name=\"sessions\", df=extra_session_df)\n        es.add_last_time_indexes()\n        # since sessions has no time index, default value is NaT\n        true_sessions_lti[6] = pd.NaT\n        sessions = es[\"sessions\"]\n\n        lti_name = sessions.ww.metadata.get(\"last_time_index\")\n        assert len(sessions[lti_name]) == 7\n        sorted_lti = sessions[lti_name].sort_index()\n        for v1, v2 in zip(sorted_lti, true_sessions_lti):\n            assert (pd.isnull(v1) and pd.isnull(v2)) or v1 == v2\n\n    def test_multiple_children(self, es, wishlist_df, true_sessions_lti):\n        # test all instances in both children\n        logical_types = {\n            \"session_id\": Integer,\n            \"datetime\": Datetime,\n            \"product_id\": Categorical,\n        }\n        es.add_dataframe(\n            dataframe_name=\"wishlist_log\",\n            dataframe=wishlist_df,\n            index=\"id\",\n            make_index=True,\n            time_index=\"datetime\",\n            logical_types=logical_types,\n        )\n        es.add_relationship(\"sessions\", \"id\", \"wishlist_log\", \"session_id\")\n        es.add_last_time_indexes()\n        sessions = es[\"sessions\"]\n        # wishlist df has more recent events for two session ids\n        true_sessions_lti[1] = pd.Timestamp(\"2011-4-9 10:31:30\")\n        true_sessions_lti[3] = pd.Timestamp(\"2011-4-10 10:41:00\")\n\n        lti_name = sessions.ww.metadata.get(\"last_time_index\")\n        assert len(sessions[lti_name]) == 6\n        sorted_lti = sessions[lti_name].sort_index()\n        for v1, v2 in zip(sorted_lti, true_sessions_lti):\n            assert (pd.isnull(v1) and pd.isnull(v2)) or v1 == v2\n\n    def test_multiple_children_right_missing(self, es, 
wishlist_df, true_sessions_lti):\n        # test all instances in left child\n\n        # drop wishlist instance related to id 3 so it's only in log\n        wishlist_df.drop(4, inplace=True)\n        logical_types = {\n            \"session_id\": Integer,\n            \"datetime\": Datetime,\n            \"product_id\": Categorical,\n        }\n        es.add_dataframe(\n            dataframe_name=\"wishlist_log\",\n            dataframe=wishlist_df,\n            index=\"id\",\n            make_index=True,\n            time_index=\"datetime\",\n            logical_types=logical_types,\n        )\n        es.add_relationship(\"sessions\", \"id\", \"wishlist_log\", \"session_id\")\n        es.add_last_time_indexes()\n        sessions = es[\"sessions\"]\n\n        # now only session id 1 has newer event in wishlist_log\n        true_sessions_lti[1] = pd.Timestamp(\"2011-4-9 10:31:30\")\n\n        lti_name = sessions.ww.metadata.get(\"last_time_index\")\n        assert len(sessions[lti_name]) == 6\n        sorted_lti = sessions[lti_name].sort_index()\n        for v1, v2 in zip(sorted_lti, true_sessions_lti):\n            assert (pd.isnull(v1) and pd.isnull(v2)) or v1 == v2\n\n    def test_multiple_children_left_missing(\n        self,\n        es,\n        extra_session_df,\n        wishlist_df,\n        true_sessions_lti,\n    ):\n        # add row to sessions so not all session instances are in log\n        es.replace_dataframe(dataframe_name=\"sessions\", df=extra_session_df)\n\n        # add row to wishlist df so new session instance is in wishlist_log\n        row_values = {\n            \"session_id\": [6],\n            \"datetime\": [pd.Timestamp(\"2011-04-11 11:11:11\")],\n            \"product_id\": [\"toothpaste\"],\n        }\n        row = pd.DataFrame(row_values, index=pd.RangeIndex(start=7, stop=8))\n        df = pd.concat([wishlist_df, row])\n        logical_types = {\n            \"session_id\": Integer,\n            \"datetime\": Datetime,\n           
 \"product_id\": Categorical,\n        }\n        es.add_dataframe(\n            dataframe_name=\"wishlist_log\",\n            dataframe=df,\n            index=\"id\",\n            make_index=True,\n            time_index=\"datetime\",\n            logical_types=logical_types,\n        )\n        es.add_relationship(\"sessions\", \"id\", \"wishlist_log\", \"session_id\")\n        es.add_last_time_indexes()\n\n        # test all instances in right child\n        sessions = es[\"sessions\"]\n\n        # now wishlist_log has newer events for 3 session ids\n        true_sessions_lti[1] = pd.Timestamp(\"2011-4-9 10:31:30\")\n        true_sessions_lti[3] = pd.Timestamp(\"2011-4-10 10:41:00\")\n        true_sessions_lti[6] = pd.Timestamp(\"2011-04-11 11:11:11\")\n\n        lti_name = sessions.ww.metadata.get(\"last_time_index\")\n        assert len(sessions[lti_name]) == 7\n        sorted_lti = sessions[lti_name].sort_index()\n        for v1, v2 in zip(sorted_lti, true_sessions_lti):\n            assert (pd.isnull(v1) and pd.isnull(v2)) or v1 == v2\n\n    def test_multiple_children_all_combined(\n        self,\n        es,\n        extra_session_df,\n        wishlist_df,\n        true_sessions_lti,\n    ):\n        # add row to sessions so not all session instances are in log\n        es.replace_dataframe(dataframe_name=\"sessions\", df=extra_session_df)\n\n        # add row to wishlist_log so extra session has child instance\n        row_values = {\n            \"session_id\": [6],\n            \"datetime\": [pd.Timestamp(\"2011-04-11 11:11:11\")],\n            \"product_id\": [\"toothpaste\"],\n        }\n        row = pd.DataFrame(row_values, index=pd.RangeIndex(start=7, stop=8))\n        df = pd.concat([wishlist_df, row])\n\n        # drop instance 4 so wishlist_log does not have session id 3 instance\n        df.drop(4, inplace=True)\n        logical_types = {\n            \"session_id\": Integer,\n            \"datetime\": Datetime,\n            \"product_id\": 
Categorical,\n        }\n        es.add_dataframe(\n            dataframe_name=\"wishlist_log\",\n            dataframe=df,\n            index=\"id\",\n            make_index=True,\n            time_index=\"datetime\",\n            logical_types=logical_types,\n        )\n        es.add_relationship(\"sessions\", \"id\", \"wishlist_log\", \"session_id\")\n        es.add_last_time_indexes()\n\n        # test some instances in right, some in left, all when combined\n        sessions = es[\"sessions\"]\n\n        # wishlist has newer events for 2 sessions\n        true_sessions_lti[1] = pd.Timestamp(\"2011-4-9 10:31:30\")\n        true_sessions_lti[6] = pd.Timestamp(\"2011-04-11 11:11:11\")\n\n        lti_name = sessions.ww.metadata.get(\"last_time_index\")\n        assert len(sessions[lti_name]) == 7\n        sorted_lti = sessions[lti_name].sort_index()\n        for v1, v2 in zip(sorted_lti, true_sessions_lti):\n            assert (pd.isnull(v1) and pd.isnull(v2)) or v1 == v2\n\n    def test_multiple_children_both_missing(\n        self,\n        es,\n        extra_session_df,\n        wishlist_df,\n        true_sessions_lti,\n    ):\n        # test all instances in neither child\n        sessions = es[\"sessions\"]\n\n        logical_types = {\n            \"session_id\": Integer,\n            \"datetime\": Datetime,\n            \"product_id\": Categorical,\n        }\n        # add row to sessions to create session with no events\n        es.replace_dataframe(dataframe_name=\"sessions\", df=extra_session_df)\n\n        es.add_dataframe(\n            dataframe_name=\"wishlist_log\",\n            dataframe=wishlist_df,\n            index=\"id\",\n            make_index=True,\n            time_index=\"datetime\",\n            logical_types=logical_types,\n        )\n        es.add_relationship(\"sessions\", \"id\", \"wishlist_log\", \"session_id\")\n        es.add_last_time_indexes()\n        sessions = es[\"sessions\"]\n\n        # wishlist has 2 newer events and 
one is NaT\n        true_sessions_lti[1] = pd.Timestamp(\"2011-4-9 10:31:30\")\n        true_sessions_lti[3] = pd.Timestamp(\"2011-4-10 10:41:00\")\n        true_sessions_lti[6] = pd.NaT\n\n        lti_name = sessions.ww.metadata.get(\"last_time_index\")\n        assert len(sessions[lti_name]) == 7\n        sorted_lti = sessions[lti_name].sort_index()\n        for v1, v2 in zip(sorted_lti, true_sessions_lti):\n            assert (pd.isnull(v1) and pd.isnull(v2)) or v1 == v2\n\n    def test_grandparent(self, es):\n        # test sorting by time works correctly across several generations\n        df = es[\"log\"]\n\n        # For one user, change a log event to be newer than the user's normal\n        # last time index. This event should be from a different session than\n        # the current last time index.\n        df.loc[5, \"datetime\"] = pd.Timestamp(\"2011-4-09 10:40:01\")\n        df = (\n            df.set_index(\"datetime\", append=True)\n            .sort_index(level=[1, 0], kind=\"mergesort\")\n            .reset_index(\"datetime\", drop=False)\n        )\n        es.replace_dataframe(dataframe_name=\"log\", df=df)\n        es.add_last_time_indexes()\n        customers = es[\"customers\"]\n\n        true_customers_lti = pd.Series(\n            [\n                datetime(2011, 4, 9, 10, 40, 1),\n                datetime(2011, 4, 10, 10, 41, 6),\n                datetime(2011, 4, 10, 11, 10, 3),\n            ],\n        )\n\n        lti_name = customers.ww.metadata.get(\"last_time_index\")\n        assert len(customers[lti_name]) == 3\n        sorted_lti = customers.sort_values(\"id\")[lti_name]\n        for v1, v2 in zip(sorted_lti, true_customers_lti):\n            assert (pd.isnull(v1) and pd.isnull(v2)) or v1 == v2\n"
  },
  {
    "path": "featuretools/tests/entityset_tests/test_plotting.py",
    "content": "import os\nimport re\n\nimport graphviz\nimport pandas as pd\nimport pytest\n\nfrom featuretools import EntitySet\n\n\n@pytest.fixture\ndef simple_es():\n    es = EntitySet(\"test\")\n    df = pd.DataFrame({\"foo\": [1]})\n    es.add_dataframe(df, dataframe_name=\"test\", index=\"foo\")\n    return es\n\n\ndef test_returns_digraph_object(es):\n    graph = es.plot()\n\n    assert isinstance(graph, graphviz.Digraph)\n\n\ndef test_saving_png_file(es, tmp_path):\n    output_path = str(tmp_path.joinpath(\"test1.png\"))\n\n    es.plot(to_file=output_path)\n\n    assert os.path.isfile(output_path)\n    os.remove(output_path)\n\n\ndef test_missing_file_extension(es):\n    output_path = \"test1\"\n\n    with pytest.raises(ValueError) as excinfo:\n        es.plot(to_file=output_path)\n\n    assert str(excinfo.value).startswith(\"Please use a file extension\")\n\n\ndef test_invalid_format(es):\n    output_path = \"test1.xzy\"\n\n    with pytest.raises(ValueError) as excinfo:\n        es.plot(to_file=output_path)\n\n    assert str(excinfo.value).startswith(\"Unknown format\")\n\n\ndef test_multiple_rows(es):\n    plot_ = es.plot()\n    result = re.findall(r\"\\((\\d+\\srows?)\\)\", plot_.source)\n    expected = [\"{} rows\".format(str(i.shape[0])) for i in es.dataframes]\n    assert result == expected\n\n\ndef test_single_row(simple_es):\n    plot_ = simple_es.plot()\n    result = re.findall(r\"\\((\\d+\\srows?)\\)\", plot_.source)\n    expected = [\"1 row\"]\n    assert result == expected\n"
  },
  {
    "path": "featuretools/tests/entityset_tests/test_relationship.py",
    "content": "from featuretools.entityset.relationship import Relationship, RelationshipPath\n\n\ndef test_relationship_path(es):\n    log_to_sessions = Relationship(es, \"sessions\", \"id\", \"log\", \"session_id\")\n    sessions_to_customers = Relationship(\n        es,\n        \"customers\",\n        \"id\",\n        \"sessions\",\n        \"customer_id\",\n    )\n    path_list = [\n        (True, log_to_sessions),\n        (True, sessions_to_customers),\n        (False, sessions_to_customers),\n    ]\n    path = RelationshipPath(path_list)\n\n    for i, edge in enumerate(path_list):\n        assert path[i] == edge\n\n    assert [edge for edge in path] == path_list\n\n\ndef test_relationship_path_name(es):\n    assert RelationshipPath([]).name == \"\"\n\n    log_to_sessions = Relationship(es, \"sessions\", \"id\", \"log\", \"session_id\")\n    sessions_to_customers = Relationship(\n        es,\n        \"customers\",\n        \"id\",\n        \"sessions\",\n        \"customer_id\",\n    )\n\n    forward_path = [(True, log_to_sessions), (True, sessions_to_customers)]\n    assert RelationshipPath(forward_path).name == \"sessions.customers\"\n\n    backward_path = [(False, sessions_to_customers), (False, log_to_sessions)]\n    assert RelationshipPath(backward_path).name == \"sessions.log\"\n\n    mixed_path = [(True, log_to_sessions), (False, log_to_sessions)]\n    assert RelationshipPath(mixed_path).name == \"sessions.log\"\n\n\ndef test_relationship_path_dataframes(es):\n    assert list(RelationshipPath([]).dataframes()) == []\n\n    log_to_sessions = Relationship(es, \"sessions\", \"id\", \"log\", \"session_id\")\n    sessions_to_customers = Relationship(\n        es,\n        \"customers\",\n        \"id\",\n        \"sessions\",\n        \"customer_id\",\n    )\n\n    forward_path = [(True, log_to_sessions), (True, sessions_to_customers)]\n    assert list(RelationshipPath(forward_path).dataframes()) == [\n        \"log\",\n        \"sessions\",\n        
\"customers\",\n    ]\n\n    backward_path = [(False, sessions_to_customers), (False, log_to_sessions)]\n    assert list(RelationshipPath(backward_path).dataframes()) == [\n        \"customers\",\n        \"sessions\",\n        \"log\",\n    ]\n\n    mixed_path = [(True, log_to_sessions), (False, log_to_sessions)]\n    assert list(RelationshipPath(mixed_path).dataframes()) == [\"log\", \"sessions\", \"log\"]\n\n\ndef test_names_when_multiple_relationships_between_dataframes(games_es):\n    relationship = Relationship(games_es, \"teams\", \"id\", \"games\", \"home_team_id\")\n    assert relationship.child_name == \"games[home_team_id]\"\n    assert relationship.parent_name == \"teams[home_team_id]\"\n\n\ndef test_names_when_no_other_relationship_between_dataframes(home_games_es):\n    relationship = Relationship(home_games_es, \"teams\", \"id\", \"games\", \"home_team_id\")\n    assert relationship.child_name == \"games\"\n    assert relationship.parent_name == \"teams\"\n\n\ndef test_relationship_serialization(es):\n    relationship = Relationship(es, \"sessions\", \"id\", \"log\", \"session_id\")\n\n    dictionary = {\n        \"parent_dataframe_name\": \"sessions\",\n        \"parent_column_name\": \"id\",\n        \"child_dataframe_name\": \"log\",\n        \"child_column_name\": \"session_id\",\n    }\n    assert relationship.to_dictionary() == dictionary\n    assert Relationship.from_dictionary(dictionary, es) == relationship\n"
  },
  {
    "path": "featuretools/tests/entityset_tests/test_serialization.py",
    "content": "import json\nimport logging\nimport os\nimport tempfile\nfrom unittest.mock import MagicMock, patch\nfrom urllib.request import urlretrieve\n\nimport boto3\nimport pandas as pd\nimport pytest\nimport woodwork.type_sys.type_system as ww_type_system\nfrom woodwork.logical_types import LogicalType, Ordinal\nfrom woodwork.serializers.serializer_base import typing_info_to_dict\nfrom woodwork.type_sys.utils import list_logical_types\n\nfrom featuretools.entityset import EntitySet, deserialize, serialize\nfrom featuretools.version import ENTITYSET_SCHEMA_VERSION\n\nBUCKET_NAME = \"test-bucket\"\nWRITE_KEY_NAME = \"test-key\"\nTEST_S3_URL = \"s3://{}/{}\".format(BUCKET_NAME, WRITE_KEY_NAME)\nTEST_FILE = \"test_serialization_data_entityset_schema_{}_2022_09_02.tar\".format(\n    ENTITYSET_SCHEMA_VERSION,\n)\nS3_URL = \"s3://featuretools-static/\" + TEST_FILE\nURL = \"https://featuretools-static.s3.amazonaws.com/\" + TEST_FILE\nTEST_KEY = \"test_access_key_es\"\n\n\ndef test_entityset_description(es):\n    description = serialize.entityset_to_description(es)\n    _es = deserialize.description_to_entityset(description)\n    assert es.metadata.__eq__(_es, deep=True)\n\n\ndef test_all_ww_logical_types():\n    logical_types = list_logical_types()[\"type_string\"].to_list()\n    dataframe = pd.DataFrame(columns=logical_types)\n    es = EntitySet()\n    ltype_dict = {ltype: ltype for ltype in logical_types}\n    ltype_dict[\"ordinal\"] = Ordinal(order=[])\n    es.add_dataframe(\n        dataframe=dataframe,\n        dataframe_name=\"all_types\",\n        index=\"integer\",\n        logical_types=ltype_dict,\n    )\n    description = serialize.entityset_to_description(es)\n    _es = deserialize.description_to_entityset(description)\n    assert es.__eq__(_es, deep=True)\n\n\ndef test_with_custom_ww_logical_type():\n    class CustomLogicalType(LogicalType):\n        pass\n\n    ww_type_system.add_type(CustomLogicalType)\n    columns = [\"integer\", 
\"natural_language\", \"custom_logical_type\"]\n    dataframe = pd.DataFrame(columns=columns)\n    es = EntitySet()\n    ltype_dict = {\n        \"integer\": \"integer\",\n        \"natural_language\": \"natural_language\",\n        \"custom_logical_type\": CustomLogicalType,\n    }\n    es.add_dataframe(\n        dataframe=dataframe,\n        dataframe_name=\"custom_type\",\n        index=\"integer\",\n        logical_types=ltype_dict,\n    )\n    description = serialize.entityset_to_description(es)\n    _es = deserialize.description_to_entityset(description)\n    assert isinstance(\n        _es[\"custom_type\"].ww.logical_types[\"custom_logical_type\"],\n        CustomLogicalType,\n    )\n    assert es.__eq__(_es, deep=True)\n\n\ndef test_serialize_invalid_formats(es, tmp_path):\n    error_text = \"must be one of the following formats: {}\"\n    error_text = error_text.format(\", \".join(serialize.FORMATS))\n    with pytest.raises(ValueError, match=error_text):\n        serialize.write_data_description(es, path=str(tmp_path), format=\"\")\n\n\ndef test_empty_dataframe(es):\n    for df in es.dataframes:\n        description = typing_info_to_dict(df)\n        dataframe = deserialize.empty_dataframe(description)\n        assert dataframe.empty\n        assert all(dataframe.columns == df.columns)\n\n\ndef test_to_csv(es, tmp_path):\n    es.to_csv(str(tmp_path), encoding=\"utf-8\", engine=\"python\")\n    new_es = deserialize.read_entityset(str(tmp_path))\n    assert es.__eq__(new_es, deep=True)\n    df = es[\"log\"]\n    new_df = new_es[\"log\"]\n    assert type(df[\"latlong\"][0]) in (tuple, list)\n    assert type(new_df[\"latlong\"][0]) in (tuple, list)\n\n\ndef test_to_csv_interesting_values(es, tmp_path):\n    es.add_interesting_values()\n    es.to_csv(str(tmp_path))\n    new_es = deserialize.read_entityset(str(tmp_path))\n    assert es.__eq__(new_es, deep=True)\n\n\ndef test_to_csv_manual_interesting_values(es, tmp_path):\n    es.add_interesting_values(\n        
dataframe_name=\"log\",\n        values={\"product_id\": [\"coke_zero\"]},\n    )\n    es.to_csv(str(tmp_path))\n    new_es = deserialize.read_entityset(str(tmp_path))\n    assert es.__eq__(new_es, deep=True)\n    assert new_es[\"log\"].ww[\"product_id\"].ww.metadata[\"interesting_values\"] == [\n        \"coke_zero\",\n    ]\n\n\ndef test_to_pickle(es, tmp_path):\n    es.to_pickle(str(tmp_path))\n    new_es = deserialize.read_entityset(str(tmp_path))\n    assert es.__eq__(new_es, deep=True)\n    assert type(es[\"log\"][\"latlong\"][0]) == tuple\n    assert type(new_es[\"log\"][\"latlong\"][0]) == tuple\n\n\ndef test_to_pickle_interesting_values(es, tmp_path):\n    es.add_interesting_values()\n    es.to_pickle(str(tmp_path))\n    new_es = deserialize.read_entityset(str(tmp_path))\n    assert es.__eq__(new_es, deep=True)\n\n\ndef test_to_pickle_manual_interesting_values(es, tmp_path):\n    es.add_interesting_values(\n        dataframe_name=\"log\",\n        values={\"product_id\": [\"coke_zero\"]},\n    )\n    es.to_pickle(str(tmp_path))\n    new_es = deserialize.read_entityset(str(tmp_path))\n    assert es.__eq__(new_es, deep=True)\n    assert new_es[\"log\"].ww[\"product_id\"].ww.metadata[\"interesting_values\"] == [\n        \"coke_zero\",\n    ]\n\n\ndef test_to_parquet(es, tmp_path):\n    es.to_parquet(str(tmp_path))\n    new_es = deserialize.read_entityset(str(tmp_path))\n    assert es.__eq__(new_es, deep=True)\n    df = es[\"log\"]\n    new_df = new_es[\"log\"]\n    assert type(df[\"latlong\"][0]) in (tuple, list)\n    assert type(new_df[\"latlong\"][0]) in (tuple, list)\n\n\ndef test_to_parquet_manual_interesting_values(es, tmp_path):\n    es.add_interesting_values(\n        dataframe_name=\"log\",\n        values={\"product_id\": [\"coke_zero\"]},\n    )\n    es.to_parquet(str(tmp_path))\n    new_es = deserialize.read_entityset(str(tmp_path))\n    assert es.__eq__(new_es, deep=True)\n    assert 
new_es[\"log\"].ww[\"product_id\"].ww.metadata[\"interesting_values\"] == [\n        \"coke_zero\",\n    ]\n\n\ndef test_to_parquet_interesting_values(es, tmp_path):\n    es.add_interesting_values()\n    es.to_parquet(str(tmp_path))\n    new_es = deserialize.read_entityset(str(tmp_path))\n    assert es.__eq__(new_es, deep=True)\n\n\ndef test_to_parquet_with_lti(tmp_path, mock_customer):\n    es = mock_customer\n    es.to_parquet(str(tmp_path))\n    new_es = deserialize.read_entityset(str(tmp_path))\n    assert es.__eq__(new_es, deep=True)\n\n\ndef test_to_pickle_id_none(tmp_path):\n    es = EntitySet()\n    es.to_pickle(str(tmp_path))\n    new_es = deserialize.read_entityset(str(tmp_path))\n    assert es.__eq__(new_es, deep=True)\n\n\n# TODO: Fix Moto tests needing to explicitly set permissions for objects\n@pytest.fixture\ndef s3_client():\n    _environ = os.environ.copy()\n    from moto import mock_aws\n\n    with mock_aws():\n        s3 = boto3.resource(\"s3\")\n        yield s3\n    os.environ.clear()\n    os.environ.update(_environ)\n\n\n@pytest.fixture\ndef s3_bucket(s3_client, region=\"us-east-2\"):\n    location = {\"LocationConstraint\": region}\n    s3_client.create_bucket(\n        Bucket=BUCKET_NAME,\n        ACL=\"public-read-write\",\n        CreateBucketConfiguration=location,\n    )\n    s3_bucket = s3_client.Bucket(BUCKET_NAME)\n    yield s3_bucket\n\n\ndef make_public(s3_client, s3_bucket):\n    obj = list(s3_bucket.objects.all())[0].key\n    s3_client.ObjectAcl(BUCKET_NAME, obj).put(ACL=\"public-read-write\")\n\n\n@pytest.mark.parametrize(\"profile_name\", [None, False])\ndef test_serialize_s3_csv(es, s3_client, s3_bucket, profile_name):\n    es.to_csv(TEST_S3_URL, encoding=\"utf-8\", engine=\"python\", profile_name=profile_name)\n    make_public(s3_client, s3_bucket)\n    new_es = deserialize.read_entityset(TEST_S3_URL, profile_name=profile_name)\n    assert es.__eq__(new_es, deep=True)\n\n\n@pytest.mark.parametrize(\"profile_name\", [None, 
False])\ndef test_serialize_s3_pickle(es, s3_client, s3_bucket, profile_name):\n    es.to_pickle(TEST_S3_URL, profile_name=profile_name)\n    make_public(s3_client, s3_bucket)\n    new_es = deserialize.read_entityset(TEST_S3_URL, profile_name=profile_name)\n    assert es.__eq__(new_es, deep=True)\n\n\n@pytest.mark.parametrize(\"profile_name\", [None, False])\ndef test_serialize_s3_parquet(es, s3_client, s3_bucket, profile_name):\n    es.to_parquet(TEST_S3_URL, profile_name=profile_name)\n    make_public(s3_client, s3_bucket)\n    new_es = deserialize.read_entityset(TEST_S3_URL, profile_name=profile_name)\n    assert es.__eq__(new_es, deep=True)\n\n\ndef test_s3_test_profile(es, s3_client, s3_bucket, setup_test_profile):\n    es.to_csv(TEST_S3_URL, encoding=\"utf-8\", engine=\"python\", profile_name=\"test\")\n    make_public(s3_client, s3_bucket)\n    new_es = deserialize.read_entityset(TEST_S3_URL, profile_name=\"test\")\n    assert es.__eq__(new_es, deep=True)\n\n\ndef test_serialize_url_csv(es):\n    error_text = \"Writing to URLs is not supported\"\n    with pytest.raises(ValueError, match=error_text):\n        es.to_csv(URL, encoding=\"utf-8\", engine=\"python\")\n\n\ndef test_serialize_subdirs_not_removed(es, tmp_path):\n    write_path = tmp_path.joinpath(\"test\")\n    write_path.mkdir()\n    test_dir = write_path.joinpath(\"test_dir\")\n    test_dir.mkdir()\n    description_path = write_path.joinpath(\"data_description.json\")\n    with open(description_path, \"w\") as f:\n        json.dump(\"__SAMPLE_TEXT__\", f)\n    compression = None\n    serialize.write_data_description(\n        es,\n        path=str(write_path),\n        index=\"1\",\n        sep=\"\\t\",\n        encoding=\"utf-8\",\n        compression=compression,\n    )\n    assert os.path.exists(str(test_dir))\n    with open(description_path, \"r\") as f:\n        assert \"__SAMPLE_TEXT__\" not in json.load(f)\n\n\ndef test_deserialize_local_tar(es):\n    with tempfile.TemporaryDirectory() as 
tmp_path:\n        temp_tar_filepath = os.path.join(tmp_path, TEST_FILE)\n        urlretrieve(URL, filename=temp_tar_filepath)\n        new_es = deserialize.read_entityset(temp_tar_filepath)\n        assert es.__eq__(new_es, deep=True)\n\n\n@patch(\"featuretools.entityset.deserialize.getfullargspec\")\ndef test_deserialize_errors_if_python_version_unsafe(mock_inspect, es):\n    mock_response = MagicMock()\n    mock_response.kwonlyargs = []\n    mock_inspect.return_value = mock_response\n    with tempfile.TemporaryDirectory() as tmp_path:\n        temp_tar_filepath = os.path.join(tmp_path, TEST_FILE)\n        urlretrieve(URL, filename=temp_tar_filepath)\n        with pytest.raises(RuntimeError, match=\"\"):\n            deserialize.read_entityset(temp_tar_filepath)\n\n\ndef test_deserialize_url_csv(es):\n    new_es = deserialize.read_entityset(URL)\n    assert es.__eq__(new_es, deep=True)\n\n\ndef test_deserialize_s3_csv(es):\n    new_es = deserialize.read_entityset(S3_URL, profile_name=False)\n    assert es.__eq__(new_es, deep=True)\n\n\ndef test_operations_invalidate_metadata(es):\n    new_es = EntitySet(id=\"test\")\n    # test metadata gets created on access\n    assert new_es._data_description is None\n    assert new_es.metadata is not None  # generated after access\n    assert new_es._data_description is not None\n    customers_ltypes = None\n    new_es.add_dataframe(\n        es[\"customers\"],\n        \"customers\",\n        logical_types=customers_ltypes,\n    )\n    sessions_ltypes = None\n    new_es.add_dataframe(\n        es[\"sessions\"],\n        \"sessions\",\n        logical_types=sessions_ltypes,\n    )\n\n    assert new_es._data_description is None\n    assert new_es.metadata is not None\n    assert new_es._data_description is not None\n\n    new_es = new_es.add_relationship(\"customers\", \"id\", \"sessions\", \"customer_id\")\n    assert new_es._data_description is None\n    assert new_es.metadata is not None\n    assert new_es._data_description 
is not None\n\n    new_es = new_es.normalize_dataframe(\"customers\", \"cohort\", \"cohort\")\n    assert new_es._data_description is None\n    assert new_es.metadata is not None\n    assert new_es._data_description is not None\n\n    new_es.add_last_time_indexes()\n    assert new_es._data_description is None\n    assert new_es.metadata is not None\n    assert new_es._data_description is not None\n\n    new_es.add_interesting_values()\n    assert new_es._data_description is None\n    assert new_es.metadata is not None\n    assert new_es._data_description is not None\n\n\ndef test_reset_metadata(es):\n    assert es.metadata is not None\n    assert es._data_description is not None\n    es.reset_data_description()\n    assert es._data_description is None\n\n\n@patch(\"featuretools.utils.schema_utils.ENTITYSET_SCHEMA_VERSION\", \"1.1.1\")\n@pytest.mark.parametrize(\n    \"hardcoded_schema_version, warns\",\n    [(\"2.1.1\", True), (\"1.2.1\", True), (\"1.1.2\", True), (\"1.0.2\", False)],\n)\ndef test_later_schema_version(es, caplog, hardcoded_schema_version, warns):\n    def test_version(version, warns):\n        if warns:\n            warning_text = (\n                \"The schema version of the saved entityset\"\n                \"(%s) is greater than the latest supported (%s). \"\n                \"You may need to upgrade featuretools. 
Attempting to load entityset ...\"\n                % (version, \"1.1.1\")\n            )\n        else:\n            warning_text = None\n\n        _check_schema_version(version, es, warning_text, caplog, \"warn\")\n\n    test_version(hardcoded_schema_version, warns)\n\n\n@patch(\"featuretools.utils.schema_utils.ENTITYSET_SCHEMA_VERSION\", \"1.1.1\")\n@pytest.mark.parametrize(\n    \"hardcoded_schema_version, warns\",\n    [(\"0.1.1\", True), (\"1.0.1\", False), (\"1.1.0\", False)],\n)\ndef test_earlier_schema_version(\n    es,\n    caplog,\n    monkeypatch,\n    hardcoded_schema_version,\n    warns,\n):\n    def test_version(version, warns):\n        if warns:\n            warning_text = (\n                \"The schema version of the saved entityset\"\n                \"(%s) is no longer supported by this version \"\n                \"of featuretools. Attempting to load entityset ...\" % version\n            )\n        else:\n            warning_text = None\n\n        _check_schema_version(version, es, warning_text, caplog, \"log\")\n\n    test_version(hardcoded_schema_version, warns)\n\n\ndef _check_schema_version(version, es, warning_text, caplog, warning_type=None):\n    dataframes = {\n        dataframe.ww.name: typing_info_to_dict(dataframe) for dataframe in es.dataframes\n    }\n    relationships = [relationship.to_dictionary() for relationship in es.relationships]\n    dictionary = {\n        \"schema_version\": version,\n        \"id\": es.id,\n        \"dataframes\": dataframes,\n        \"relationships\": relationships,\n    }\n\n    if warning_type == \"warn\" and warning_text:\n        with pytest.warns(UserWarning) as record:\n            deserialize.description_to_entityset(dictionary)\n        assert record[0].message.args[0] == warning_text\n    elif warning_type == \"log\":\n        logger = logging.getLogger(\"featuretools\")\n        logger.propagate = True\n        deserialize.description_to_entityset(dictionary)\n        if warning_text:\n    
        assert warning_text in caplog.text\n        else:\n            assert not len(caplog.text)\n        logger.propagate = False\n"
  },
  {
    "path": "featuretools/tests/entityset_tests/test_timedelta.py",
    "content": "import pandas as pd\nimport pytest\nfrom dateutil.relativedelta import relativedelta\n\nfrom featuretools.entityset import Timedelta\nfrom featuretools.feature_base import Feature\nfrom featuretools.primitives import Count\nfrom featuretools.utils.wrangle import _check_timedelta\n\n\ndef test_timedelta_equality():\n    assert Timedelta(10, \"d\") == Timedelta(10, \"d\")\n    assert Timedelta(10, \"d\") != 1\n\n\ndef test_singular():\n    assert Timedelta.make_singular(\"Month\") == \"Month\"\n    assert Timedelta.make_singular(\"Months\") == \"Month\"\n\n\ndef test_delta_with_observations(es):\n    four_delta = Timedelta(4, \"observations\")\n    assert not four_delta.is_absolute()\n    assert four_delta.get_value(\"o\") == 4\n\n    neg_four_delta = -four_delta\n    assert not neg_four_delta.is_absolute()\n    assert neg_four_delta.get_value(\"o\") == -4\n\n    time = pd.to_datetime(\"2019-05-01\")\n\n    error_txt = \"Invalid unit\"\n    with pytest.raises(Exception, match=error_txt):\n        time + four_delta\n\n    with pytest.raises(Exception, match=error_txt):\n        time - four_delta\n\n\ndef test_delta_with_time_unit_matches_pandas(es):\n    customer_id = 0\n    sessions_df = es[\"sessions\"]\n    sessions_df = sessions_df[sessions_df[\"customer_id\"] == customer_id]\n    log_df = es[\"log\"]\n    log_df = log_df[log_df[\"session_id\"].isin(sessions_df[\"id\"])]\n    all_times = log_df[\"datetime\"].sort_values().tolist()\n\n    # 4 hour delta\n    value = 4\n    unit = \"h\"\n    delta = Timedelta(value, unit)\n    neg_delta = -delta\n    # adding the delta matches adding the equivalent pd.Timedelta\n    assert all_times[0] + delta == all_times[0] + pd.Timedelta(value, unit)\n    # subtracting the negated delta is the same as adding the delta\n    assert all_times[0] - neg_delta == all_times[0] + pd.Timedelta(value, unit)\n\n    # subtracting the delta matches subtracting the equivalent pd.Timedelta\n    assert all_times[4] - delta == all_times[4] - pd.Timedelta(value, unit)\n    # adding the negated delta is the same as subtracting the delta\n    assert all_times[4] + neg_delta == all_times[4] - 
pd.Timedelta(value, unit)\n\n\ndef test_check_timedelta(es):\n    time_units = list(Timedelta._readable_units.keys())\n    expanded_units = list(Timedelta._readable_units.values())\n    exp_to_standard_unit = {e: t for e, t in zip(expanded_units, time_units)}\n    singular_units = [u[:-1] for u in expanded_units]\n    sing_to_standard_unit = {s: t for s, t in zip(singular_units, time_units)}\n    to_standard_unit = {}\n    to_standard_unit.update(exp_to_standard_unit)\n    to_standard_unit.update(sing_to_standard_unit)\n    full_units = singular_units + expanded_units + time_units + time_units\n\n    strings = [\"2 {}\".format(u) for u in singular_units + expanded_units + time_units]\n    strings += [\"2{}\".format(u) for u in time_units]\n    for i, s in enumerate(strings):\n        unit = full_units[i]\n        standard_unit = unit\n        if unit in to_standard_unit:\n            standard_unit = to_standard_unit[unit]\n\n        td = _check_timedelta(s)\n        assert td.get_value(standard_unit) == 2\n\n\ndef test_check_pd_timedelta(es):\n    pdtd = pd.Timedelta(5, \"m\")\n    td = _check_timedelta(pdtd)\n    assert td.get_value(\"s\") == 300\n\n\ndef test_string_timedelta_args():\n    assert Timedelta(\"1 second\") == Timedelta(1, \"second\")\n    assert Timedelta(\"1 seconds\") == Timedelta(1, \"second\")\n    assert Timedelta(\"10 days\") == Timedelta(10, \"days\")\n    assert Timedelta(\"100 days\") == Timedelta(100, \"days\")\n    assert Timedelta(\"1001 days\") == Timedelta(1001, \"days\")\n    assert Timedelta(\"1001 weeks\") == Timedelta(1001, \"weeks\")\n\n\ndef test_feature_takes_timedelta_string(es):\n    feature = Feature(\n        Feature(es[\"log\"].ww[\"id\"]),\n        parent_dataframe_name=\"customers\",\n        use_previous=\"1 day\",\n        primitive=Count,\n    )\n    assert feature.use_previous == Timedelta(1, \"d\")\n\n\ndef test_deltas_week(es):\n    customer_id = 0\n    sessions_df = es[\"sessions\"]\n    sessions_df = 
sessions_df[sessions_df[\"customer_id\"] == customer_id]\n    log_df = es[\"log\"]\n    log_df = log_df[log_df[\"session_id\"].isin(sessions_df[\"id\"])]\n    all_times = log_df[\"datetime\"].sort_values().tolist()\n    delta_week = Timedelta(1, \"w\")\n    delta_days = Timedelta(7, \"d\")\n\n    assert all_times[0] + delta_days == all_times[0] + delta_week\n\n\ndef test_relative_year():\n    td_time = \"1 years\"\n    td = _check_timedelta(td_time)\n    assert td.get_value(\"Y\") == 1\n    assert isinstance(td.delta_obj, relativedelta)\n\n    time = pd.to_datetime(\"2020-02-29\")\n    assert time + td == pd.to_datetime(\"2021-02-28\")\n\n\ndef test_serialization():\n    times = [Timedelta(1, unit=\"w\"), Timedelta(3, unit=\"d\"), Timedelta(5, unit=\"o\")]\n\n    dictionaries = [\n        {\"value\": 1, \"unit\": \"w\"},\n        {\"value\": 3, \"unit\": \"d\"},\n        {\"value\": 5, \"unit\": \"o\"},\n    ]\n\n    for td, expected in zip(times, dictionaries):\n        assert expected == td.get_arguments()\n\n    for expected, dictionary in zip(times, dictionaries):\n        assert expected == Timedelta.from_dictionary(dictionary)\n\n    # Test multiple temporal parameters separately since it is not deterministic\n    mult_time = {\"years\": 4, \"months\": 3, \"days\": 2}\n    mult_td = Timedelta(mult_time)\n\n    # Serialize\n    td_units = mult_td.get_arguments()[\"unit\"]\n    td_values = mult_td.get_arguments()[\"value\"]\n    arg_list = list(zip(td_values, td_units))\n\n    assert (4, \"Y\") in arg_list\n    assert (3, \"mo\") in arg_list\n    assert (2, \"d\") in arg_list\n\n    # Deserialize\n    assert mult_td == Timedelta.from_dictionary(\n        {\"value\": [4, 3, 2], \"unit\": [\"Y\", \"mo\", \"d\"]},\n    )\n\n\ndef test_relative_month():\n    td_time = \"1 month\"\n    td = _check_timedelta(td_time)\n    assert td.get_value(\"mo\") == 1\n    assert isinstance(td.delta_obj, relativedelta)\n\n    time = pd.to_datetime(\"2020-01-31\")\n    assert time 
+ td == pd.to_datetime(\"2020-02-29\")\n\n    td_time = \"6 months\"\n    td = _check_timedelta(td_time)\n    assert td.get_value(\"mo\") == 6\n    assert isinstance(td.delta_obj, relativedelta)\n\n    time = pd.to_datetime(\"2020-01-31\")\n    assert time + td == pd.to_datetime(\"2020-07-31\")\n\n\ndef test_has_multiple_units():\n    single_unit = pd.DateOffset(months=3)\n    multiple_units = pd.DateOffset(months=3, years=3, days=5)\n    single_td = _check_timedelta(single_unit)\n    multiple_td = _check_timedelta(multiple_units)\n    assert single_td.has_multiple_units() is False\n    assert multiple_td.has_multiple_units() is True\n\n\ndef test_pd_dateoffset_to_timedelta():\n    single_temporal = pd.DateOffset(months=3)\n    single_td = _check_timedelta(single_temporal)\n    assert single_td.get_value(\"mo\") == 3\n    assert single_td.delta_obj == pd.DateOffset(months=3)\n\n    mult_temporal = pd.DateOffset(years=10, months=3, days=5)\n    mult_td = _check_timedelta(mult_temporal)\n    expected = {\"Y\": 10, \"mo\": 3, \"d\": 5}\n    assert mult_td.get_value() == expected\n    assert mult_td.delta_obj == mult_temporal\n    # get_name() for multiple values is not deterministic\n    assert len(mult_td.get_name()) == len(\"10 Years 3 Months 5 Days\")\n\n    special_dateoffset = pd.offsets.BDay(100)\n    special_td = _check_timedelta(special_dateoffset)\n    assert special_td.get_value(\"businessdays\") == 100\n    assert special_td.delta_obj == special_dateoffset\n\n\ndef test_pd_dateoffset_to_timedelta_math():\n    base = pd.to_datetime(\"2020-01-31\")\n    add = _check_timedelta(pd.DateOffset(months=2))\n    res = base + add\n    assert res == pd.to_datetime(\"2020-03-31\")\n\n    base_2 = pd.to_datetime(\"2020-01-31\")\n    add_2 = _check_timedelta(pd.DateOffset(months=2, days=3))\n    res_2 = base_2 + add_2\n    assert res_2 == pd.to_datetime(\"2020-04-03\")\n\n    base_3 = pd.to_datetime(\"2019-09-20\")\n    sub = _check_timedelta(pd.offsets.BDay(10))\n    
res_3 = base_3 - sub\n    assert res_3 == pd.to_datetime(\"2019-09-06\")\n"
  },
  {
    "path": "featuretools/tests/entityset_tests/test_ww_es.py",
    "content": "from datetime import datetime\n\nimport numpy as np\nimport pandas as pd\nimport pytest\nfrom woodwork.exceptions import TypeConversionError\nfrom woodwork.logical_types import (\n    Boolean,\n    Categorical,\n    Datetime,\n    Double,\n    Integer,\n    NaturalLanguage,\n)\n\nfrom featuretools.entityset.entityset import LTI_COLUMN_NAME, EntitySet\n\n\ndef test_empty_es():\n    es = EntitySet(\"es\")\n    assert es.id == \"es\"\n    assert es.dataframe_dict == {}\n    assert es.relationships == []\n    assert es.time_type is None\n\n\n@pytest.fixture\ndef df():\n    return pd.DataFrame({\"id\": [0, 1, 2], \"category\": [\"a\", \"b\", \"c\"]}).astype(\n        {\"category\": \"category\"},\n    )\n\n\ndef test_init_es_with_dataframe(df):\n    es = EntitySet(\"es\", dataframes={\"table\": (df, \"id\")})\n    assert es.id == \"es\"\n    assert len(es.dataframe_dict) == 1\n    assert es[\"table\"] is df\n\n    assert es[\"table\"].ww.schema is not None\n    assert isinstance(es[\"table\"].ww.logical_types[\"id\"], Integer)\n    assert isinstance(es[\"table\"].ww.logical_types[\"category\"], Categorical)\n\n\ndef test_init_es_with_woodwork_table_same_name(df):\n    df.ww.init(index=\"id\", name=\"table\")\n    es = EntitySet(\"es\", dataframes={\"table\": (df,)})\n\n    assert es.id == \"es\"\n    assert len(es.dataframe_dict) == 1\n    assert es[\"table\"] is df\n\n    assert es[\"table\"].ww.schema is not None\n\n    assert es[\"table\"].ww.index == \"id\"\n    assert es[\"table\"].ww.time_index is None\n\n    assert isinstance(es[\"table\"].ww.logical_types[\"id\"], Integer)\n    assert isinstance(es[\"table\"].ww.logical_types[\"category\"], Categorical)\n\n\ndef test_init_es_with_woodwork_table_diff_name_error(df):\n    df.ww.init(index=\"id\", name=\"table\")\n    error = \"Naming conflict in dataframes dictionary: dictionary key 'diff_name' does not match dataframe name 'table'\"\n    with pytest.raises(ValueError, match=error):\n        
EntitySet(\"es\", dataframes={\"diff_name\": (df,)})\n\n\ndef test_init_es_with_dataframe_and_params(df):\n    logical_types = {\"id\": \"NaturalLanguage\", \"category\": NaturalLanguage}\n    semantic_tags = {\"category\": \"new_tag\"}\n    es = EntitySet(\n        \"es\",\n        dataframes={\"table\": (df, \"id\", None, logical_types, semantic_tags)},\n    )\n\n    assert es.id == \"es\"\n    assert len(es.dataframe_dict) == 1\n    assert es[\"table\"] is df\n\n    assert es[\"table\"].ww.schema is not None\n\n    assert es[\"table\"].ww.index == \"id\"\n    assert es[\"table\"].ww.time_index is None\n\n    assert isinstance(es[\"table\"].ww.logical_types[\"id\"], NaturalLanguage)\n    assert isinstance(es[\"table\"].ww.logical_types[\"category\"], NaturalLanguage)\n\n    assert es[\"table\"].ww.semantic_tags[\"id\"] == {\"index\"}\n    assert es[\"table\"].ww.semantic_tags[\"category\"] == {\"new_tag\"}\n\n\ndef test_init_es_with_multiple_dataframes(df):\n    second_df = pd.DataFrame({\"id\": [0, 1, 2, 3], \"first_table_id\": [1, 2, 2, 1]})\n\n    df.ww.init(name=\"first_table\", index=\"id\")\n\n    es = EntitySet(\n        \"es\",\n        dataframes={\n            \"first_table\": (df,),\n            \"second_table\": (\n                second_df,\n                \"id\",\n                None,\n                None,\n                {\"first_table_id\": \"foreign_key\"},\n            ),\n        },\n    )\n\n    assert len(es.dataframe_dict) == 2\n    assert es[\"first_table\"].ww.schema is not None\n    assert es[\"second_table\"].ww.schema is not None\n\n\ndef test_add_dataframe_to_es(df):\n    es1 = EntitySet(\"es\")\n    assert es1.dataframe_dict == {}\n    es1.add_dataframe(\n        df,\n        dataframe_name=\"table\",\n        index=\"id\",\n        semantic_tags={\"category\": \"new_tag\"},\n    )\n    assert len(es1.dataframe_dict) == 1\n\n    copy_df = df.ww.copy()\n\n    es2 = EntitySet(\"es\")\n    assert es2.dataframe_dict == {}\n    
es2.add_dataframe(copy_df)\n    assert len(es2.dataframe_dict) == 1\n\n    assert es1[\"table\"].ww == es2[\"table\"].ww\n\n\ndef test_change_es_dataframe_schema(df):\n    df.ww.init(index=\"id\", name=\"table\")\n    es = EntitySet(\"es\", dataframes={\"table\": (df,)})\n\n    assert es[\"table\"].ww.index == \"id\"\n\n    es[\"table\"].ww.set_index(\"category\")\n    assert es[\"table\"].ww.index == \"category\"\n\n\ndef test_init_es_with_relationships(df):\n    second_df = pd.DataFrame({\"id\": [0, 1, 2, 3], \"first_table_id\": [1, 2, 2, 1]})\n\n    df.ww.init(name=\"first_table\", index=\"id\")\n    second_df.ww.init(name=\"second_table\", index=\"id\")\n\n    es = EntitySet(\n        \"es\",\n        dataframes={\"first_table\": (df,), \"second_table\": (second_df,)},\n        relationships=[(\"first_table\", \"id\", \"second_table\", \"first_table_id\")],\n    )\n\n    assert len(es.relationships) == 1\n\n    forward_dataframes = [name for name, _ in es.get_forward_dataframes(\"second_table\")]\n    assert forward_dataframes[0] == \"first_table\"\n\n    relationship = es.relationships[0]\n    assert \"foreign_key\" in relationship.child_column.ww.semantic_tags\n    assert \"index\" in relationship.parent_column.ww.semantic_tags\n\n\n@pytest.fixture\ndef dates_df():\n    return pd.DataFrame(\n        {\n            \"backwards_order\": [8, 7, 6, 5, 4, 3, 2, 1, 0],\n            \"dates_backwards\": [\n                \"2020-09-09\",\n                \"2020-09-08\",\n                \"2020-09-07\",\n                \"2020-09-06\",\n                \"2020-09-05\",\n                \"2020-09-04\",\n                \"2020-09-03\",\n                \"2020-09-02\",\n                \"2020-09-01\",\n            ],\n            \"random_order\": [7, 6, 8, 0, 2, 4, 3, 1, 5],\n            \"repeating_dates\": [\n                \"2020-08-01\",\n                \"2019-08-01\",\n                \"2020-08-01\",\n                \"2012-08-01\",\n                
\"2019-08-01\",\n                \"2019-08-01\",\n                \"2019-08-01\",\n                \"2013-08-01\",\n                \"2019-08-01\",\n            ],\n            \"special\": [7, 8, 0, 1, 4, 2, 6, 3, 5],\n            \"special_dates\": [\n                \"2020-08-01\",\n                \"2019-08-01\",\n                \"2020-08-01\",\n                \"2012-08-01\",\n                \"2019-08-01\",\n                \"2019-08-01\",\n                \"2019-08-01\",\n                \"2013-08-01\",\n                \"2019-08-01\",\n            ],\n        },\n    )\n\n\ndef test_add_secondary_time_index(dates_df):\n    dates_df.ww.init(\n        name=\"dates_table\",\n        index=\"backwards_order\",\n        time_index=\"dates_backwards\",\n    )\n    es = EntitySet(\"es\")\n    es.add_dataframe(\n        dates_df,\n        secondary_time_index={\"repeating_dates\": [\"random_order\", \"special\"]},\n    )\n\n    assert dates_df.ww.metadata[\"secondary_time_index\"] == {\n        \"repeating_dates\": [\"random_order\", \"special\", \"repeating_dates\"],\n    }\n\n\ndef test_time_type_check_order(dates_df):\n    dates_df.ww.init(\n        name=\"dates_table\",\n        index=\"backwards_order\",\n        time_index=\"random_order\",\n    )\n    es = EntitySet(\"es\")\n\n    error = \"dates_table time index is Datetime type which differs from other entityset time indexes\"\n    with pytest.raises(TypeError, match=error):\n        es.add_dataframe(\n            dates_df,\n            secondary_time_index={\"repeating_dates\": [\"random_order\", \"special\"]},\n        )\n\n    assert \"secondary_time_index\" not in dates_df.ww.metadata\n\n\ndef test_add_time_index_through_woodwork_different_type(dates_df):\n    dates_df.ww.init(\n        name=\"dates_table\",\n        index=\"backwards_order\",\n        time_index=\"dates_backwards\",\n    )\n    es = EntitySet(\"es\")\n\n    es.add_dataframe(\n        dates_df,\n        
secondary_time_index={\"repeating_dates\": [\"random_order\", \"special\"]},\n    )\n\n    assert dates_df.ww.metadata[\"secondary_time_index\"] == {\n        \"repeating_dates\": [\"random_order\", \"special\", \"repeating_dates\"],\n    }\n    assert es.time_type == Datetime\n\n    assert es._check_uniform_time_index(es[\"dates_table\"]) is None\n\n    dates_df.ww.set_time_index(\"random_order\")\n    assert dates_df.ww.time_index == \"random_order\"\n\n    error = \"dates_table time index is numeric type which differs from other entityset time indexes\"\n    with pytest.raises(TypeError, match=error):\n        es._check_uniform_time_index(es[\"dates_table\"])\n\n\ndef test_init_with_mismatched_time_types(dates_df):\n    dates_df.ww.init(\n        name=\"dates_table\",\n        index=\"backwards_order\",\n        time_index=\"repeating_dates\",\n    )\n    es = EntitySet(\"es\")\n    es.add_dataframe(dates_df, secondary_time_index={\"special_dates\": [\"special\"]})\n    assert es.time_type == Datetime\n\n    nums_df = pd.DataFrame({\"id\": [1, 2, 3], \"times\": [9, 8, 7]})\n    nums_df.ww.init(name=\"numerics_table\", index=\"id\", time_index=\"times\")\n\n    error = \"numerics_table time index is numeric type which differs from other entityset time indexes\"\n    with pytest.raises(TypeError, match=error):\n        es.add_dataframe(nums_df)\n\n\ndef test_int_double_time_type(dates_df):\n    dates_df.ww.init(\n        name=\"dates_table\",\n        index=\"backwards_order\",\n        time_index=\"random_order\",\n        logical_types={\"random_order\": \"Integer\", \"special\": \"Double\"},\n    )\n    es = EntitySet(\"es\")\n\n    # Both random_order and special are numeric, but they are different logical types\n    es.add_dataframe(dates_df, secondary_time_index={\"special\": [\"dates_backwards\"]})\n\n    assert isinstance(es[\"dates_table\"].ww.logical_types[\"random_order\"], Integer)\n    assert 
isinstance(es[\"dates_table\"].ww.logical_types[\"special\"], Double)\n\n    assert es[\"dates_table\"].ww.time_index == \"random_order\"\n    assert \"special\" in es[\"dates_table\"].ww.metadata[\"secondary_time_index\"]\n\n\ndef test_normalize_dataframe():\n    df = pd.DataFrame(\n        {\n            \"id\": range(4),\n            \"full_name\": [\n                \"Mr. John Doe\",\n                \"Doe, Mrs. Jane\",\n                \"James Brown\",\n                \"Ms. Paige Turner\",\n            ],\n            \"email\": [\n                \"john.smith@example.com\",\n                np.nan,\n                \"team@featuretools.com\",\n                \"junk@example.com\",\n            ],\n            \"phone_number\": [\n                \"5555555555\",\n                \"555-555-5555\",\n                \"1-(555)-555-5555\",\n                \"555-555-5555\",\n            ],\n            \"age\": pd.Series([33, None, 33, 57], dtype=\"Int64\"),\n            \"signup_date\": [pd.to_datetime(\"2020-09-01\")] * 4,\n            \"is_registered\": pd.Series([True, False, True, None], dtype=\"boolean\"),\n        },\n    )\n\n    df.ww.init(name=\"first_table\", index=\"id\", time_index=\"signup_date\")\n    es = EntitySet(\"es\")\n    es.add_dataframe(df)\n    es.normalize_dataframe(\n        \"first_table\",\n        \"second_table\",\n        \"age\",\n        additional_columns=[\"phone_number\", \"full_name\"],\n        make_time_index=True,\n    )\n    assert len(es.dataframe_dict) == 2\n    assert \"foreign_key\" in es[\"first_table\"].ww.semantic_tags[\"age\"]\n\n\ndef test_replace_dataframe():\n    df = pd.DataFrame(\n        {\n            \"id\": range(4),\n            \"full_name\": [\n                \"Mr. John Doe\",\n                \"Doe, Mrs. Jane\",\n                \"James Brown\",\n                \"Ms. 
Paige Turner\",\n            ],\n            \"email\": [\n                \"john.smith@example.com\",\n                np.nan,\n                \"team@featuretools.com\",\n                \"junk@example.com\",\n            ],\n            \"phone_number\": [\n                \"5555555555\",\n                \"555-555-5555\",\n                \"1-(555)-555-5555\",\n                \"555-555-5555\",\n            ],\n            \"age\": pd.Series([33, None, 33, 57], dtype=\"Int64\"),\n            \"signup_date\": [pd.to_datetime(\"2020-09-01\")] * 4,\n            \"is_registered\": pd.Series([True, False, True, None], dtype=\"boolean\"),\n        },\n    )\n\n    df.ww.init(name=\"table\", index=\"id\")\n    es = EntitySet(\"es\")\n    es.add_dataframe(df)\n    original_schema = es[\"table\"].ww.schema\n\n    new_df = df.iloc[2:]\n    es.replace_dataframe(\"table\", new_df)\n\n    assert len(es[\"table\"]) == 2\n    assert es[\"table\"].ww.schema == original_schema\n\n\ndef test_add_last_time_index(es):\n    es.add_last_time_indexes([\"products\"])\n\n    assert \"last_time_index\" in es[\"products\"].ww.metadata\n\n    assert es[\"products\"].ww.metadata[\"last_time_index\"] == LTI_COLUMN_NAME\n    assert LTI_COLUMN_NAME in es[\"products\"]\n    assert \"last_time_index\" in es[\"products\"].ww.semantic_tags[LTI_COLUMN_NAME]\n    assert isinstance(es[\"products\"].ww.logical_types[LTI_COLUMN_NAME], Datetime)\n\n\ndef test_lti_already_has_last_time_column_name(es):\n    col = es[\"customers\"].ww.pop(\"loves_ice_cream\")\n    col.name = LTI_COLUMN_NAME\n\n    es[\"customers\"].ww[LTI_COLUMN_NAME] = col\n\n    assert LTI_COLUMN_NAME in es[\"customers\"].columns\n    assert isinstance(es[\"customers\"].ww.logical_types[LTI_COLUMN_NAME], Boolean)\n\n    error = (\n        \"Cannot add a last time index on DataFrame with an existing \"\n        f\"'{LTI_COLUMN_NAME}' column. 
Please rename '{LTI_COLUMN_NAME}'.\"\n    )\n    with pytest.raises(ValueError, match=error):\n        es.add_last_time_indexes([\"customers\"])\n\n\ndef test_numeric_es_last_time_index_logical_type(int_es):\n    assert int_es.time_type == \"numeric\"\n\n    int_es.add_last_time_indexes()\n\n    for df in int_es.dataframes:\n        assert isinstance(df.ww.logical_types[LTI_COLUMN_NAME], Double)\n        int_es._check_uniform_time_index(df, LTI_COLUMN_NAME)\n\n\ndef test_datetime_es_last_time_index_logical_type(es):\n    assert es.time_type == Datetime\n\n    es.add_last_time_indexes()\n\n    for df in es.dataframes:\n        assert isinstance(df.ww.logical_types[LTI_COLUMN_NAME], Datetime)\n        es._check_uniform_time_index(df, LTI_COLUMN_NAME)\n\n\ndef test_dataframe_without_name(es):\n    new_es = EntitySet()\n\n    new_df = es[\"sessions\"].copy()\n\n    assert new_df.ww.schema is None\n\n    error = \"Cannot add dataframe to EntitySet without a name. Please provide a value for the dataframe_name parameter.\"\n    with pytest.raises(ValueError, match=error):\n        new_es.add_dataframe(new_df)\n\n\ndef test_dataframe_with_name_parameter(es):\n    new_es = EntitySet()\n\n    new_df = es[\"sessions\"][[\"id\"]]\n\n    assert new_df.ww.schema is None\n\n    new_es.add_dataframe(\n        new_df,\n        dataframe_name=\"df_name\",\n        index=\"id\",\n        logical_types={\"id\": \"Integer\"},\n    )\n    assert new_es[\"df_name\"].ww.name == \"df_name\"\n\n\ndef test_woodwork_dataframe_without_name_errors(es):\n    new_es = EntitySet()\n\n    new_df = es[\"sessions\"].ww.copy()\n    new_df.ww._schema.name = None\n\n    assert new_df.ww.name is None\n\n    error = \"Cannot add a Woodwork DataFrame to EntitySet without a name\"\n    with pytest.raises(ValueError, match=error):\n        new_es.add_dataframe(new_df)\n\n\ndef test_woodwork_dataframe_with_name(es):\n    new_es = EntitySet()\n\n    new_df = es[\"sessions\"].ww.copy()\n    
new_df.ww._schema.name = \"df_name\"\n\n    assert new_df.ww.name == \"df_name\"\n\n    new_es.add_dataframe(new_df)\n\n    assert new_es[\"df_name\"].ww.name == \"df_name\"\n\n\ndef test_woodwork_dataframe_ignore_conflicting_name_parameter_warning(es):\n    new_es = EntitySet()\n\n    new_df = es[\"sessions\"].ww.copy()\n    new_df.ww._schema.name = \"df_name\"\n\n    assert new_df.ww.name == \"df_name\"\n\n    warning = \"A Woodwork-initialized DataFrame was provided, so the following parameters were ignored: dataframe_name\"\n    with pytest.warns(UserWarning, match=warning):\n        new_es.add_dataframe(new_df, dataframe_name=\"conflicting_name\")\n\n    assert new_es[\"df_name\"].ww.name == \"df_name\"\n\n\ndef test_woodwork_dataframe_same_name_parameter(es):\n    new_es = EntitySet()\n\n    new_df = es[\"sessions\"].ww.copy()\n    new_df.ww._schema.name = \"df_name\"\n\n    assert new_df.ww.name == \"df_name\"\n\n    new_es.add_dataframe(new_df, dataframe_name=\"df_name\")\n\n    assert new_es[\"df_name\"].ww.name == \"df_name\"\n\n\ndef test_extra_woodwork_params(es):\n    new_es = EntitySet()\n\n    sessions_df = es[\"sessions\"].ww.copy()\n\n    assert sessions_df.ww.index == \"id\"\n    assert sessions_df.ww.time_index is None\n    assert isinstance(sessions_df.ww.logical_types[\"id\"], Integer)\n\n    warning_msg = (\n        \"A Woodwork-initialized DataFrame was provided, so the following parameters were ignored: \"\n        \"index, time_index, logical_types, make_index, semantic_tags, already_sorted\"\n    )\n    with pytest.warns(UserWarning, match=warning_msg):\n        new_es.add_dataframe(\n            dataframe_name=\"sessions\",\n            dataframe=sessions_df,\n            index=\"filepath\",\n            time_index=\"customer_id\",\n            logical_types={\"id\": Categorical},\n            make_index=True,\n            already_sorted=True,\n            semantic_tags={\"id\": \"new_tag\"},\n        )\n    assert sessions_df.ww.index == 
\"id\"\n    assert sessions_df.ww.time_index is None\n    assert isinstance(sessions_df.ww.logical_types[\"id\"], Integer)\n    assert \"new_tag\" not in sessions_df.ww.semantic_tags\n\n\ndef test_replace_dataframe_errors(es):\n    df = es[\"customers\"].copy()\n    df[\"new\"] = pd.Series([1, 2, 3])\n\n    error_text = \"New dataframe is missing new cohort column\"\n    with pytest.raises(ValueError, match=error_text):\n        es.replace_dataframe(dataframe_name=\"customers\", df=df.drop(columns=[\"cohort\"]))\n\n    error_text = \"New dataframe contains 16 columns, expecting 15\"\n    with pytest.raises(ValueError, match=error_text):\n        es.replace_dataframe(dataframe_name=\"customers\", df=df)\n\n\ndef test_replace_dataframe_already_sorted(es):\n    # test already_sorted on dataframe without time index\n    df = es[\"sessions\"].copy()\n    updated_id = df[\"id\"]\n    updated_id.iloc[1] = 2\n    updated_id.iloc[2] = 1\n\n    df = df.set_index(\"id\", drop=False)\n    df.index.name = None\n    es.replace_dataframe(dataframe_name=\"sessions\", df=df.copy(), already_sorted=False)\n    sessions_df = es[\"sessions\"]\n    assert sessions_df[\"id\"].iloc[1] == 2  # no sorting since time index not defined\n    es.replace_dataframe(dataframe_name=\"sessions\", df=df.copy(), already_sorted=True)\n    sessions_df = es[\"sessions\"]\n    assert sessions_df[\"id\"].iloc[1] == 2\n\n    # test already_sorted on dataframe with time index\n    df = es[\"customers\"].copy()\n    updated_signup = df[\"signup_date\"]\n    updated_signup.iloc[0] = datetime(2011, 4, 11)\n\n    assert es[\"customers\"].ww.time_index == \"signup_date\"\n\n    df[\"signup_date\"] = updated_signup\n\n    es.replace_dataframe(dataframe_name=\"customers\", df=df.copy(), already_sorted=True)\n    customers_df = es[\"customers\"]\n    assert customers_df[\"id\"].iloc[0] == 2\n\n    es.replace_dataframe(dataframe_name=\"customers\", df=df.copy(), already_sorted=False)\n    updated_customers = 
es[\"customers\"]\n    assert updated_customers[\"id\"].iloc[0] == 0\n\n\ndef test_replace_dataframe_invalid_schema(es):\n    df = es[\"customers\"].copy()\n    df[\"id\"] = pd.Series([1, 1, 1])\n\n    error_text = \"Index column must be unique\"\n    with pytest.raises(IndexError, match=error_text):\n        es.replace_dataframe(dataframe_name=\"customers\", df=df)\n\n\ndef test_replace_dataframe_mismatched_index(es):\n    df = es[\"customers\"].copy()\n    df[\"id\"] = pd.Series([99, 88, 77])\n\n    es.replace_dataframe(dataframe_name=\"customers\", df=df)\n\n    assert all([77, 99, 88] == es[\"customers\"][\"id\"])\n    assert all([77, 99, 88] == (es[\"customers\"][\"id\"]).index)\n\n\ndef test_replace_dataframe_different_dtypes(es):\n    float_dtype_df = es[\"customers\"].copy()\n    float_dtype_df = float_dtype_df.astype({\"age\": \"float64\"})\n\n    es.replace_dataframe(dataframe_name=\"customers\", df=float_dtype_df)\n\n    assert es[\"customers\"][\"age\"].dtype == \"int64\"\n    assert isinstance(es[\"customers\"].ww.logical_types[\"age\"], Integer)\n\n    incompatible_dtype_df = es[\"customers\"].copy()\n    incompatible_list = [\"hi\", \"bye\", \"bye\"]\n    incompatible_dtype_df[\"age\"] = pd.Series(incompatible_list)\n\n    error_msg = \"Error converting datatype for age from type object to type int64. 
Please confirm the underlying data is consistent with logical type Integer.\"\n    with pytest.raises(TypeConversionError, match=error_msg):\n        es.replace_dataframe(dataframe_name=\"customers\", df=incompatible_dtype_df)\n\n\n@pytest.fixture()\ndef latlong_df():\n    latlong_df = pd.DataFrame(\n        {\n            \"tuples\": pd.Series([(1, 2), (3, 4)]),\n            \"string_tuple\": pd.Series([\"(1, 2)\", \"(3, 4)\"]),\n            \"bracketless_string_tuple\": pd.Series([\"1, 2\", \"3, 4\"]),\n            \"list_strings\": pd.Series([[\"1\", \"2\"], [\"3\", \"4\"]]),\n            \"combo_tuple_types\": pd.Series([\"[1, 2]\", \"(3, 4)\"]),\n        },\n    )\n    latlong_df.set_index(\"string_tuple\", drop=False, inplace=True)\n    latlong_df.index.name = None\n    return latlong_df\n\n\ndef test_replace_dataframe_data_transformation(latlong_df):\n    initial_df = latlong_df.copy()\n    initial_df.ww.init(\n        name=\"latlongs\",\n        index=\"string_tuple\",\n        logical_types={col_name: \"LatLong\" for col_name in initial_df.columns},\n    )\n    es = EntitySet()\n    es.add_dataframe(dataframe=initial_df)\n\n    df = es[\"latlongs\"]\n    expected_val = (1, 2)\n    for col in latlong_df.columns:\n        series = df[col]\n        assert series.iloc[0] == expected_val\n\n    es.replace_dataframe(\"latlongs\", latlong_df)\n    df = es[\"latlongs\"]\n    expected_val = (3, 4)\n    for col in latlong_df.columns:\n        series = df[col]\n        assert series.iloc[-1] == expected_val\n\n\ndef test_replace_dataframe_column_order(es):\n    original_column_order = es[\"customers\"].columns.copy()\n\n    df = es[\"customers\"].copy()\n    col = df.pop(\"cohort\")\n    df[col.name] = col\n\n    assert not df.columns.equals(original_column_order)\n    assert set(df.columns) == set(original_column_order)\n\n    es.replace_dataframe(dataframe_name=\"customers\", df=df)\n\n    assert es[\"customers\"].columns.equals(original_column_order)\n\n\ndef 
test_replace_dataframe_different_woodwork_initialized(es):\n    df = es[\"customers\"].copy()\n    df[\"age\"] = pd.Series([1, 2, 3])\n\n    # Initialize Woodwork on the new DataFrame and change the schema so it won't match the original DataFrame's schema\n    df.ww.init(schema=es[\"customers\"].ww.schema)\n    df.ww.set_types(\n        logical_types={\"id\": \"NaturalLanguage\", \"cancel_date\": \"NaturalLanguage\"},\n    )\n    assert df[\"id\"].dtype == \"string\"\n    assert df[\"cancel_date\"].dtype == \"string\"\n\n    assert es[\"customers\"][\"id\"].dtype == \"int64\"\n    assert es[\"customers\"][\"cancel_date\"].dtype == \"datetime64[ns]\"\n\n    original_schema = es[\"customers\"].ww.schema\n\n    warning = \"Woodwork typing information on new dataframe will be replaced with existing typing information from customers\"\n    with pytest.warns(UserWarning, match=warning):\n        es.replace_dataframe(\"customers\", df, already_sorted=True)\n\n    actual = es[\"customers\"][\"age\"].sort_values()\n    assert all(actual == [1, 2, 3])\n\n    assert es[\"customers\"].ww._schema == original_schema\n    assert es[\"customers\"][\"id\"].dtype == \"int64\"\n    assert es[\"customers\"][\"cancel_date\"].dtype == \"datetime64[ns]\"\n\n\ndef test_replace_dataframe_and_min_last_time_index(es):\n    es.add_last_time_indexes([\"products\"])\n\n    original_time_index = es[\"log\"][\"datetime\"].copy()\n    original_last_time_index = es[\"products\"][LTI_COLUMN_NAME].copy()\n\n    new_time_index = original_time_index + pd.Timedelta(days=1)\n    expected_last_time_index = original_last_time_index + pd.Timedelta(days=1)\n\n    new_dataframe = es[\"log\"].copy()\n    new_dataframe[\"datetime\"] = new_time_index\n    new_dataframe.pop(LTI_COLUMN_NAME)\n\n    es.replace_dataframe(\"log\", new_dataframe, recalculate_last_time_indexes=True)\n\n    pd.testing.assert_series_equal(\n        es[\"products\"][LTI_COLUMN_NAME].sort_index(),\n        
expected_last_time_index.sort_index(),\n    )\n    pd.testing.assert_series_equal(\n        es[\"log\"][LTI_COLUMN_NAME].sort_index(),\n        new_time_index.sort_index(),\n        check_names=False,\n    )\n\n\ndef test_replace_dataframe_dont_recalculate_last_time_index_present(es):\n    es.add_last_time_indexes()\n\n    original_time_index = es[\"customers\"][\"signup_date\"].copy()\n    original_last_time_index = es[\"customers\"][LTI_COLUMN_NAME].copy()\n\n    new_time_index = original_time_index + pd.Timedelta(days=10)\n\n    new_dataframe = es[\"customers\"].copy()\n    new_dataframe[\"signup_date\"] = new_time_index\n\n    es.replace_dataframe(\n        \"customers\",\n        new_dataframe,\n        recalculate_last_time_indexes=False,\n    )\n    pd.testing.assert_series_equal(\n        es[\"customers\"][LTI_COLUMN_NAME],\n        original_last_time_index,\n    )\n\n\ndef test_replace_dataframe_dont_recalculate_last_time_index_not_present(es):\n    es.add_last_time_indexes()\n    original_lti_name = es[\"customers\"].ww.metadata.get(\"last_time_index\")\n    assert original_lti_name is not None\n\n    original_time_index = es[\"customers\"][\"signup_date\"].copy()\n\n    new_time_index = original_time_index + pd.Timedelta(days=10)\n\n    new_dataframe = es[\"customers\"].copy()\n    new_dataframe[\"signup_date\"] = new_time_index\n    new_dataframe.pop(LTI_COLUMN_NAME)\n\n    es.replace_dataframe(\n        \"customers\",\n        new_dataframe,\n        recalculate_last_time_indexes=False,\n    )\n    assert \"last_time_index\" not in es[\"customers\"].ww.metadata\n    assert original_lti_name not in es[\"customers\"].columns\n\n\ndef test_replace_dataframe_recalculate_last_time_index_not_present(es):\n    es.add_last_time_indexes()\n\n    original_time_index = es[\"log\"][\"datetime\"].copy()\n\n    new_time_index = original_time_index + pd.Timedelta(days=10)\n\n    new_dataframe = es[\"log\"].copy()\n    new_dataframe[\"datetime\"] = new_time_index\n    
new_dataframe.pop(LTI_COLUMN_NAME)\n\n    es.replace_dataframe(\"log\", new_dataframe, recalculate_last_time_indexes=True)\n    pd.testing.assert_series_equal(\n        es[\"log\"][\"datetime\"].sort_index(),\n        new_time_index.sort_index(),\n        check_names=False,\n    )\n    pd.testing.assert_series_equal(\n        es[\"log\"][LTI_COLUMN_NAME].sort_index(),\n        new_time_index.sort_index(),\n        check_names=False,\n    )\n\n\ndef test_replace_dataframe_recalculate_last_time_index_present(es):\n    es.add_last_time_indexes()\n\n    original_time_index = es[\"log\"][\"datetime\"].copy()\n\n    new_time_index = original_time_index + pd.Timedelta(days=10)\n\n    new_dataframe = es[\"log\"].copy()\n    new_dataframe[\"datetime\"] = new_time_index\n    assert LTI_COLUMN_NAME in new_dataframe.columns\n\n    es.replace_dataframe(\"log\", new_dataframe, recalculate_last_time_indexes=True)\n    pd.testing.assert_series_equal(\n        es[\"log\"][\"datetime\"].sort_index(),\n        new_time_index.sort_index(),\n        check_names=False,\n    )\n    pd.testing.assert_series_equal(\n        es[\"log\"][LTI_COLUMN_NAME].sort_index(),\n        new_time_index.sort_index(),\n        check_names=False,\n    )\n\n\ndef test_normalize_dataframe_loses_column_metadata(es):\n    es[\"log\"].ww.columns[\"value\"].metadata[\"interesting_values\"] = [0.0, 1.0]\n    es[\"log\"].ww.columns[\"priority_level\"].metadata[\"interesting_values\"] = [1]\n\n    es[\"log\"].ww.columns[\"value\"].description = \"a value column\"\n    es[\"log\"].ww.columns[\"priority_level\"].description = \"a priority level column\"\n\n    assert \"interesting_values\" in es[\"log\"].ww.columns[\"priority_level\"].metadata\n    assert \"interesting_values\" in es[\"log\"].ww.columns[\"value\"].metadata\n    assert es[\"log\"].ww.columns[\"value\"].description == \"a value column\"\n    assert (\n        es[\"log\"].ww.columns[\"priority_level\"].description == \"a priority level column\"\n    
)\n\n    es.normalize_dataframe(\n        \"log\",\n        \"values_2\",\n        \"value_2\",\n        additional_columns=[\"priority_level\"],\n        copy_columns=[\"value\"],\n        make_time_index=False,\n    )\n\n    # Metadata in the original dataframe and the new dataframe are maintained\n    assert \"interesting_values\" in es[\"log\"].ww.columns[\"value\"].metadata\n    assert \"interesting_values\" in es[\"values_2\"].ww.columns[\"value\"].metadata\n    assert \"interesting_values\" in es[\"values_2\"].ww.columns[\"priority_level\"].metadata\n    assert es[\"log\"].ww.columns[\"value\"].description == \"a value column\"\n    assert es[\"values_2\"].ww.columns[\"value\"].description == \"a value column\"\n    assert (\n        es[\"values_2\"].ww.columns[\"priority_level\"].description\n        == \"a priority level column\"\n    )\n\n\ndef test_normalize_ww_init():\n    es = EntitySet()\n    df = pd.DataFrame(\n        {\n            \"id\": [1, 2, 3, 4],\n            \"col\": [\"a\", \"b\", \"c\", \"d\"],\n            \"df2_id\": [1, 1, 2, 2],\n            \"df2_col\": [True, False, True, True],\n        },\n    )\n\n    df.ww.init(index=\"id\", name=\"test_name\")\n    es.add_dataframe(dataframe=df)\n\n    assert es[\"test_name\"].ww.name == \"test_name\"\n    assert es[\"test_name\"].ww.schema.name == \"test_name\"\n\n    es.normalize_dataframe(\n        \"test_name\",\n        \"new_df\",\n        \"df2_id\",\n        additional_columns=[\"df2_col\"],\n    )\n\n    assert es[\"test_name\"].ww.name == \"test_name\"\n    assert es[\"test_name\"].ww.schema.name == \"test_name\"\n\n    assert es[\"new_df\"].ww.name == \"new_df\"\n    assert es[\"new_df\"].ww.schema.name == \"new_df\"\n"
  },
  {
    "path": "featuretools/tests/entry_point_tests/__init__.py",
    "content": ""
  },
  {
    "path": "featuretools/tests/entry_point_tests/add-ons/__init__.py",
    "content": ""
  },
  {
    "path": "featuretools/tests/entry_point_tests/add-ons/featuretools_plugin/__init__.py",
    "content": ""
  },
  {
    "path": "featuretools/tests/entry_point_tests/add-ons/featuretools_plugin/featuretools_plugin/__init__.py",
    "content": "raise NotImplementedError(\"plugin not implemented\")\n"
  },
  {
    "path": "featuretools/tests/entry_point_tests/add-ons/featuretools_plugin/setup.py",
    "content": "from setuptools import setup\n\nsetup(\n    name=\"featuretools_plugin\",\n    packages=[\"featuretools_plugin\"],\n    entry_points={\n        \"featuretools_plugin\": [\n            \"module = featuretools_plugin\",\n        ],\n    },\n)\n"
  },
  {
    "path": "featuretools/tests/entry_point_tests/add-ons/featuretools_primitives/__init__.py",
    "content": ""
  },
  {
    "path": "featuretools/tests/entry_point_tests/add-ons/featuretools_primitives/featuretools_primitives/__init__.py",
    "content": ""
  },
  {
    "path": "featuretools/tests/entry_point_tests/add-ons/featuretools_primitives/featuretools_primitives/existing_primitive.py",
    "content": "from featuretools.primitives.base import AggregationPrimitive\n\n\nclass Sum(AggregationPrimitive):\n    \"\"\"A primitive that should currently exist for testing.\"\"\"\n\n    pass\n"
  },
  {
    "path": "featuretools/tests/entry_point_tests/add-ons/featuretools_primitives/featuretools_primitives/invalid_primitive.py",
    "content": "raise NotImplementedError(\"invalid primitive\")\n"
  },
  {
    "path": "featuretools/tests/entry_point_tests/add-ons/featuretools_primitives/featuretools_primitives/new_primitive.py",
    "content": "from featuretools.primitives.base import TransformPrimitive\n\n\nclass NewPrimitive(TransformPrimitive):\n    \"\"\"A primitive that should not currently exist for testing.\"\"\"\n\n    pass\n"
  },
  {
    "path": "featuretools/tests/entry_point_tests/add-ons/featuretools_primitives/setup.py",
    "content": "from setuptools import find_packages, setup\n\nsetup(\n    name=\"featuretools_primitives\",\n    packages=find_packages(),\n    entry_points={\n        \"featuretools_primitives\": [\n            \"new = featuretools_primitives.new_primitive\",\n            \"invalid = featuretools_primitives.invalid_primitive\",\n            \"existing = featuretools_primitives.existing_primitive\",\n        ],\n    },\n)\n"
  },
  {
    "path": "featuretools/tests/entry_point_tests/test_plugin.py",
    "content": "from featuretools.tests.entry_point_tests.utils import (\n    _import_featuretools,\n    _install_featuretools_plugin,\n    _uninstall_featuretools_plugin,\n)\n\n\ndef test_plugin_warning():\n    _install_featuretools_plugin()\n    warning = _import_featuretools(\"warning\").stdout.decode()\n    debug = _import_featuretools(\"debug\").stdout.decode()\n    _uninstall_featuretools_plugin()\n\n    message = (\n        \"Featuretools failed to load plugin module from library featuretools_plugin\"\n    )\n    traceback = \"NotImplementedError: plugin not implemented\"\n\n    assert message in warning\n    assert traceback not in warning\n    assert message in debug\n    assert traceback in debug\n"
  },
  {
    "path": "featuretools/tests/entry_point_tests/test_primitives.py",
    "content": "from featuretools.tests.entry_point_tests.utils import (\n    _import_featuretools,\n    _install_featuretools_primitives,\n    _python,\n    _uninstall_featuretools_primitives,\n)\n\n\ndef test_entry_point():\n    _install_featuretools_primitives()\n    featuretools_log = _import_featuretools(\"debug\").stdout.decode()\n    new_primitive = _python(\"-c\", \"from featuretools.primitives import NewPrimitive\")\n    _uninstall_featuretools_primitives()\n    assert new_primitive.returncode == 0\n\n    invalid_primitive = 'Featuretools failed to load \"invalid\" primitives from \"featuretools_primitives.invalid_primitive\". '\n    invalid_primitive += \"For a full stack trace, set logging to debug.\"\n    assert invalid_primitive in featuretools_log\n\n    existing_primitive = 'While loading primitives via \"existing\" entry point, '\n    existing_primitive += 'ignored primitive \"Sum\" from \"featuretools_primitives.existing_primitive\" because a primitive '\n    existing_primitive += 'with that name already exists in \"featuretools.primitives.standard.aggregation.sum_primitive\"'\n    assert existing_primitive in featuretools_log\n"
  },
  {
    "path": "featuretools/tests/entry_point_tests/utils.py",
    "content": "import os\nimport subprocess\nimport sys\n\n\ndef _get_path_to_add_ons(*args):\n    # Resolve a path inside the add-ons directory that sits next to this file.\n    pwd = os.path.dirname(__file__)\n    return os.path.join(pwd, \"add-ons\", *args)\n\n\ndef _python(*args):\n    # Run the current interpreter with the given arguments and capture stdout.\n    command = [sys.executable, *args]\n    return subprocess.run(command, stdout=subprocess.PIPE)\n\n\ndef _install_featuretools_plugin():\n    # The editable install targets the working directory, so change into the add-on first.\n    os.chdir(_get_path_to_add_ons(\"featuretools_plugin\"))\n    return _python(\"-m\", \"pip\", \"install\", \"-e\", \".\")\n\n\ndef _uninstall_featuretools_plugin():\n    return _python(\"-m\", \"pip\", \"uninstall\", \"featuretools_plugin\", \"-y\")\n\n\ndef _install_featuretools_primitives():\n    # Same pattern as the plugin install: change into the add-on, then install it in editable mode.\n    os.chdir(_get_path_to_add_ons(\"featuretools_primitives\"))\n    return _python(\"-m\", \"pip\", \"install\", \"-e\", \".\")\n\n\ndef _uninstall_featuretools_primitives():\n    return _python(\"-m\", \"pip\", \"uninstall\", \"featuretools_primitives\", \"-y\")\n\n\ndef _import_featuretools(level=None):\n    # Import featuretools in a subprocess, optionally setting FEATURETOOLS_LOG_LEVEL first.\n    c = \"\"\n    if level:\n        c += \"import os;\"\n        c += 'os.environ[\"FEATURETOOLS_LOG_LEVEL\"] = \"%s\";' % level\n\n    c += \"import featuretools;\"\n    return _python(\"-c\", c)\n"
  },
  {
    "path": "featuretools/tests/feature_discovery/__init__.py",
    "content": ""
  },
  {
    "path": "featuretools/tests/feature_discovery/test_convertors.py",
    "content": "from woodwork.logical_types import Double, NaturalLanguage\n\nfrom featuretools.entityset.entityset import EntitySet\nfrom featuretools.feature_base.feature_base import (\n    FeatureBase,\n    IdentityFeature,\n    TransformFeature,\n)\nfrom featuretools.feature_discovery.convertors import (\n    _convert_feature_to_featurebase,\n    convert_feature_list_to_featurebase_list,\n    convert_featurebase_list_to_feature_list,\n)\nfrom featuretools.feature_discovery.feature_discovery import (\n    generate_features_from_primitives,\n    schema_to_features,\n)\nfrom featuretools.feature_discovery.LiteFeature import (\n    LiteFeature,\n)\nfrom featuretools.primitives import Absolute, AddNumeric, Lag\nfrom featuretools.synthesis import dfs\nfrom featuretools.tests.feature_discovery.test_feature_discovery import (\n    MultiOutputPrimitiveForTest,\n)\nfrom featuretools.tests.testing_utils.generate_fake_dataframe import (\n    generate_fake_dataframe,\n)\n\n\ndef test_convert_featurebase_list_to_feature_list():\n    col_defs = [\n        (\"idx\", \"Integer\", {\"index\"}),\n        (\"f_1\", \"Double\"),\n        (\"f_2\", \"Double\"),\n        (\"f_3\", \"NaturalLanguage\"),\n    ]\n\n    df = generate_fake_dataframe(\n        col_defs=col_defs,\n    )\n\n    es = EntitySet(id=\"es\")\n    es.add_dataframe(df, df.ww.name)\n\n    fdefs = dfs(\n        entityset=es,\n        target_dataframe_name=df.ww.name,\n        trans_primitives=[AddNumeric, MultiOutputPrimitiveForTest],\n        features_only=True,\n        max_depth=1,\n    )\n    assert isinstance(fdefs, list)\n    assert isinstance(fdefs[0], FeatureBase)\n\n    converted_features = set(convert_featurebase_list_to_feature_list(fdefs))\n\n    f1 = LiteFeature(\"f_1\", Double)\n    f2 = LiteFeature(\"f_2\", Double)\n    f3 = LiteFeature(\"f_3\", NaturalLanguage)\n    fadd = LiteFeature(\n        name=\"f_1 + f_2\",\n        tags={\"numeric\"},\n        primitive=AddNumeric(),\n        
base_features=[f1, f2],\n    )\n    fmo0 = LiteFeature(\n        name=\"TEST_MO(f_3)[0]\",\n        tags={\"numeric\"},\n        primitive=MultiOutputPrimitiveForTest(),\n        base_features=[f3],\n        idx=0,\n    )\n    fmo1 = LiteFeature(\n        name=\"TEST_MO(f_3)[1]\",\n        tags={\"numeric\"},\n        primitive=MultiOutputPrimitiveForTest(),\n        base_features=[f3],\n        idx=1,\n    )\n    fmo0.related_features = {fmo1}\n    fmo1.related_features = {fmo0}\n\n    orig_features = set([f1, f2, fadd, fmo0, fmo1])\n\n    assert len(orig_features.symmetric_difference(converted_features)) == 0\n\n\ndef test_origin_feature_to_featurebase():\n    df = generate_fake_dataframe(\n        col_defs=[(\"idx\", \"Double\", {\"index\"}), (\"f_1\", \"Double\")],\n    )\n    es = EntitySet(id=\"test\")\n    es.add_dataframe(df, df.ww.name)\n\n    origin_features = schema_to_features(df.ww.schema)\n    f_1 = [f for f in origin_features if f.name == \"f_1\"][0]\n    fb = _convert_feature_to_featurebase(f_1, df, {})\n\n    assert isinstance(fb, IdentityFeature)\n    assert fb.get_name() == \"f_1\"\n\n    f_1.set_alias(\"new name\")\n    df.ww.rename({\"f_1\": \"new name\"}, inplace=True)\n    fb = _convert_feature_to_featurebase(f_1, df, {})\n\n    assert isinstance(fb, IdentityFeature)\n    assert fb.get_name() == \"new name\"\n\n\ndef test_stacked_feature_to_featurebase():\n    df = generate_fake_dataframe(\n        col_defs=[(\"idx\", \"Double\", {\"index\"}), (\"f_1\", \"Double\")],\n    )\n    es = EntitySet(id=\"test\")\n    es.add_dataframe(df, df.ww.name)\n\n    origin_features = schema_to_features(df.ww.schema)\n    f_1 = [f for f in origin_features if f.name == \"f_1\"][0]\n    features = generate_features_from_primitives([f_1], [Absolute()])\n\n    f_2 = [f for f in features if f.name == \"ABSOLUTE(f_1)\"][0]\n\n    fb = _convert_feature_to_featurebase(f_2, df, {})\n\n    assert isinstance(fb, TransformFeature)\n    assert fb.get_name() == 
\"ABSOLUTE(f_1)\"\n    assert len(fb.base_features) == 1\n    assert fb.base_features[0].get_name() == \"f_1\"\n\n    f_2.set_alias(\"f_2\")\n    fb = _convert_feature_to_featurebase(f_2, df, {})\n\n    assert isinstance(fb, TransformFeature)\n    assert fb.get_name() == \"f_2\"\n    assert len(fb.base_features) == 1\n    assert fb.base_features[0].get_name() == \"f_1\"\n\n\ndef test_multi_output_to_featurebase():\n    df = generate_fake_dataframe(\n        col_defs=[\n            (\"idx\", \"Double\", {\"index\"}),\n            (\"f_1\", \"NaturalLanguage\"),\n        ],\n    )\n    es = EntitySet(id=\"test\")\n    es.add_dataframe(df, df.ww.name)\n\n    origin_features = schema_to_features(df.ww.schema)\n    f_1 = [f for f in origin_features if f.name == \"f_1\"][0]\n    features = generate_features_from_primitives([f_1], [MultiOutputPrimitiveForTest()])\n\n    lsa_features = [f for f in features if f.get_primitive_name() == \"test_mo\"]\n    assert len(lsa_features) == 2\n\n    # Test Single LiteFeature\n    fb = _convert_feature_to_featurebase(lsa_features[0], df, {})\n    assert isinstance(fb, TransformFeature)\n    assert fb.get_name() == \"TEST_MO(f_1)\"\n    assert len(fb.base_features) == 1\n    assert set(fb.get_feature_names()) == set([\"TEST_MO(f_1)[0]\", \"TEST_MO(f_1)[1]\"])\n    assert fb.base_features[0].get_name() == \"f_1\"\n\n    # Test that feature gets consolidated\n    fb_list = convert_feature_list_to_featurebase_list(lsa_features, df)\n    assert len(fb_list) == 1\n    assert fb_list[0].get_name() == \"TEST_MO(f_1)\"\n    assert len(fb_list[0].base_features) == 1\n    assert set(fb_list[0].get_feature_names()) == set(\n        [\"TEST_MO(f_1)[0]\", \"TEST_MO(f_1)[1]\"],\n    )\n    assert fb_list[0].base_features[0].get_name() == \"f_1\"\n\n    lsa_features[0].set_alias(\"f_2\")\n    lsa_features[1].set_alias(\"f_3\")\n\n    fb = _convert_feature_to_featurebase(lsa_features[0], df, {})\n    assert isinstance(fb, TransformFeature)\n    assert 
len(fb.base_features) == 1\n    assert set(fb.get_feature_names()) == set([\"f_2\", \"f_3\"])\n    assert fb.base_features[0].get_name() == \"f_1\"\n\n    # Test that feature gets consolidated\n    fb_list = convert_feature_list_to_featurebase_list(lsa_features, df)\n    assert len(fb_list) == 1\n    assert len(fb_list[0].base_features) == 1\n    assert set(fb_list[0].get_feature_names()) == set([\"f_2\", \"f_3\"])\n    assert fb_list[0].base_features[0].get_name() == \"f_1\"\n\n\ndef test_stacking_on_multioutput_to_featurebase():\n    col_defs = [\n        (\"idx\", \"Double\", {\"index\"}),\n        (\"t_idx\", \"Datetime\", {\"time_index\"}),\n        (\"f_1\", \"NaturalLanguage\"),\n    ]\n    df = generate_fake_dataframe(\n        col_defs=col_defs,\n    )\n    es = EntitySet(id=\"test\")\n    es.add_dataframe(df, df.ww.name)\n\n    origin_features = schema_to_features(df.ww.schema)\n    time_index_feature = [f for f in origin_features if f.name == \"t_idx\"][0]\n    f_1 = [f for f in origin_features if f.name == \"f_1\"][0]\n\n    features = generate_features_from_primitives([f_1], [MultiOutputPrimitiveForTest()])\n    lsa_features = [f for f in features if f.get_primitive_name() == \"test_mo\"]\n    assert len(lsa_features) == 2\n\n    features = generate_features_from_primitives(\n        lsa_features + [time_index_feature],\n        [Lag(periods=2)],\n    )\n    lag_features = [f for f in features if f.get_primitive_name() == \"lag\"]\n    assert len(lag_features) == 2\n\n    fb_list = convert_feature_list_to_featurebase_list(lag_features, df)\n\n    assert len(fb_list) == 2\n    assert isinstance(fb_list[0], TransformFeature)\n    assert set([x.get_name() for x in fb_list]) == set(\n        [\n            \"LAG(TEST_MO(f_1)[0], t_idx, periods=2)\",\n            \"LAG(TEST_MO(f_1)[1], t_idx, periods=2)\",\n        ],\n    )\n\n    lsa_features[0].set_alias(\"f_2\")\n    lsa_features[1].set_alias(\"f_3\")\n    features = generate_features_from_primitives(\n 
       lsa_features + [time_index_feature],\n        [Lag(periods=2)],\n    )\n    lag_features = [f for f in features if f.get_primitive_name() == \"lag\"]\n    assert len(lag_features) == 2\n\n    fb_list = convert_feature_list_to_featurebase_list(lag_features, df)\n    assert len(fb_list) == 2\n    assert isinstance(fb_list[0], TransformFeature)\n    assert set([x.get_name() for x in fb_list]) == set(\n        [\"LAG(f_2, t_idx, periods=2)\", \"LAG(f_3, t_idx, periods=2)\"],\n    )\n"
  },
  {
    "path": "featuretools/tests/feature_discovery/test_feature_collection.py",
    "content": "import pytest\nfrom woodwork.logical_types import (\n    Boolean,\n    Double,\n    Ordinal,\n)\n\nfrom featuretools.feature_discovery.FeatureCollection import FeatureCollection\nfrom featuretools.feature_discovery.LiteFeature import LiteFeature\nfrom featuretools.primitives import Absolute, AddNumeric\n\n\n@pytest.mark.parametrize(\n    \"feature_args, expected\",\n    [\n        (\n            (\"idx\", Double),\n            [\"ANY\", \"Double\", \"Double,numeric\", \"numeric\"],\n        ),\n        (\n            (\"idx\", Double, {\"index\"}),\n            [\"ANY\", \"Double\", \"Double,index\", \"index\"],\n        ),\n        (\n            (\"idx\", Double, {\"other\"}),\n            [\n                \"ANY\",\n                \"Double\",\n                \"other\",\n                \"numeric\",\n                \"Double,other\",\n                \"Double,numeric\",\n                \"numeric,other\",\n                \"Double,numeric,other\",\n            ],\n        ),\n        (\n            (\"idx\", Ordinal, {\"other\"}),\n            [\n                \"ANY\",\n                \"Ordinal\",\n                \"other\",\n                \"category\",\n                \"Ordinal,other\",\n                \"Ordinal,category\",\n                \"category,other\",\n                \"Ordinal,category,other\",\n            ],\n        ),\n        (\n            (\"idx\", Double, {\"a\", \"b\", \"numeric\"}),\n            [\n                \"ANY\",\n                \"Double\",\n                \"a\",\n                \"b\",\n                \"numeric\",\n                \"Double,a\",\n                \"Double,b\",\n                \"Double,numeric\",\n                \"a,b\",\n                \"a,numeric\",\n                \"b,numeric\",\n                \"a,b,numeric\",\n                \"Double,a,b\",\n                \"Double,a,numeric\",\n                \"Double,b,numeric\",\n                \"Double,a,b,numeric\",\n            ],\n     
   ),\n    ],\n)\ndef test_to_keys_method(feature_args, expected):\n    feature = LiteFeature(*feature_args)\n\n    keys = FeatureCollection.feature_to_keys(feature)\n\n    assert set(keys) == set(expected)\n\n\ndef test_feature_collection_hashing():\n    f1 = LiteFeature(name=\"f1\", logical_type=Double)\n    f2 = LiteFeature(name=\"f2\", logical_type=Double, tags={\"index\"})\n    f3 = LiteFeature(name=\"f3\", logical_type=Boolean, tags={\"other\"})\n    f4 = LiteFeature(name=\"f4\", primitive=Absolute(), base_features=[f1])\n    f5 = LiteFeature(name=\"f5\", primitive=AddNumeric(), base_features=[f1, f2])\n\n    fc1 = FeatureCollection([f1, f2, f3, f4, f5])\n    fc2 = FeatureCollection([f1, f2, f3, f4, f5])\n\n    assert len(set([fc1, fc2])) == 1\n\n    fc1.reindex()\n    assert fc1.get_by_logical_type(Double) == set([f1, f2])\n\n    assert fc1.get_by_tag(\"index\") == set([f2])\n\n    assert fc1.get_by_origin_feature(f1) == set([f1, f4, f5])\n\n    assert fc1.get_dependencies_by_origin_name(\"f1\") == set([f1, f4, f5])\n\n    assert fc1.get_dependencies_by_origin_name(\"null\") == set()\n\n    assert fc1.get_by_origin_feature_name(\"f1\") == f1\n\n    assert fc1.get_by_origin_feature_name(\"null\") is None\n"
  },
  {
    "path": "featuretools/tests/feature_discovery/test_feature_discovery.py",
    "content": "from unittest.mock import patch\n\nimport pytest\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import (\n    Boolean,\n    BooleanNullable,\n    Datetime,\n    Double,\n    NaturalLanguage,\n    Ordinal,\n)\n\nfrom featuretools.entityset.entityset import EntitySet\nfrom featuretools.feature_discovery.feature_discovery import (\n    _get_features,\n    _get_matching_features,\n    _index_column_set,\n    generate_features_from_primitives,\n    schema_to_features,\n)\nfrom featuretools.feature_discovery.FeatureCollection import FeatureCollection\nfrom featuretools.feature_discovery.LiteFeature import (\n    LiteFeature,\n)\nfrom featuretools.feature_discovery.utils import column_schema_to_keys\nfrom featuretools.primitives import (\n    Absolute,\n    AddNumeric,\n    Count,\n    DateFirstEvent,\n    Equal,\n    Lag,\n    MultiplyNumericBoolean,\n    NumUnique,\n    TransformPrimitive,\n)\nfrom featuretools.primitives.utils import get_transform_primitives\nfrom featuretools.synthesis import dfs\nfrom featuretools.tests.testing_utils.generate_fake_dataframe import (\n    generate_fake_dataframe,\n)\n\nDEFAULT_LT_FOR_TAG = {\n    \"category\": Ordinal,\n    \"numeric\": Double,\n    \"time_index\": Datetime,\n}\n\n\nclass MultiOutputPrimitiveForTest(TransformPrimitive):\n    name = \"test_mo\"\n    input_types = [ColumnSchema(logical_type=NaturalLanguage)]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n    number_output_features = 2\n\n\nclass DoublePrimitiveForTest(TransformPrimitive):\n    name = \"test_double\"\n    input_types = [ColumnSchema(logical_type=Double)]\n    return_type = ColumnSchema(logical_type=Double)\n\n\n@pytest.mark.parametrize(\n    \"column_schema, expected\",\n    [\n        (ColumnSchema(logical_type=Double), \"Double\"),\n        (ColumnSchema(semantic_tags={\"index\"}), \"index\"),\n        (\n            ColumnSchema(logical_type=Double, semantic_tags={\"index\", \"other\"}),\n  
          \"Double,index,other\",\n        ),\n    ],\n)\ndef test_column_schema_to_keys(column_schema, expected):\n    actual = column_schema_to_keys(column_schema)\n    assert set(actual) == set(expected)\n\n\n@pytest.mark.parametrize(\n    \"column_list, expected\",\n    [\n        ([ColumnSchema(logical_type=Boolean)], [(\"Boolean\", 1)]),\n        ([ColumnSchema()], [(\"ANY\", 1)]),\n        (\n            [\n                ColumnSchema(logical_type=Boolean),\n                ColumnSchema(logical_type=Boolean),\n            ],\n            [(\"Boolean\", 2)],\n        ),\n    ],\n)\ndef test_index_input_set(column_list, expected):\n    actual = _index_column_set(column_list)\n\n    assert actual == expected\n\n\n@pytest.mark.parametrize(\n    \"feature_args, input_set, commutative, expected\",\n    [\n        (\n            [(\"f1\", Boolean), (\"f2\", Boolean), (\"f3\", Boolean)],\n            [ColumnSchema(logical_type=Boolean)],\n            False,\n            [[\"f1\"], [\"f2\"], [\"f3\"]],\n        ),\n        (\n            [(\"f1\", Boolean), (\"f2\", Boolean)],\n            [ColumnSchema(logical_type=Boolean), ColumnSchema(logical_type=Boolean)],\n            False,\n            [[\"f1\", \"f2\"], [\"f2\", \"f1\"]],\n        ),\n        (\n            [(\"f1\", Boolean), (\"f2\", Boolean)],\n            [ColumnSchema(logical_type=Boolean), ColumnSchema(logical_type=Boolean)],\n            True,\n            [[\"f1\", \"f2\"]],\n        ),\n        (\n            [(\"f1\", Datetime, {\"time_index\"})],\n            [ColumnSchema(logical_type=Datetime, semantic_tags={\"time_index\"})],\n            False,\n            [[\"f1\"]],\n        ),\n        (\n            [(\"f1\", Double, {\"other\", \"index\"})],\n            [ColumnSchema(logical_type=Double, semantic_tags={\"index\", \"other\"})],\n            False,\n            [[\"f1\"]],\n        ),\n        (\n            [\n                (\"f1\", Double),\n                (\"f2\", Boolean),\n      
          (\"f3\", Double),\n                (\"f4\", Boolean),\n                (\"f5\", Double),\n            ],\n            [\n                ColumnSchema(logical_type=Double),\n                ColumnSchema(logical_type=Double),\n                ColumnSchema(logical_type=Boolean),\n            ],\n            True,\n            [\n                [\"f1\", \"f3\", \"f2\"],\n                [\"f1\", \"f3\", \"f4\"],\n                [\"f1\", \"f5\", \"f2\"],\n                [\"f1\", \"f5\", \"f4\"],\n                [\"f3\", \"f5\", \"f2\"],\n                [\"f3\", \"f5\", \"f4\"],\n            ],\n        ),\n    ],\n)\n@patch.object(LiteFeature, \"_generate_hash\", lambda x: x.name)\ndef test_get_features(feature_args, input_set, commutative, expected):\n    features = [LiteFeature(*args) for args in feature_args]\n    feature_collection = FeatureCollection(features).reindex()\n\n    column_keys = _index_column_set(input_set)\n    actual = _get_features(feature_collection, tuple(column_keys), commutative)\n\n    assert set([tuple([y.id for y in x]) for x in actual]) == set(\n        [tuple(x) for x in expected],\n    )\n\n\n@pytest.mark.parametrize(\n    \"feature_args, primitive, expected\",\n    [\n        (\n            [(\"f1\", Double), (\"f2\", Double), (\"f3\", Double)],\n            AddNumeric,\n            [[\"f1\", \"f2\"], [\"f1\", \"f3\"], [\"f2\", \"f3\"]],\n        ),\n        (\n            [(\"f1\", Boolean), (\"f2\", Boolean), (\"f3\", Boolean)],\n            AddNumeric,\n            [],\n        ),\n        (\n            [(\"f7\", Double), (\"f8\", Boolean)],\n            MultiplyNumericBoolean,\n            [[\"f7\", \"f8\"]],\n        ),\n        (\n            [(\"f9\", Datetime)],\n            DateFirstEvent,\n            [],\n        ),\n        (\n            [(\"f10\", Datetime, {\"time_index\"})],\n            DateFirstEvent,\n            [[\"f10\"]],\n        ),\n        (\n            [(\"f11\", Datetime, {\"time_index\"}), 
(\"f12\", Double)],\n            NumUnique,\n            [],\n        ),\n        (\n            [(\"f13\", Datetime, {\"time_index\"}), (\"f14\", Double), (\"f15\", Ordinal)],\n            NumUnique,\n            [[\"f15\"]],\n        ),\n        (\n            [(\"f16\", Datetime, {\"time_index\"}), (\"f17\", Double), (\"f18\", Ordinal)],\n            Equal,\n            [[\"f16\", \"f17\"], [\"f16\", \"f18\"], [\"f17\", \"f18\"]],\n        ),\n        (\n            [\n                (\"t_idx\", Datetime, {\"time_index\"}),\n                (\"f19\", Ordinal),\n                (\"f20\", Double),\n                (\"f21\", Boolean),\n                (\"f22\", BooleanNullable),\n            ],\n            Lag,\n            [[\"f19\", \"t_idx\"], [\"f20\", \"t_idx\"], [\"f21\", \"t_idx\"], [\"f22\", \"t_idx\"]],\n        ),\n        (\n            [\n                (\"idx\", Double, {\"index\"}),\n                (\"f23\", Double),\n            ],\n            Count,\n            [[\"idx\"]],\n        ),\n        (\n            [\n                (\"idx\", Double, {\"index\"}),\n                (\"f23\", Double),\n            ],\n            AddNumeric,\n            [],\n        ),\n    ],\n)\n@patch.object(LiteFeature, \"__lt__\", lambda x, y: x.name < y.name)\ndef test_get_matching_features(feature_args, primitive, expected):\n    features = [LiteFeature(*args) for args in feature_args]\n    feature_collection = FeatureCollection(features).reindex()\n    actual = _get_matching_features(feature_collection, primitive())\n    assert [[y.name for y in x] for x in actual] == expected\n\n\n@pytest.mark.parametrize(\n    \"col_defs, primitives, expected\",\n    [\n        (\n            [\n                (\"f_1\", \"Double\"),\n                (\"f_2\", \"Double\"),\n                (\"f_3\", \"Boolean\"),\n                (\"f_4\", \"Double\"),\n            ],\n            [AddNumeric],\n            {\"f_1 + f_2\", \"f_1 + f_4\", \"f_2 + f_4\"},\n        ),\n       
 (\n            [\n                (\"f_1\", \"Double\"),\n                (\"f_2\", \"Double\"),\n            ],\n            [Absolute],\n            {\"ABSOLUTE(f_1)\", \"ABSOLUTE(f_2)\"},\n        ),\n    ],\n)\n@patch.object(LiteFeature, \"__lt__\", lambda x, y: x.name < y.name)\ndef test_generate_features_from_primitives(col_defs, primitives, expected):\n    input_feature_names = set([x[0] for x in col_defs])\n    df = generate_fake_dataframe(\n        col_defs=col_defs,\n    )\n\n    origin_features = schema_to_features(df.ww.schema)\n    features = generate_features_from_primitives(origin_features, primitives)\n\n    new_feature_names = set([x.name for x in features]) - input_feature_names\n    assert new_feature_names == expected\n\n\nALL_TRANSFORM_PRIMITIVES = list(get_transform_primitives().values())\n\n\n@pytest.mark.parametrize(\n    \"col_defs, primitives\",\n    [\n        (\n            [\n                (\"idx\", \"Double\", {\"index\"}),\n                (\"t_idx\", \"Datetime\", {\"time_index\"}),\n                (\"f_3\", \"Boolean\"),\n                (\"f_4\", \"Boolean\"),\n                (\"f_5\", \"BooleanNullable\"),\n                (\"f_6\", \"BooleanNullable\"),\n                (\"f_7\", \"Categorical\"),\n                (\"f_8\", \"Categorical\"),\n                (\"f_9\", \"Datetime\"),\n                (\"f_10\", \"Datetime\"),\n                (\"f_11\", \"Double\"),\n                (\"f_12\", \"Double\"),\n                (\"f_13\", \"Integer\"),\n                (\"f_14\", \"Integer\"),\n                (\"f_15\", \"IntegerNullable\"),\n                (\"f_16\", \"IntegerNullable\"),\n                (\"f_17\", \"EmailAddress\"),\n                (\"f_18\", \"EmailAddress\"),\n                (\"f_19\", \"LatLong\"),\n                (\"f_20\", \"LatLong\"),\n                (\"f_21\", \"NaturalLanguage\"),\n                (\"f_22\", \"NaturalLanguage\"),\n                (\"f_23\", \"Ordinal\"),\n                
(\"f_24\", \"Ordinal\"),\n                (\"f_25\", \"URL\"),\n                (\"f_26\", \"URL\"),\n                (\"f_27\", \"PostalCode\"),\n                (\"f_28\", \"PostalCode\"),\n            ],\n            ALL_TRANSFORM_PRIMITIVES,\n        ),\n    ],\n)\n@patch.object(LiteFeature, \"_generate_hash\", lambda x: x.name)\ndef test_compare_dfs(col_defs, primitives):\n    input_feature_names = set([x[0] for x in col_defs])\n    df = generate_fake_dataframe(\n        col_defs=col_defs,\n    )\n\n    es = EntitySet(id=\"test\")\n    es.add_dataframe(df, \"df\")\n\n    features_old = dfs(\n        entityset=es,\n        target_dataframe_name=\"df\",\n        trans_primitives=primitives,\n        features_only=True,\n        return_types=\"all\",\n        max_depth=1,\n    )\n\n    origin_features = schema_to_features(df.ww.schema)\n    features = generate_features_from_primitives(origin_features, primitives)\n\n    feature_names_old = set([x.get_name() for x in features_old]) - input_feature_names  # type: ignore\n\n    feature_names_new = set([x.name for x in features]) - input_feature_names\n    assert feature_names_old == feature_names_new\n\n\ndef test_generate_features_from_primitives_inputs():\n    f1 = LiteFeature(\"f1\", Double)\n    with pytest.raises(\n        ValueError,\n        match=\"input_features must be an iterable of LiteFeature objects\",\n    ):\n        generate_features_from_primitives(f1, [Absolute])\n\n    with pytest.raises(\n        ValueError,\n        match=\"input_features must be an iterable of LiteFeature objects\",\n    ):\n        generate_features_from_primitives([f1, \"other\"], [Absolute])\n\n    with pytest.raises(\n        ValueError,\n        match=\"primitives must be a list of Primitive classes or Primitive instances\",\n    ):\n        generate_features_from_primitives([f1], [\"absolute\"])\n\n    with pytest.raises(\n        ValueError,\n        match=\"primitives must be a list of Primitive classes or Primitive 
instances\",\n    ):\n        generate_features_from_primitives([f1], Absolute)\n"
  },
  {
    "path": "featuretools/tests/feature_discovery/test_type_defs.py",
    "content": "import json\nfrom unittest.mock import patch\n\nimport pytest\nfrom woodwork.logical_types import Boolean, Double\n\nfrom featuretools.feature_discovery.feature_discovery import (\n    generate_features_from_primitives,\n    schema_to_features,\n)\nfrom featuretools.feature_discovery.FeatureCollection import FeatureCollection\nfrom featuretools.feature_discovery.LiteFeature import LiteFeature\nfrom featuretools.primitives import (\n    Absolute,\n    AddNumeric,\n    DivideNumeric,\n    Lag,\n    MultiplyNumeric,\n)\nfrom featuretools.tests.feature_discovery.test_feature_discovery import (\n    MultiOutputPrimitiveForTest,\n)\nfrom featuretools.tests.testing_utils.generate_fake_dataframe import (\n    generate_fake_dataframe,\n)\n\n\ndef test_feature_type_equality():\n    f1 = LiteFeature(\"f1\", Double)\n    f2 = LiteFeature(\"f2\", Double)\n\n    # Add Numeric is Commutative, so should all be equal\n    f3 = LiteFeature(\n        name=\"Column 1\",\n        primitive=AddNumeric(),\n        logical_type=Double,\n        base_features=[f1, f2],\n    )\n\n    f4 = LiteFeature(\n        name=\"Column 10\",\n        primitive=AddNumeric(),\n        logical_type=Double,\n        base_features=[f1, f2],\n    )\n\n    f5 = LiteFeature(\n        name=\"Column 20\",\n        primitive=AddNumeric(),\n        logical_type=Double,\n        base_features=[f2, f1],\n    )\n\n    assert f3 == f4 == f5\n\n    # Divide Numeric is not Commutative, so should not be equal\n    f6 = LiteFeature(\n        name=\"Column 1\",\n        primitive=DivideNumeric(),\n        logical_type=Double,\n        base_features=[f1, f2],\n    )\n\n    f7 = LiteFeature(\n        name=\"Column 1\",\n        primitive=DivideNumeric(),\n        logical_type=Double,\n        base_features=[f2, f1],\n    )\n\n    assert f6 != f7\n\n\ndef test_feature_type_assertions():\n    with pytest.raises(\n        ValueError,\n        match=\"there must be base features if given a primitive\",\n    ):\n  
      LiteFeature(\n            name=\"Column 1\",\n            primitive=AddNumeric(),\n            logical_type=Double,\n        )\n\n\n@patch.object(LiteFeature, \"_generate_hash\", lambda x: x.name)\n@patch(\n    \"featuretools.feature_discovery.LiteFeature.hash_primitive\",\n    lambda x: (x.name, None),\n)\ndef test_feature_to_dict():\n    f1 = LiteFeature(\"f1\", Double)\n    f2 = LiteFeature(\"f2\", Double)\n    f = LiteFeature(\n        name=\"Column 1\",\n        primitive=AddNumeric(),\n        base_features=[f1, f2],\n    )\n\n    expected = {\n        \"name\": \"Column 1\",\n        \"logical_type\": None,\n        \"tags\": [\"numeric\"],\n        \"primitive\": \"add_numeric\",\n        \"base_features\": [\"f1\", \"f2\"],\n        \"df_id\": None,\n        \"id\": \"Column 1\",\n        \"related_features\": [],\n        \"idx\": 0,\n    }\n\n    actual = f.to_dict()\n    json_str = json.dumps(actual)\n    assert actual == expected\n    assert json.dumps(expected) == json_str\n\n\ndef test_feature_hash():\n    bf1 = LiteFeature(\"bf\", Double)\n    bf2 = LiteFeature(\"bf\", Double, df_id=\"df\")\n\n    p1 = Lag(periods=1)\n    p2 = Lag(periods=2)\n    f1 = LiteFeature(\n        primitive=p1,\n        logical_type=Double,\n        base_features=[bf1],\n    )\n\n    f2 = LiteFeature(\n        primitive=p2,\n        logical_type=Double,\n        base_features=[bf1],\n    )\n\n    f3 = LiteFeature(\n        primitive=p2,\n        logical_type=Double,\n        base_features=[bf1],\n    )\n\n    f4 = LiteFeature(\n        primitive=p1,\n        logical_type=Double,\n        base_features=[bf2],\n    )\n\n    # TODO(dreed): ensure ID is parquet and arrow acceptable, length and starting character might be problematic\n\n    assert f1 != f2\n    assert f2 == f3\n    assert f1 != f4\n\n\ndef test_feature_forced_name():\n    bf = LiteFeature(\"bf\", Double)\n\n    p1 = Lag(periods=1)\n    f1 = LiteFeature(\n        name=\"target_delay_1\",\n        
primitive=p1,\n        logical_type=Double,\n        base_features=[bf],\n    )\n    assert f1.name == \"target_delay_1\"\n\n\n@patch.object(LiteFeature, \"_generate_hash\", lambda x: x.name)\n@patch(\n    \"featuretools.feature_discovery.FeatureCollection.hash_primitive\",\n    lambda x: (x.name, None),\n)\n@patch(\n    \"featuretools.feature_discovery.LiteFeature.hash_primitive\",\n    lambda x: (x.name, None),\n)\ndef test_feature_collection_to_dict():\n    f1 = LiteFeature(\"f1\", Double)\n    f2 = LiteFeature(\"f2\", Double)\n    f3 = LiteFeature(\n        name=\"Column 1\",\n        primitive=AddNumeric(),\n        base_features=[f1, f2],\n    )\n\n    fc = FeatureCollection([f3])\n\n    expected = {\n        \"primitives\": {\n            \"add_numeric\": None,\n        },\n        \"feature_ids\": [\"Column 1\"],\n        \"all_features\": {\n            \"Column 1\": {\n                \"name\": \"Column 1\",\n                \"logical_type\": None,\n                \"tags\": [\"numeric\"],\n                \"primitive\": \"add_numeric\",\n                \"base_features\": [\"f1\", \"f2\"],\n                \"df_id\": None,\n                \"id\": \"Column 1\",\n                \"related_features\": [],\n                \"idx\": 0,\n            },\n            \"f1\": {\n                \"name\": \"f1\",\n                \"logical_type\": \"Double\",\n                \"tags\": [\"numeric\"],\n                \"primitive\": None,\n                \"base_features\": [],\n                \"df_id\": None,\n                \"id\": \"f1\",\n                \"related_features\": [],\n                \"idx\": 0,\n            },\n            \"f2\": {\n                \"name\": \"f2\",\n                \"logical_type\": \"Double\",\n                \"tags\": [\"numeric\"],\n                \"primitive\": None,\n                \"base_features\": [],\n                \"df_id\": None,\n                \"id\": \"f2\",\n                \"related_features\": [],\n     
           \"idx\": 0,\n            },\n        },\n    }\n\n    actual = fc.to_dict()\n    assert actual == expected\n    assert json.dumps(expected, sort_keys=True) == json.dumps(actual, sort_keys=True)\n\n\n@patch.object(LiteFeature, \"_generate_hash\", lambda x: x.name)\ndef test_feature_collection_from_dict():\n    f1 = LiteFeature(\"f1\", Double)\n    f2 = LiteFeature(\"f2\", Double)\n    f3 = LiteFeature(\n        name=\"Column 1\",\n        primitive=AddNumeric(),\n        base_features=[f1, f2],\n    )\n\n    expected = FeatureCollection([f3])\n\n    input_dict = {\n        \"primitives\": {\n            \"009da67f0a1430630c4a419c84aac270ec62337ab20c080e4495272950fd03b3\": {\n                \"type\": \"AddNumeric\",\n                \"module\": \"featuretools.primitives.standard.transform.binary.add_numeric\",\n                \"arguments\": {},\n            },\n        },\n        \"feature_ids\": [\"Column 1\"],\n        \"all_features\": {\n            \"f2\": {\n                \"name\": \"f2\",\n                \"logical_type\": \"Double\",\n                \"tags\": [\"numeric\"],\n                \"primitive\": None,\n                \"base_features\": [],\n                \"df_id\": None,\n                \"id\": \"f2\",\n                \"related_features\": [],\n                \"idx\": 0,\n            },\n            \"f1\": {\n                \"name\": \"f1\",\n                \"logical_type\": \"Double\",\n                \"tags\": [\"numeric\"],\n                \"primitive\": None,\n                \"base_features\": [],\n                \"df_id\": None,\n                \"id\": \"f1\",\n                \"related_features\": [],\n                \"idx\": 0,\n            },\n            \"Column 1\": {\n                \"name\": \"Column 1\",\n                \"logical_type\": None,\n                \"tags\": [\"numeric\"],\n                \"primitive\": \"009da67f0a1430630c4a419c84aac270ec62337ab20c080e4495272950fd03b3\",\n                
\"base_features\": [\"f1\", \"f2\"],\n                \"df_id\": None,\n                \"id\": \"Column 1\",\n                \"related_features\": [],\n                \"idx\": 0,\n            },\n        },\n    }\n\n    actual = FeatureCollection.from_dict(input_dict)\n\n    assert actual == expected\n\n\n@patch.object(LiteFeature, \"__lt__\", lambda x, y: x.name < y.name)\ndef test_feature_collection_serialization_roundtrip():\n    col_defs = [\n        (\"idx\", \"Integer\", {\"index\"}),\n        (\"t_idx\", \"Datetime\", {\"time_index\"}),\n        (\"f_1\", \"Double\"),\n        (\"f_2\", \"Double\"),\n        (\"f_3\", \"Categorical\"),\n        (\"f_4\", \"Boolean\"),\n        (\"f_5\", \"NaturalLanguage\"),\n    ]\n\n    df = generate_fake_dataframe(\n        col_defs=col_defs,\n    )\n\n    origin_features = schema_to_features(df.ww.schema)\n    features = generate_features_from_primitives(\n        origin_features,\n        [Absolute, MultiplyNumeric, MultiOutputPrimitiveForTest],\n    )\n\n    features = generate_features_from_primitives(features, [Lag])\n\n    assert set([x.name for x in features]) == set(\n        [\n            \"idx\",\n            \"t_idx\",\n            \"f_1\",\n            \"f_2\",\n            \"f_3\",\n            \"f_4\",\n            \"f_5\",\n            \"ABSOLUTE(f_1)\",\n            \"ABSOLUTE(f_2)\",\n            \"f_1 * f_2\",\n            \"TEST_MO(f_5)[0]\",\n            \"TEST_MO(f_5)[1]\",\n            \"LAG(f_1, t_idx)\",\n            \"LAG(f_2, t_idx)\",\n            \"LAG(f_3, t_idx)\",\n            \"LAG(f_4, t_idx)\",\n            \"LAG(ABSOLUTE(f_1), t_idx)\",\n            \"LAG(ABSOLUTE(f_2), t_idx)\",\n            \"LAG(f_1 * f_2, t_idx)\",\n            \"LAG(TEST_MO(f_5)[1], t_idx)\",\n            \"LAG(TEST_MO(f_5)[0], t_idx)\",\n        ],\n    )\n    fc = FeatureCollection(features=features)\n    fc_dict = fc.to_dict()\n\n    fc_json = json.dumps(fc_dict)\n\n    fc2_dict = json.loads(fc_json)\n\n    
fc2 = FeatureCollection.from_dict(fc2_dict)\n\n    assert fc == fc2\n    lsa_features = [x for x in fc2.all_features if x.get_primitive_name() == \"test_mo\"]\n    assert len(lsa_features[0].related_features) == 1\n\n\ndef test_lite_feature_assertions():\n    f1 = LiteFeature(name=\"f1\", logical_type=Double)\n    f2 = LiteFeature(name=\"f1\", logical_type=Double, df_id=\"df1\")\n\n    assert f1 != f2\n\n    with pytest.raises(\n        TypeError,\n        match=\"Name must be given if origin feature\",\n    ):\n        LiteFeature(logical_type=Double)\n\n    with pytest.raises(\n        TypeError,\n        match=\"Logical Type must be given if origin feature\",\n    ):\n        LiteFeature(name=\"f1\")\n\n    with pytest.raises(\n        ValueError,\n        match=\"primitive input must be of type PrimitiveBase\",\n    ):\n        LiteFeature(name=\"f3\", primitive=\"AddNumeric\", base_features=[f1, f2])\n\n    f = LiteFeature(\"f4\", logical_type=Double)\n    with pytest.raises(AttributeError, match=\"name is immutable\"):\n        f.name = \"new name\"\n\n    with pytest.raises(ValueError, match=\"only used on multioutput features\"):\n        f.non_indexed_name\n\n    with pytest.raises(AttributeError, match=\"logical_type is immutable\"):\n        f.logical_type = Boolean\n\n    with pytest.raises(AttributeError, match=\"tags is immutable\"):\n        f.tags = {\"other\"}\n\n    with pytest.raises(AttributeError, match=\"primitive is immutable\"):\n        f.primitive = AddNumeric\n\n    with pytest.raises(AttributeError, match=\"base_features are immutable\"):\n        f.base_features = [f1]\n\n    with pytest.raises(AttributeError, match=\"df_id is immutable\"):\n        f.df_id = \"df_id\"\n\n    with pytest.raises(AttributeError, match=\"id is immutable\"):\n        f.id = \"id\"\n\n    with pytest.raises(AttributeError, match=\"n_output_features is immutable\"):\n        f.n_output_features = \"n_output_features\"\n\n    with pytest.raises(AttributeError, 
match=\"depth is immutable\"):\n        f.depth = \"depth\"\n\n    with pytest.raises(AttributeError, match=\"idx is immutable\"):\n        f.idx = \"idx\"\n\n\ndef test_lite_feature_to_column_schema():\n    f1 = LiteFeature(name=\"f1\", logical_type=Double, tags={\"index\", \"numeric\"})\n\n    column_schema = f1.column_schema\n\n    assert column_schema.is_numeric\n    assert isinstance(column_schema.logical_type, Double)\n    assert column_schema.semantic_tags == {\"index\", \"numeric\"}\n\n    f2 = LiteFeature(name=\"f2\", primitive=Absolute(), base_features=[f1])\n\n    column_schema = f2.column_schema\n    assert column_schema.semantic_tags == {\"numeric\"}\n\n\ndef test_lite_feature_to_dependent_primitives():\n    f1 = LiteFeature(name=\"f1\", logical_type=Double)\n\n    f2 = LiteFeature(name=\"f2\", primitive=Absolute(), base_features=[f1])\n\n    f3 = LiteFeature(name=\"f3\", primitive=AddNumeric(), base_features=[f1, f2])\n\n    f4 = LiteFeature(name=\"f4\", primitive=MultiplyNumeric(), base_features=[f1, f3])\n\n    assert set([x.name for x in f4.dependent_primitives()]) == set(\n        [\"multiply_numeric\", \"absolute\", \"add_numeric\"],\n    )\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/__init__.py",
    "content": ""
  },
  {
    "path": "featuretools/tests/primitive_tests/aggregation_primitive_tests/__init__.py",
    "content": ""
  },
  {
    "path": "featuretools/tests/primitive_tests/aggregation_primitive_tests/test_agg_primitives.py",
    "content": "from datetime import datetime\nfrom math import sqrt\n\nimport numpy as np\nimport pandas as pd\nimport pytest\nfrom pandas.core.dtypes.dtypes import CategoricalDtype\nfrom pytest import raises\n\nfrom featuretools.primitives import (\n    AverageCountPerUnique,\n    DateFirstEvent,\n    Entropy,\n    FirstLastTimeDelta,\n    HasNoDuplicates,\n    IsMonotonicallyDecreasing,\n    IsMonotonicallyIncreasing,\n    Kurtosis,\n    MaxCount,\n    MaxMinDelta,\n    MedianCount,\n    MinCount,\n    NMostCommon,\n    NMostCommonFrequency,\n    NumFalseSinceLastTrue,\n    NumPeaks,\n    NumTrueSinceLastFalse,\n    NumZeroCrossings,\n    NUniqueDays,\n    NUniqueDaysOfCalendarYear,\n    NUniqueDaysOfMonth,\n    NUniqueMonths,\n    NUniqueWeeks,\n    PercentTrue,\n    Trend,\n    Variance,\n    get_aggregation_primitives,\n)\nfrom featuretools.tests.primitive_tests.utils import (\n    PrimitiveTestBase,\n    check_serialize,\n    find_applicable_primitives,\n    valid_dfs,\n)\n\n\ndef test_nmostcommon_categorical():\n    n_most = NMostCommon(3)\n    expected = pd.Series([1.0, 2.0, np.nan])\n\n    ints = pd.Series([1, 2, 1, 1]).astype(\"int64\")\n    assert pd.Series(n_most(ints)).equals(expected)\n\n    cats = pd.Series([1, 2, 1, 1]).astype(\"category\")\n    assert pd.Series(n_most(cats)).equals(expected)\n\n    # Value counts includes data for categories that are not present in data.\n    # Make sure these counts are not included in most common outputs\n    extra_dtype = CategoricalDtype(categories=[1, 2, 3])\n    cats_extra = pd.Series([1, 2, 1, 1]).astype(extra_dtype)\n    assert pd.Series(n_most(cats_extra)).equals(expected)\n\n\ndef test_agg_primitives_can_init_without_params():\n    agg_primitives = get_aggregation_primitives().values()\n    for agg_primitive in agg_primitives:\n        agg_primitive()\n\n\ndef test_trend_works_with_different_input_dtypes():\n    dates = pd.to_datetime([\"2020-01-01\", \"2020-01-02\", \"2020-01-03\"])\n    numeric = 
pd.Series([1, 2, 3])\n\n    trend = Trend()\n    dtypes = [\"float64\", \"int64\", \"Int64\"]\n\n    for dtype in dtypes:\n        actual = trend(numeric.astype(dtype), dates)\n        assert np.isclose(actual, 1)\n\n\ndef test_percent_true_boolean():\n    booleans = pd.Series([True, False, True, pd.NA], dtype=\"boolean\")\n    pct_true = PercentTrue()\n    assert pct_true(booleans) == 0.5\n\n\nclass TestAverageCountPerUnique(PrimitiveTestBase):\n    primitive = AverageCountPerUnique\n    array = pd.Series([1, 1, 2, 2, 3, 4, 5, 6, 7, 8])\n\n    def test_percent_unique(self):\n        primitive_func = AverageCountPerUnique().get_function()\n        assert primitive_func(self.array) == 1.25\n\n    def test_nans(self):\n        primitive_func = AverageCountPerUnique().get_function()\n        array_nans = pd.concat([self.array.copy(), pd.Series([np.nan])])\n        assert primitive_func(array_nans) == 1.25\n        primitive_func = AverageCountPerUnique(skipna=False).get_function()\n        array_nans = pd.concat([self.array.copy(), pd.Series([np.nan])])\n        assert primitive_func(array_nans) == (11 / 9.0)\n\n    def test_empty_string(self):\n        primitive_func = AverageCountPerUnique().get_function()\n        array_empty_string = pd.concat([self.array.copy(), pd.Series([np.nan, \"\", \"\"])])\n        assert primitive_func(array_empty_string) == (4 / 3.0)\n\n    def test_with_featuretools(self, es):\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        primitive_instance = self.primitive()\n        aggregation.append(primitive_instance)\n        valid_dfs(es, aggregation, transform, self.primitive)\n\n\nclass TestVariance(PrimitiveTestBase):\n    primitive = Variance\n\n    def test_regular(self):\n        variance = self.primitive().get_function()\n        np.testing.assert_almost_equal(variance(np.array([0, 3, 4, 3])), 2.25)\n\n    def test_single(self):\n        variance = self.primitive().get_function()\n        
np.testing.assert_almost_equal(variance(np.array([4])), 0)\n\n    def test_double(self):\n        variance = self.primitive().get_function()\n        np.testing.assert_almost_equal(variance(np.array([3, 4])), 0.25)\n\n    def test_empty(self):\n        variance = self.primitive().get_function()\n        np.testing.assert_almost_equal(variance(np.array([])), np.nan)\n\n    def test_nan(self):\n        variance = self.primitive().get_function()\n        np.testing.assert_almost_equal(\n            variance(pd.Series([0, np.nan, 4, 3])),\n            2.8888888888888893,\n        )\n\n    def test_allnan(self):\n        variance = self.primitive().get_function()\n        np.testing.assert_almost_equal(\n            variance(pd.Series([np.nan, np.nan, np.nan])),\n            np.nan,\n        )\n\n\nclass TestFirstLastTimeDelta(PrimitiveTestBase):\n    primitive = FirstLastTimeDelta\n    times = pd.Series([datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)])\n    actual_delta = (times.iloc[-1] - times.iloc[0]).total_seconds()\n\n    def test_first_last_time_delta(self):\n        primitive_func = self.primitive().get_function()\n        assert primitive_func(self.times) == self.actual_delta\n\n    def test_with_nans(self):\n        primitive_func = self.primitive().get_function()\n        times = pd.concat([self.times, pd.Series([np.nan])])\n        assert primitive_func(times) == self.actual_delta\n        assert pd.isna(primitive_func(pd.Series([np.nan])))\n\n    def test_with_featuretools(self, es):\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        primitive_instance = self.primitive()\n        aggregation.append(primitive_instance)\n        valid_dfs(es, aggregation, transform, self.primitive)\n\n\nclass TestEntropy(PrimitiveTestBase):\n    primitive = Entropy\n\n    @pytest.mark.parametrize(\n        \"dtype\",\n        [\"category\", \"object\", \"string\"],\n    )\n    def test_regular(self, dtype):\n        data = 
pd.Series([1, 2, 3, 2], dtype=dtype)\n        primitive_func = self.primitive().get_function()\n        given_answer = primitive_func(data)\n        assert np.isclose(given_answer, 1.03, atol=0.01)\n\n    @pytest.mark.parametrize(\n        \"dtype\",\n        [\"category\", \"object\", \"string\"],\n    )\n    def test_empty(self, dtype):\n        data = pd.Series([], dtype=dtype)\n        primitive_func = self.primitive().get_function()\n        given_answer = primitive_func(data)\n        assert given_answer == 0.0\n\n    @pytest.mark.parametrize(\n        \"dtype\",\n        [\"category\", \"object\", \"string\"],\n    )\n    def test_args(self, dtype):\n        data = pd.Series([1, 2, 3, 2], dtype=dtype)\n        if dtype == \"string\":\n            data = pd.concat([data, pd.Series([pd.NA, pd.NA], dtype=dtype)])\n        else:\n            data = pd.concat([data, pd.Series([np.nan, np.nan], dtype=dtype)])\n        primitive_func = self.primitive(dropna=True, base=2).get_function()\n        given_answer = primitive_func(data)\n        assert np.isclose(given_answer, 1.5, atol=0.001)\n\n    def test_with_featuretools(self, es):\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        primitive_instance = self.primitive()\n        aggregation.append(primitive_instance)\n        valid_dfs(es, aggregation, transform, self.primitive, max_depth=2)\n\n\nclass TestKurtosis(PrimitiveTestBase):\n    primitive = Kurtosis\n\n    @pytest.mark.parametrize(\n        \"dtype\",\n        [\"int64\", \"float64\"],\n    )\n    def test_regular(self, dtype):\n        data = pd.Series([1, 2, 3, 4, 5], dtype=dtype)\n        answer = -1.3\n        primitive_func = self.primitive().get_function()\n        given_answer = primitive_func(data)\n        assert np.isclose(answer, given_answer, atol=0.01)\n\n        data = pd.Series([1, 2, 3, 4, 5, 6], dtype=dtype)\n        answer = -1.26\n        primitive_func = self.primitive().get_function()\n        
given_answer = primitive_func(data)\n        assert np.isclose(answer, given_answer, atol=0.01)\n\n        data = pd.Series([x * x for x in list(range(100))], dtype=dtype)\n        answer = -0.85\n        primitive_func = self.primitive().get_function()\n        given_answer = primitive_func(data)\n        assert np.isclose(answer, given_answer, atol=0.01)\n\n        if dtype == \"float64\":\n            # Series contains floating point values - only check with float dtype\n            data = pd.Series([sqrt(x) for x in list(range(100))], dtype=dtype)\n            answer = -0.46\n            primitive_func = self.primitive().get_function()\n            given_answer = primitive_func(data)\n            assert np.isclose(answer, given_answer, atol=0.01)\n\n    def test_nan(self):\n        data = pd.Series([np.nan, 5, 3], dtype=\"float64\")\n        primitive_func = self.primitive().get_function()\n        given_answer = primitive_func(data)\n        assert pd.isna(given_answer)\n\n    @pytest.mark.parametrize(\n        \"dtype\",\n        [\"int64\", \"float64\"],\n    )\n    def test_empty(self, dtype):\n        data = pd.Series([], dtype=dtype)\n        primitive_func = self.primitive().get_function()\n        given_answer = primitive_func(data)\n        assert pd.isna(given_answer)\n\n    def test_inf(self):\n        data = pd.Series([1, np.inf], dtype=\"float64\")\n        primitive_func = self.primitive().get_function()\n        given_answer = primitive_func(data)\n        assert pd.isna(given_answer)\n\n        data = pd.Series([np.NINF, 1, np.inf], dtype=\"float64\")\n        primitive_func = self.primitive().get_function()\n        given_answer = primitive_func(data)\n        assert pd.isna(given_answer)\n\n    def test_arg(self):\n        data = pd.Series([1, 2, 3, 4, 5, np.nan, np.nan], dtype=\"float64\")\n        answer = -1.3\n        primitive_func = self.primitive(nan_policy=\"omit\").get_function()\n        given_answer = primitive_func(data)\n        
assert answer == given_answer\n\n        primitive_func = self.primitive(nan_policy=\"propagate\").get_function()\n        given_answer = primitive_func(data)\n        assert np.isnan(given_answer)\n\n        primitive_func = self.primitive(nan_policy=\"raise\").get_function()\n        with raises(ValueError):\n            primitive_func(data)\n\n    def test_error(self):\n        with raises(ValueError):\n            self.primitive(nan_policy=\"invalid_policy\").get_function()\n\n    def test_with_featuretools(self, es):\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        primitive_instance = self.primitive()\n        aggregation.append(primitive_instance)\n        valid_dfs(es, aggregation, transform, self.primitive)\n\n\nclass TestNumZeroCrossings(PrimitiveTestBase):\n    primitive = NumZeroCrossings\n\n    def test_nan(self):\n        data = pd.Series([3, np.nan, 5, 3, np.nan, 0, np.nan, 0, np.nan, -2])\n        # crossing from 0 to np.nan to -2, which is 1 crossing\n        answer = 1\n        primtive_func = self.primitive().get_function()\n        given_answer = primtive_func(data)\n        assert given_answer == answer\n\n    def test_empty(self):\n        data = pd.Series([], dtype=\"int64\")\n        answer = 0\n        primtive_func = self.primitive().get_function()\n        given_answer = primtive_func(data)\n        assert given_answer == answer\n\n    def test_inf(self):\n        data = pd.Series([-1, np.inf])\n        answer = 1\n        primtive_func = self.primitive().get_function()\n        given_answer = primtive_func(data)\n        assert given_answer == answer\n\n        data = pd.Series([np.NINF, 1, np.inf])\n        answer = 1\n        primtive_func = self.primitive().get_function()\n        given_answer = primtive_func(data)\n        assert given_answer == answer\n\n    def test_zeros(self):\n        data = pd.Series([1, 0, -1, 0, 1, 0, -1])\n        answer = 3\n        primtive_func = 
self.primitive().get_function()\n        given_answer = primtive_func(data)\n        assert given_answer == answer\n\n        data = pd.Series([1, 0, 1, 0, 1])\n        answer = 0\n        primtive_func = self.primitive().get_function()\n        given_answer = primtive_func(data)\n        assert given_answer == answer\n\n    def test_regular(self):\n        data = pd.Series([1, 2, 3, 4, 5])\n        answer = 0\n        primtive_func = self.primitive().get_function()\n        given_answer = primtive_func(data)\n        assert given_answer == answer\n\n        data = pd.Series([1, -1, 2, -2, 3, -3])\n        answer = 5\n        primtive_func = self.primitive().get_function()\n        given_answer = primtive_func(data)\n        assert given_answer == answer\n\n    def test_with_featuretools(self, es):\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        primitive_instance = self.primitive()\n        aggregation.append(primitive_instance)\n        valid_dfs(es, aggregation, transform, self.primitive)\n\n\nclass TestNumTrueSinceLastFalse(PrimitiveTestBase):\n    primitive = NumTrueSinceLastFalse\n\n    def test_regular(self):\n        primitive_func = self.primitive().get_function()\n        bools = pd.Series([False, True, False, True, True])\n        answer = primitive_func(bools)\n        correct_answer = 2\n        assert answer == correct_answer\n\n    def test_regular_end_in_false(self):\n        primitive_func = self.primitive().get_function()\n        bools = pd.Series([False, True, False, True, True, False])\n        answer = primitive_func(bools)\n        correct_answer = 0\n        assert answer == correct_answer\n\n    def test_no_false(self):\n        primitive_func = self.primitive().get_function()\n        bools = pd.Series([True] * 5)\n        assert pd.isna(primitive_func(bools))\n\n    def test_all_false(self):\n        primitive_func = self.primitive().get_function()\n        bools = pd.Series([False, False, False])\n   
     answer = primitive_func(bools)\n        correct_answer = 0\n        assert answer == correct_answer\n\n    def test_nan(self):\n        primitive_func = self.primitive().get_function()\n        bools = pd.Series([False, True, np.nan, True, True])\n        answer = primitive_func(bools)\n        correct_answer = 3\n        assert answer == correct_answer\n\n    def test_all_nan(self):\n        primitive_func = self.primitive().get_function()\n        bools = pd.Series([np.nan, np.nan, np.nan])\n        assert pd.isna(primitive_func(bools))\n\n    def test_with_featuretools(self, es):\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        primitive_instance = self.primitive()\n        aggregation.append(primitive_instance)\n        valid_dfs(es, aggregation, transform, self.primitive)\n\n\nclass TestNumFalseSinceLastTrue(PrimitiveTestBase):\n    primitive = NumFalseSinceLastTrue\n\n    def test_regular(self):\n        primitive_func = self.primitive().get_function()\n        bools = pd.Series([True, False, True, False, False])\n        answer = primitive_func(bools)\n        correct_answer = 2\n        assert answer == correct_answer\n\n    def test_regular_end_in_true(self):\n        primitive_func = self.primitive().get_function()\n        bools = pd.Series([True, False, True, False, False, True])\n        answer = primitive_func(bools)\n        correct_answer = 0\n        assert answer == correct_answer\n\n    def test_no_true(self):\n        primitive_func = self.primitive().get_function()\n        bools = pd.Series([False] * 5)\n        assert pd.isna(primitive_func(bools))\n\n    def test_all_true(self):\n        primitive_func = self.primitive().get_function()\n        bools = pd.Series([True, True, True])\n        answer = primitive_func(bools)\n        correct_answer = 0\n        assert answer == correct_answer\n\n    def test_nan(self):\n        primitive_func = self.primitive().get_function()\n        bools = 
pd.Series([True, False, np.nan, False, False])\n        answer = primitive_func(bools)\n        correct_answer = 3\n        assert answer == correct_answer\n\n    def test_all_nan(self):\n        primitive_func = self.primitive().get_function()\n        bools = pd.Series([np.nan, np.nan, np.nan])\n        assert pd.isna(primitive_func(bools))\n\n    def test_numeric_and_string_input(self):\n        primitive_func = self.primitive().get_function()\n        bools = pd.Series([True, 0, 1, \"10\", \"\"])\n        answer = primitive_func(bools)\n        correct_answer = 1\n        assert answer == correct_answer\n\n    def test_with_featuretools(self, es):\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        primitive_instance = self.primitive()\n        aggregation.append(primitive_instance)\n        valid_dfs(es, aggregation, transform, self.primitive)\n\n\nclass TestNumPeaks(PrimitiveTestBase):\n    primitive = NumPeaks\n\n    @pytest.mark.parametrize(\n        \"dtype\",\n        [\"int64\", \"float64\", \"Int64\"],\n    )\n    def test_negative_and_positive_nums(self, dtype):\n        get_peaks = self.primitive().get_function()\n        assert (\n            get_peaks(pd.Series([-5, 0, 10, 0, 10, -5, -4, -5, 10, 0], dtype=dtype))\n            == 4\n        )\n\n    @pytest.mark.parametrize(\n        \"dtype\",\n        [\"int64\", \"float64\", \"Int64\"],\n    )\n    def test_plateu(self, dtype):\n        get_peaks = self.primitive().get_function()\n        assert get_peaks(pd.Series([1, 2, 3, 3, 3, 3, 3, 2, 1], dtype=dtype)) == 1\n        assert get_peaks(pd.Series([1, 2, 3, 3, 3, 4, 3, 3, 3, 2, 1], dtype=dtype)) == 1\n        assert (\n            get_peaks(\n                pd.Series(\n                    [\n                        5,\n                        4,\n                        3,\n                        3,\n                        3,\n                        3,\n                        3,\n                        3,\n  
                      4,\n                        5,\n                        5,\n                        5,\n                        5,\n                        5,\n                        3,\n                        3,\n                        3,\n                        3,\n                        4,\n                    ],\n                    dtype=dtype,\n                ),\n            )\n            == 1\n        )\n        assert (\n            get_peaks(\n                pd.Series(\n                    [\n                        1,\n                        2,\n                        3,\n                        3,\n                        3,\n                        2,\n                        1,\n                        2,\n                        3,\n                        3,\n                        3,\n                        2,\n                        5,\n                        5,\n                        5,\n                        2,\n                    ],\n                    dtype=dtype,\n                ),\n            )\n            == 3\n        )\n\n    @pytest.mark.parametrize(\n        \"dtype\",\n        [\"int64\", \"float64\", \"Int64\"],\n    )\n    def test_regular(self, dtype):\n        get_peaks = self.primitive().get_function()\n        assert get_peaks(pd.Series([1, 7, 3, 8, 2, 3, 4, 3, 4, 2, 4], dtype=dtype)) == 4\n        assert get_peaks(pd.Series([1, 2, 3, 2, 1], dtype=dtype)) == 1\n\n    @pytest.mark.parametrize(\n        \"dtype\",\n        [\"int64\", \"float64\", \"Int64\"],\n    )\n    def test_no_peak(self, dtype):\n        get_peaks = self.primitive().get_function()\n        assert get_peaks(pd.Series([1, 2, 3], dtype=dtype)) == 0\n        assert get_peaks(pd.Series([3, 2, 2, 2, 2, 1], dtype=dtype)) == 0\n\n    @pytest.mark.parametrize(\n        \"dtype\",\n        [\"int64\", \"float64\", \"Int64\"],\n    )\n    def test_too_small_data(self, dtype):\n        get_peaks = self.primitive().get_function()\n        
assert get_peaks(pd.Series([], dtype=dtype)) == 0\n        assert get_peaks(pd.Series([1])) == 0\n        assert get_peaks(pd.Series([1, 1])) == 0\n        assert get_peaks(pd.Series([1, 2])) == 0\n        assert get_peaks(pd.Series([2, 1])) == 0\n\n    @pytest.mark.parametrize(\n        \"dtype\",\n        [\"int64\", \"float64\", \"Int64\"],\n    )\n    def test_nans(self, dtype):\n        get_peaks = self.primitive().get_function()\n        array = pd.Series(\n            [\n                0,\n                5,\n                10,\n                15,\n                20,\n                0,\n                1,\n                2,\n                3,\n                0,\n                0,\n                5,\n                0,\n                7,\n                14,\n            ],\n            dtype=dtype,\n        )\n        if dtype == \"float64\":\n            array = pd.concat([array, pd.Series([np.nan, np.nan])])\n        elif dtype == \"Int64\":\n            array = pd.concat([array, pd.Series([pd.NA, pd.NA])])\n        array = array.astype(dtype=dtype)\n        assert get_peaks(array) == 3\n\n    def test_with_featuretools(self, es):\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        primitive_instance = self.primitive()\n        aggregation.append(primitive_instance)\n        valid_dfs(es, aggregation, transform, self.primitive)\n\n\nclass TestDateFirstEvent(PrimitiveTestBase):\n    primitive = DateFirstEvent\n\n    def test_regular(self):\n        primitive_func = self.primitive().get_function()\n        case = pd.Series(\n            [\n                \"2011-04-09 10:30:00\",\n                \"2011-04-09 10:30:06\",\n                \"2011-04-09 10:30:12\",\n                \"2011-04-09 10:30:18\",\n            ],\n            dtype=\"datetime64[ns]\",\n        )\n        answer = pd.Timestamp(\"2011-04-09 10:30:00\")\n        given_answer = primitive_func(case)\n        assert given_answer == answer\n\n    
def test_nat(self):\n        primitive_func = self.primitive().get_function()\n        case = pd.Series(\n            [\n                pd.NaT,\n                pd.NaT,\n                \"2011-04-09 10:30:12\",\n                \"2011-04-09 10:30:18\",\n            ],\n            dtype=\"datetime64[ns]\",\n        )\n        answer = pd.Timestamp(\"2011-04-09 10:30:12\")\n        given_answer = primitive_func(case)\n        assert given_answer == answer\n\n    def test_empty(self):\n        primitive_func = self.primitive().get_function()\n        case = pd.Series([], dtype=\"datetime64[ns]\")\n        given_answer = primitive_func(case)\n        assert pd.isna(given_answer)\n\n    def test_with_featuretools(self, es):\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        primitive_instance = self.primitive()\n        aggregation.append(primitive_instance)\n        valid_dfs(es, aggregation, transform, self.primitive)\n\n    def test_serialize(self, es):\n        check_serialize(self.primitive, es, target_dataframe_name=\"sessions\")\n\n\nclass TestMinCount(PrimitiveTestBase):\n    primitive = MinCount\n\n    def test_nan(self):\n        data = pd.Series([np.nan, np.nan, np.nan])\n        primitive_func = self.primitive().get_function()\n        answer = primitive_func(data)\n        assert pd.isna(answer)\n\n    def test_inf(self):\n        data = pd.Series([5, 10, 10, np.inf, np.inf, np.inf])\n        primitive_func = self.primitive().get_function()\n        answer = primitive_func(data)\n        assert answer == 1\n\n    def test_regular(self):\n        data = pd.Series([1, 2, 2, 2, 3, 4, 4, 4, 5])\n        primitive_func = self.primitive().get_function()\n        answer = primitive_func(data)\n        assert answer == 1\n\n        data = pd.Series([2, 2, 2, 3, 4, 4, 4])\n        primitive_func = self.primitive().get_function()\n        answer = primitive_func(data)\n        assert answer == 3\n\n    def test_skipna(self):\n     
   data = pd.Series([1, 1, 2, 3, 4, 4, np.nan, 5])\n        primitive_func = self.primitive(skipna=False).get_function()\n        answer = primitive_func(data)\n        assert pd.isna(answer)\n\n    def test_ninf(self):\n        data = pd.Series([np.NINF, np.NINF, np.nan])\n        primitive_func = self.primitive().get_function()\n        answer = primitive_func(data)\n        assert answer == 2\n\n    def test_with_featuretools(self, es):\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        primitive_instance = self.primitive()\n        aggregation.append(primitive_instance)\n        valid_dfs(es, aggregation, transform, self.primitive)\n\n\nclass TestMaxCount(PrimitiveTestBase):\n    primitive = MaxCount\n\n    def test_nan(self):\n        data = pd.Series([np.nan, np.nan, np.nan])\n        primitive_func = self.primitive().get_function()\n        answer = primitive_func(data)\n        assert pd.isna(answer)\n\n    def test_inf(self):\n        data = pd.Series([5, 10, 10, np.inf, np.inf, np.inf])\n        primitive_func = self.primitive().get_function()\n        answer = primitive_func(data)\n        assert answer == 3\n\n    def test_regular(self):\n        data = pd.Series([1, 1, 2, 3, 4, 4, 4, 5])\n        primitive_func = self.primitive().get_function()\n        answer = primitive_func(data)\n        assert answer == 1\n\n        data = pd.Series([1, 1, 2, 3, 4, 4, 4])\n        primitive_func = self.primitive().get_function()\n        answer = primitive_func(data)\n        assert answer == 3\n\n    def test_skipna(self):\n        data = pd.Series([1, 1, 2, 3, 4, 4, np.nan, 5])\n        primitive_func = self.primitive(skipna=False).get_function()\n        answer = primitive_func(data)\n        assert pd.isna(answer)\n\n    def test_ninf(self):\n        data = pd.Series([np.NINF, np.NINF, np.nan])\n        primitive_func = self.primitive().get_function()\n        answer = primitive_func(data)\n        assert answer == 2\n\n    
def test_with_featuretools(self, es):\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        primitive_instance = self.primitive()\n        aggregation.append(primitive_instance)\n        valid_dfs(es, aggregation, transform, self.primitive)\n\n\nclass TestMaxMinDelta(PrimitiveTestBase):\n    primitive = MaxMinDelta\n    array = pd.Series([1, 1, 2, 2, 3, 4, 5, 6, 7, 8])\n\n    def test_max_min_delta(self):\n        primitive_func = self.primitive().get_function()\n        assert primitive_func(self.array) == 7.0\n\n    def test_nans(self):\n        primitive_func = self.primitive().get_function()\n        array_nans = pd.concat([self.array, pd.Series([np.nan])])\n        assert primitive_func(array_nans) == 7.0\n        primitive_func = self.primitive(skipna=False).get_function()\n        array_nans = pd.concat([self.array, pd.Series([np.nan])])\n        assert pd.isna(primitive_func(array_nans))\n\n    def test_with_featuretools(self, es):\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        primitive_instance = self.primitive()\n        aggregation.append(primitive_instance)\n        valid_dfs(es, aggregation, transform, self.primitive)\n\n\nclass TestMedianCount(PrimitiveTestBase):\n    primitive = MedianCount\n\n    def test_regular(self):\n        primitive_func = self.primitive().get_function()\n        case = pd.Series([1, 3, 5, 7])\n        given_answer = primitive_func(case)\n        assert given_answer == 0\n\n    def test_nans(self):\n        primitive_func = self.primitive().get_function()\n        case = pd.Series([1, 3, 4, 4, 4, 5, 7, np.nan, np.nan])\n        given_answer = primitive_func(case)\n        assert given_answer == 3\n        primitive_func = self.primitive(skipna=False).get_function()\n        given_answer = primitive_func(case)\n        assert pd.isna(given_answer)\n\n    def test_with_featuretools(self, es):\n        transform, aggregation = 
find_applicable_primitives(self.primitive)\n        primitive_instance = self.primitive()\n        aggregation.append(primitive_instance)\n        valid_dfs(es, aggregation, transform, self.primitive)\n\n\nclass TestNMostCommonFrequency(PrimitiveTestBase):\n    primitive = NMostCommonFrequency\n\n    def test_regular(self):\n        test_cases = [\n            pd.Series([8, 7, 10, 10, 10, 3, 4, 5, 10, 8, 7]),\n            pd.Series([7, 7, 7, 6, 6, 5, 4]),\n            pd.Series([4, 5, 6, 6, 7, 7, 7]),\n        ]\n\n        answers = [\n            pd.Series([4, 2, 2]),\n            pd.Series([3, 2, 1]),\n            pd.Series([3, 2, 1]),\n        ]\n\n        primtive_func = self.primitive(3).get_function()\n\n        for case, answer in zip(test_cases, answers):\n            given_answer = primtive_func(case)\n            given_answer = given_answer.reset_index(drop=True)\n            assert given_answer.equals(answer)\n\n    def test_n_larger_than_len(self):\n        test_cases = [\n            pd.Series([\"red\", \"red\", \"blue\", \"green\"]),\n            pd.Series([\"red\", \"red\", \"red\", \"blue\", \"green\"]),\n            pd.Series([\"red\", \"blue\", \"green\", \"orange\"]),\n        ]\n        answers = [\n            pd.Series([2, 1, 1, np.nan, np.nan]),\n            pd.Series([3, 1, 1, np.nan, np.nan]),\n            pd.Series([1, 1, 1, 1, np.nan]),\n        ]\n\n        primtive_func = self.primitive(5).get_function()\n        for case, answer in zip(test_cases, answers):\n            given_answer = primtive_func(case)\n            given_answer = given_answer.reset_index(drop=True)\n            assert given_answer.equals(answer)\n\n    def test_skipna(self):\n        array = pd.Series([\"red\", \"red\", \"blue\", \"green\", np.nan, np.nan])\n        primtive_func = self.primitive(5, skipna=False).get_function()\n        given_answer = primtive_func(array)\n        given_answer = given_answer.reset_index(drop=True)\n        answer = pd.Series([2, 2, 
1, 1, np.nan])\n        assert given_answer.equals(answer)\n\n    def test_with_featuretools(self, es):\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        aggregation.append(self.primitive(5))\n        valid_dfs(\n            es,\n            aggregation,\n            transform,\n            self.primitive,\n            target_dataframe_name=\"customers\",\n            multi_output=True,\n        )\n\n    def test_with_featuretools_args(self, es):\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        aggregation.append(self.primitive(5, skipna=False))\n        valid_dfs(\n            es,\n            aggregation,\n            transform,\n            self.primitive,\n            target_dataframe_name=\"customers\",\n            multi_output=True,\n        )\n\n    def test_serialize(self, es):\n        check_serialize(\n            primitive=self.primitive,\n            es=es,\n            target_dataframe_name=\"customers\",\n        )\n\n\nclass TestNUniqueDays(PrimitiveTestBase):\n    primitive = NUniqueDays\n\n    def test_two_years(self):\n        primitive_func = self.primitive().get_function()\n        array = pd.Series(pd.date_range(\"2010-01-01\", \"2011-12-31\"))\n        assert primitive_func(array) == 365 * 2\n\n    def test_leap_year(self):\n        primitive_func = self.primitive().get_function()\n        array = pd.Series(pd.date_range(\"2016-01-01\", \"2017-12-31\"))\n        assert primitive_func(array) == 365 * 2 + 1\n\n    def test_ten_years(self):\n        primitive_func = self.primitive().get_function()\n        array = pd.Series(pd.date_range(\"2010-01-01\", \"2019-12-31\"))\n        assert primitive_func(array) == 365 * 10 + 1 + 1\n\n    def test_distinct_dt(self):\n        primitive_func = self.primitive().get_function()\n        array = pd.Series(\n            [\n                datetime(2019, 2, 21),\n                datetime(2019, 2, 1, 1, 20, 0),\n                
datetime(2019, 2, 1, 1, 30, 0),\n                datetime(2018, 2, 1),\n                datetime(2019, 1, 1),\n            ],\n        )\n        assert primitive_func(array) == 4\n\n    def test_NaT(self):\n        primitive_func = self.primitive().get_function()\n        array = pd.Series(pd.date_range(\"2010-01-01\", \"2011-12-31\"))\n        NaT_array = pd.Series([pd.NaT] * 100)\n        assert primitive_func(pd.concat([array, NaT_array])) == 365 * 2\n\n    def test_with_featuretools(self, es):\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        primitive_instance = self.primitive()\n        aggregation.append(primitive_instance)\n        valid_dfs(es, aggregation, transform, self.primitive)\n\n\nclass TestNUniqueDaysOfCalendarYear(PrimitiveTestBase):\n    primitive = NUniqueDaysOfCalendarYear\n\n    def test_two_years(self):\n        primitive_func = self.primitive().get_function()\n        array = pd.Series(pd.date_range(\"2010-01-01\", \"2011-12-31\"))\n        assert primitive_func(array) == 365\n\n    def test_leap_year(self):\n        primitive_func = self.primitive().get_function()\n        array = pd.Series(pd.date_range(\"2016-01-01\", \"2017-12-31\"))\n        assert primitive_func(array) == 366\n\n    def test_ten_years(self):\n        primitive_func = self.primitive().get_function()\n        array = pd.Series(pd.date_range(\"2010-01-01\", \"2019-12-31\"))\n        assert primitive_func(array) == 366\n\n    def test_distinct_dt(self):\n        primitive_func = self.primitive().get_function()\n        array = pd.Series(\n            [\n                datetime(2019, 2, 21),\n                datetime(2019, 2, 1, 1, 20, 0),\n                datetime(2019, 2, 1, 1, 30, 0),\n                datetime(2018, 2, 1),\n                datetime(2019, 1, 1),\n            ],\n        )\n        assert primitive_func(array) == 3\n\n    def test_NaT(self):\n        primitive_func = self.primitive().get_function()\n        array = 
pd.Series(pd.date_range(\"2010-01-01\", \"2011-12-31\"))\n        NaT_array = pd.Series([pd.NaT] * 100)\n        assert primitive_func(pd.concat([array, NaT_array])) == 365\n\n    def test_with_featuretools(self, es):\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        primitive_instance = self.primitive()\n        aggregation.append(primitive_instance)\n        valid_dfs(es, aggregation, transform, self.primitive)\n\n\nclass TestNUniqueDaysOfMonth(PrimitiveTestBase):\n    primitive = NUniqueDaysOfMonth\n\n    def test_two_days(self):\n        primitive_func = self.primitive().get_function()\n        array = pd.Series(pd.date_range(\"2010-01-01\", \"2010-01-02\"))\n        assert primitive_func(array) == 2\n\n    def test_one_year(self):\n        primitive_func = self.primitive().get_function()\n        array = pd.Series(pd.date_range(\"2010-01-01\", \"2010-12-31\"))\n        assert primitive_func(array) == 31\n\n    def test_leap_year(self):\n        primitive_func = self.primitive().get_function()\n        array = pd.Series(pd.date_range(\"2016-01-01\", \"2017-12-31\"))\n        assert primitive_func(array) == 31\n\n    def test_distinct_dt(self):\n        primitive_func = self.primitive().get_function()\n        array = pd.Series(\n            [\n                datetime(2019, 2, 21),\n                datetime(2019, 2, 1, 1, 20, 0),\n                datetime(2019, 2, 1, 1, 30, 0),\n                datetime(2018, 2, 1),\n                datetime(2019, 1, 1),\n            ],\n        )\n        assert primitive_func(array) == 2\n\n    def test_NaT(self):\n        primitive_func = self.primitive().get_function()\n        array = pd.Series(pd.date_range(\"2010-01-01\", \"2010-12-31\"))\n        NaT_array = pd.Series([pd.NaT] * 100)\n        assert primitive_func(pd.concat([array, NaT_array])) == 31\n\n    def test_with_featuretools(self, es):\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        
primitive_instance = self.primitive()\n        aggregation.append(primitive_instance)\n        valid_dfs(es, aggregation, transform, self.primitive)\n\n\nclass TestNUniqueMonths(PrimitiveTestBase):\n    primitive = NUniqueMonths\n\n    def test_two_days(self):\n        primitive_func = self.primitive().get_function()\n        array = pd.Series(pd.date_range(\"2010-01-01\", \"2010-01-02\"))\n        assert primitive_func(array) == 1\n\n    def test_ten_years(self):\n        primitive_func = self.primitive().get_function()\n        array = pd.Series(pd.date_range(\"2010-01-01\", \"2019-12-31\"))\n        assert primitive_func(array) == 12 * 10\n\n    def test_distinct_dt(self):\n        primitive_func = self.primitive().get_function()\n        array = pd.Series(\n            [\n                datetime(2019, 2, 21),\n                datetime(2019, 2, 1, 1, 20, 0),\n                datetime(2019, 2, 1, 1, 30, 0),\n                datetime(2018, 2, 1),\n                datetime(2019, 1, 1),\n            ],\n        )\n        assert primitive_func(array) == 3\n\n    def test_NaT(self):\n        primitive_func = self.primitive().get_function()\n        array = pd.Series(pd.date_range(\"2010-01-01\", \"2011-12-31\"))\n        NaT_array = pd.Series([pd.NaT] * 100)\n        assert primitive_func(pd.concat([array, NaT_array])) == 12 * 2\n\n    def test_with_featuretools(self, es):\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        primitive_instance = self.primitive()\n        aggregation.append(primitive_instance)\n        valid_dfs(es, aggregation, transform, self.primitive)\n\n\nclass TestNUniqueWeeks(PrimitiveTestBase):\n    primitive = NUniqueWeeks\n\n    def test_same_week(self):\n        primitive_func = self.primitive().get_function()\n        array = pd.Series(pd.date_range(\"2019-01-01\", \"2019-01-02\"))\n        assert primitive_func(array) == 1\n\n    def test_ten_years(self):\n        primitive_func = 
self.primitive().get_function()\n        array = pd.Series(pd.date_range(\"2010-01-01\", \"2019-12-31\"))\n        assert primitive_func(array) == 523\n\n    def test_distinct_dt(self):\n        primitive_func = self.primitive().get_function()\n        array = pd.Series(\n            [\n                datetime(2019, 2, 21),\n                datetime(2019, 2, 1, 1, 20, 0),\n                datetime(2019, 2, 1, 1, 30, 0),\n                datetime(2018, 2, 2),\n                datetime(2019, 2, 3, 1, 30, 0),\n                datetime(2019, 1, 1),\n            ],\n        )\n        assert primitive_func(array) == 4\n\n    def test_NaT(self):\n        primitive_func = self.primitive().get_function()\n        array = pd.Series(pd.date_range(\"2019-01-01\", \"2019-01-02\"))\n        NaT_array = pd.Series([pd.NaT] * 100)\n        assert primitive_func(pd.concat([array, NaT_array])) == 1\n\n    def test_with_featuretools(self, es):\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        primitive_instance = self.primitive()\n        aggregation.append(primitive_instance)\n        valid_dfs(es, aggregation, transform, self.primitive)\n\n\nclass TestHasNoDuplicates(PrimitiveTestBase):\n    primitive = HasNoDuplicates\n\n    def test_regular(self):\n        primitive_func = self.primitive().get_function()\n        data = pd.Series([1, 1, 2])\n        assert not primitive_func(data)\n        assert isinstance(primitive_func(data), bool)\n\n        data = pd.Series([1, 2, 3])\n        assert primitive_func(data)\n        assert isinstance(primitive_func(data), bool)\n\n        data = pd.Series([1, 2, 4])\n        assert primitive_func(data)\n        assert isinstance(primitive_func(data), bool)\n\n        data = pd.Series([\"red\", \"blue\", \"orange\"])\n        assert primitive_func(data)\n        assert isinstance(primitive_func(data), bool)\n\n        data = pd.Series([\"red\", \"blue\", \"red\"])\n        assert not primitive_func(data)\n\n  
  def test_nan(self):\n        primitive_func = self.primitive().get_function()\n        data = pd.Series([np.nan, 1, 2, 3])\n        assert primitive_func(data)\n        assert isinstance(primitive_func(data), bool)\n\n        data = pd.Series([np.nan, np.nan, 1])\n        # drop both nans, so has 1 value\n        assert primitive_func(data) is True\n        assert isinstance(primitive_func(data), bool)\n\n        primitive_func = self.primitive(skipna=False).get_function()\n        data = pd.Series([np.nan, np.nan, 1])\n        assert primitive_func(data) is False\n        assert isinstance(primitive_func(data), bool)\n\n    def test_with_featuretools(self, es):\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        primitive_instantiate = self.primitive()\n        aggregation.append(primitive_instantiate)\n        valid_dfs(\n            es,\n            aggregation,\n            transform,\n            self.primitive,\n            target_dataframe_name=\"customers\",\n            instance_ids=[0, 1, 2],\n        )\n\n\nclass TestIsMonotonicallyDecreasing(PrimitiveTestBase):\n    primitive = IsMonotonicallyDecreasing\n\n    def test_monotonically_decreasing(self):\n        primitive_func = self.primitive().get_function()\n        case = pd.Series([9, 5, 3, 1, -1])\n        assert primitive_func(case) is True\n\n    def test_monotonically_increasing(self):\n        primitive_func = self.primitive().get_function()\n        case = pd.Series([-1, 1, 3, 5, 9])\n        assert primitive_func(case) is False\n\n    def test_non_monotonic(self):\n        primitive_func = self.primitive().get_function()\n        case = pd.Series([-1, 1, 3, 2, 5])\n        assert primitive_func(case) is False\n\n    def test_weakly_decreasing(self):\n        primitive_func = self.primitive().get_function()\n        case = pd.Series([9, 3, 3, 1, -1])\n        assert primitive_func(case) is True\n\n    def test_nan(self):\n        primitive_func = 
self.primitive().get_function()\n        case = pd.Series([9, 5, 3, np.nan, 1, -1])\n        assert primitive_func(case) is True\n\n        primitive_func = self.primitive().get_function()\n        case = pd.Series([-1, 1, 3, np.nan, 5, 9])\n        assert primitive_func(case) is False\n\n    def test_with_featuretools(self, es):\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        primitive_instantiate = self.primitive()\n        aggregation.append(primitive_instantiate)\n        valid_dfs(es, aggregation, transform, self.primitive)\n\n\nclass TestIsMonotonicallyIncreasing(PrimitiveTestBase):\n    primitive = IsMonotonicallyIncreasing\n\n    def test_monotonically_increasing(self):\n        primitive_func = self.primitive().get_function()\n        case = pd.Series([-1, 1, 3, 5, 9])\n        assert primitive_func(case) is True\n\n    def test_monotonically_decreasing(self):\n        primitive_func = self.primitive().get_function()\n        case = pd.Series([9, 5, 3, 1, -1])\n        assert primitive_func(case) is False\n\n    def test_non_monotonic(self):\n        primitive_func = self.primitive().get_function()\n        case = pd.Series([-1, 1, 3, 2, 5])\n        assert primitive_func(case) is False\n\n    def test_weakly_increasing(self):\n        primitive_func = self.primitive().get_function()\n        case = pd.Series([-1, 1, 3, 3, 9])\n        assert primitive_func(case) is True\n\n    def test_nan(self):\n        primitive_func = self.primitive().get_function()\n        case = pd.Series([-1, 1, 3, np.nan, 5, 9])\n        assert primitive_func(case) is True\n\n        primitive_func = self.primitive().get_function()\n        case = pd.Series([9, 5, 3, np.nan, 1, -1])\n        assert primitive_func(case) is False\n\n    def test_with_featuretools(self, es):\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        primitive_instantiate = self.primitive()\n        
aggregation.append(primitive_instantiate)\n        valid_dfs(es, aggregation, transform, self.primitive)\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/aggregation_primitive_tests/test_count_aggregation_primitives.py",
    "content": "import numpy as np\nimport pandas as pd\nfrom pytest import raises\n\nfrom featuretools.primitives import (\n    CountAboveMean,\n    CountGreaterThan,\n    CountInsideNthSTD,\n    CountInsideRange,\n    CountLessThan,\n    CountOutsideNthSTD,\n    CountOutsideRange,\n)\nfrom featuretools.tests.primitive_tests.utils import PrimitiveTestBase\n\n\nclass TestCountAboveMean(PrimitiveTestBase):\n    primitive = CountAboveMean\n\n    def test_regular(self):\n        data = pd.Series([1, 2, 3, 4, 5])\n        expected = 2\n        primitive_func = self.primitive().get_function()\n        actual = primitive_func(data)\n        assert expected == actual\n\n        data = pd.Series([1, 2, 3.1, 4, 5])\n        expected = 3\n        primitive_func = self.primitive().get_function()\n        actual = primitive_func(data)\n        assert expected == actual\n\n    def test_nan_without_ignore_nan(self):\n        data = pd.Series([np.nan, 1, 2, 3, 4, 5, np.nan, np.nan])\n        expected = np.nan\n\n        primitive_func = self.primitive(skipna=False).get_function()\n        actual = primitive_func(data)\n        assert np.isnan(actual) == np.isnan(expected)\n\n        data = pd.Series([np.nan])\n        primitive_func = self.primitive(skipna=False).get_function()\n        actual = primitive_func(data)\n        assert np.isnan(actual) == np.isnan(expected)\n\n    def test_nan_with_ignore_nan(self):\n        data = pd.Series([np.nan, 1, 2, 3, 4, 5, np.nan, np.nan])\n        expected = 2\n        primitive_func = self.primitive(skipna=True).get_function()\n        actual = primitive_func(data)\n        assert expected == actual\n\n        data = pd.Series([np.nan, 1, 2, 3.1, 4, 5, np.nan, np.nan])\n        expected = 3\n        primitive_func = self.primitive(skipna=True).get_function()\n        actual = primitive_func(data)\n        assert expected == actual\n\n        data = pd.Series([np.nan])\n        expected = np.nan\n        primitive_func = 
self.primitive(skipna=True).get_function()\n        actual = primitive_func(data)\n        assert np.isnan(actual) == np.isnan(expected)\n\n    def test_inf(self):\n        data = pd.Series([np.NINF, 1, 2, 3, 4, 5])\n        expected = 5\n        primitive_func = self.primitive().get_function()\n        actual = primitive_func(data)\n        assert expected == actual\n\n        data = pd.Series([1, 2, 3, 4, 5, np.inf])\n        expected = 0\n        primitive_func = self.primitive().get_function()\n        actual = primitive_func(data)\n        assert expected == actual\n\n        data = pd.Series([np.NINF, 1, 2, 3, 4, 5, np.inf])\n        expected = np.nan\n        primitive_func = self.primitive().get_function()\n        actual = primitive_func(data)\n        assert np.isnan(actual) == np.isnan(expected)\n\n        primitive_func = self.primitive(skipna=False).get_function()\n        actual = primitive_func(data)\n        assert np.isnan(actual) == np.isnan(expected)\n\n\nclass TestCountGreaterThan(PrimitiveTestBase):\n    primitive = CountGreaterThan\n\n    def compare_results(self, data, thresholds, results):\n        for threshold, result in zip(thresholds, results):\n            primitive = self.primitive(threshold=threshold)\n            function = primitive.get_function()\n            assert function(data) == result\n            assert isinstance(function(data), np.int64)\n\n    def test_regular(self):\n        data = pd.Series([-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5])\n        thresholds = pd.Series([-5, -2, 0, 2, 5])\n        results = pd.Series([10, 7, 5, 3, 0])\n        self.compare_results(data, thresholds, results)\n\n    def test_edges(self):\n        data = pd.Series([-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5])\n        thresholds = pd.Series([np.inf, np.NINF, None, np.nan])\n        results = pd.Series([0, len(data), 0, 0])\n        self.compare_results(data, thresholds, results)\n\n    def test_nans(self):\n        data = pd.Series([-5, -4, -3, np.inf, 
np.NINF, np.nan, 1, 2, 3, 4, 5])\n        thresholds = pd.Series([np.inf, np.NINF, None, 0, np.nan])\n        results = pd.Series([0, 9, 0, 6, 0])\n        self.compare_results(data, thresholds, results)\n\n\nclass TestCountInsideNthSTD:\n    primitive = CountInsideNthSTD\n\n    def test_normal_distribution(self):\n        x = pd.Series(\n            [\n                -76.0,\n                41.0,\n                -43.0,\n                -152.0,\n                -89.0,\n                28.0,\n                49.0,\n                298.0,\n                -132.0,\n                146.0,\n                -107.0,\n                -26.0,\n                26.0,\n                -81.0,\n                116.0,\n                -217.0,\n                -102.0,\n                144.0,\n                120.0,\n                -130.0,\n            ],\n        )\n\n        first_outliers = [-152.0, 298.0, 146.0, 116.0, -217.0, 144.0, 120.0]\n        primitive_instance = self.primitive(1)\n        primitive_func = primitive_instance.get_function()\n        assert primitive_func(x) == len(x) - len(first_outliers)\n\n        second_outliers = [298.0]\n        primitive_instance = self.primitive(2)\n        primitive_func = primitive_instance.get_function()\n        assert primitive_func(x) == len(x) - len(second_outliers)\n\n    def test_poisson_distribution(self):\n        x = pd.Series(\n            [\n                1,\n                1,\n                3,\n                3,\n                0,\n                0,\n                1,\n                3,\n                3,\n                1,\n                2,\n                3,\n                2,\n                0,\n                1,\n                3,\n                2,\n                1,\n                0,\n                2,\n            ],\n        )\n\n        first_outliers = [3, 3, 0, 0, 3, 3, 3, 0, 3, 0]\n        primitive_instance = self.primitive(1)\n        primitive_func = 
primitive_instance.get_function()\n        assert primitive_func(x) == len(x) - len(first_outliers)\n\n        second_outliers = []\n        primitive_instance = self.primitive(2)\n        primitive_func = primitive_instance.get_function()\n        assert primitive_func(x) == len(x) - len(second_outliers)\n\n    def test_nan(self):\n        # test if function ignores nan values\n        x = pd.Series(\n            [\n                -76.0,\n                41.0,\n                -43.0,\n                -152.0,\n                -89.0,\n                28.0,\n                49.0,\n                298.0,\n                -132.0,\n                146.0,\n                -107.0,\n                -26.0,\n                26.0,\n                -81.0,\n                116.0,\n                -217.0,\n                -102.0,\n                144.0,\n                120.0,\n                -130.0,\n            ],\n        )\n        x = pd.concat([x, pd.Series([np.nan] * 20)])\n        first_outliers = [-152.0, 298.0, 146.0, 116.0, -217.0, 144.0, 120.0]\n        primitive_instance = self.primitive(1)\n        primitive_func = primitive_instance.get_function()\n        assert primitive_func(x) == len(x) - len(first_outliers) - 20\n\n        # test a series with all nan values\n        x = pd.Series([np.nan] * 20)\n\n        primitive_instance = self.primitive(1)\n        primitive_func = primitive_instance.get_function()\n        assert primitive_func(x) == 0\n\n    def test_negative_n(self):\n        with raises(ValueError):\n            self.primitive(-1)\n\n\nclass TestCountInsideRange(PrimitiveTestBase):\n    primitive = CountInsideRange\n\n    def test_integer_range(self):\n        # all integers from -100 to 100\n        x = pd.Series(np.arange(-100, 101, 1))\n        primitive_instance = self.primitive(-100, 100)\n        primitive_func = primitive_instance.get_function()\n        assert primitive_func(x) == 201\n\n        primitive_instance = self.primitive(-50, 
50)\n        primitive_func = primitive_instance.get_function()\n        assert primitive_func(x) == 101\n\n        primitive_instance = self.primitive(1, 1)\n        primitive_func = primitive_instance.get_function()\n        assert primitive_func(x) == 1\n\n    def test_float_range(self):\n        x = pd.Series(np.linspace(-3, 3, 10))\n\n        primitive_instance = self.primitive(-3, 3)\n        primitive_func = primitive_instance.get_function()\n        assert primitive_func(x) == 10\n\n        primitive_instance = self.primitive(-0.34, 1.68)\n        primitive_func = primitive_instance.get_function()\n        assert primitive_func(x) == 4\n\n        primitive_instance = self.primitive(-3, -3)\n        primitive_func = primitive_instance.get_function()\n        assert primitive_func(x) == 1\n\n    def test_nan(self):\n        x = pd.Series(np.linspace(-3, 3, 10))\n        x = pd.concat([x, pd.Series([np.nan] * 20)])\n\n        primitive_instance = self.primitive(-0.34, 1.68)\n        primitive_func = primitive_instance.get_function()\n        assert primitive_func(x) == 4\n\n        primitive_instance = self.primitive(-3, 3, False)\n        primitive_func = primitive_instance.get_function()\n        assert np.isnan(primitive_func(x))\n\n    def test_inf(self):\n        x = pd.Series(np.linspace(-3, 3, 10))\n        num_NINF = 20\n        x = pd.concat([x, pd.Series([np.NINF] * num_NINF)])\n        num_inf = 10\n        x = pd.concat([x, pd.Series([np.inf] * num_inf)])\n\n        primitive_instance = self.primitive(-3, 3)\n        primitive_func = primitive_instance.get_function()\n        assert primitive_func(x) == 10\n\n        primitive_instance = self.primitive(np.NINF, 3)\n        primitive_func = primitive_instance.get_function()\n        assert primitive_func(x) == 10 + num_NINF\n\n        primitive_instance = self.primitive(-3, np.inf)\n        primitive_func = primitive_instance.get_function()\n        assert primitive_func(x) == 10 + 
num_inf\n\n\nclass TestCountLessThan(PrimitiveTestBase):\n    primitive = CountLessThan\n\n    def compare_answers(self, data, thresholds, answers):\n        for threshold, answer in zip(thresholds, answers):\n            primitive = self.primitive(threshold=threshold)\n            function = primitive.get_function()\n            assert function(data) == answer\n            assert isinstance(function(data), np.int64)\n\n    def test_regular(self):\n        data = pd.Series([-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5])\n        thresholds = pd.Series([-5, -2, 0, 2, 5])\n        answers = pd.Series([0, 3, 5, 7, 10])\n        self.compare_answers(data, thresholds, answers)\n\n    def test_edges(self):\n        data = pd.Series([-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5])\n        thresholds = pd.Series([np.inf, np.NINF, None, np.nan])\n        answers = pd.Series([len(data), 0, 0, 0])\n        self.compare_answers(data, thresholds, answers)\n\n    def test_nans(self):\n        data = pd.Series([-5, -4, -3, np.inf, np.NINF, np.nan, 1, 2, 3, 4, 5])\n        thresholds = pd.Series([np.inf, np.NINF, None, 0, np.nan])\n        answers = pd.Series([9, 0, 0, 4, 0])\n        self.compare_answers(data, thresholds, answers)\n\n\nclass TestCountOutsideNthSTD(PrimitiveTestBase):\n    primitive = CountOutsideNthSTD\n\n    def test_normal_distribution(self):\n        x = pd.Series(\n            [\n                10,\n                386,\n                479,\n                627,\n                20,\n                523,\n                482,\n                483,\n                542,\n                699,\n                535,\n                617,\n                577,\n                471,\n                615,\n                583,\n                441,\n                562,\n                563,\n                527,\n                453,\n                530,\n                433,\n                541,\n                585,\n                704,\n                443,\n               
 569,\n                430,\n                637,\n                331,\n                511,\n                552,\n                496,\n                484,\n                566,\n                554,\n                472,\n                335,\n                440,\n                579,\n                341,\n                545,\n                615,\n                548,\n                604,\n                439,\n                556,\n                442,\n                461,\n                624,\n                611,\n                444,\n                578,\n                405,\n                487,\n                490,\n                496,\n                398,\n                512,\n                422,\n                455,\n                449,\n                432,\n                607,\n                679,\n                434,\n                597,\n                639,\n                565,\n                415,\n                486,\n                668,\n                414,\n                665,\n                763,\n                557,\n                304,\n                404,\n                454,\n                689,\n                610,\n                483,\n                441,\n                657,\n                590,\n                492,\n                476,\n                437,\n                483,\n                529,\n                363,\n                711,\n                543,\n            ],\n        )\n        outliers = [10, 20, 763]\n        primitive_instance = self.primitive(2)\n        primitive_func = primitive_instance.get_function()\n        assert primitive_func(x) == len(outliers)\n\n    def test_poisson_distribution(self):\n        x = pd.Series(\n            [\n                1,\n                1,\n                3,\n                3,\n                0,\n                0,\n                1,\n                3,\n                3,\n                1,\n                2,\n                
3,\n                2,\n                0,\n                1,\n                3,\n                2,\n                1,\n                0,\n                2,\n            ],\n        )\n\n        primitive_instance = self.primitive(1)\n        primitive_func = primitive_instance.get_function()\n        assert primitive_func(x) == 10\n\n        primitive_instance = self.primitive(2)\n        primitive_func = primitive_instance.get_function()\n        assert primitive_func(x) == 0\n\n    def test_nan(self):\n        # test if function ignores nan values\n        x = pd.Series(\n            [\n                -76.0,\n                41.0,\n                -43.0,\n                -152.0,\n                -89.0,\n                28.0,\n                49.0,\n                298.0,\n                -132.0,\n                146.0,\n                -107.0,\n                -26.0,\n                26.0,\n                -81.0,\n                116.0,\n                -217.0,\n                -102.0,\n                144.0,\n                120.0,\n                -130.0,\n            ],\n        )\n        x = pd.concat([x, pd.Series([np.nan] * 20)])\n        primitive_instance = self.primitive(1)\n        primitive_func = primitive_instance.get_function()\n        assert primitive_func(x) == 7\n\n        # test a series with all nan values\n        x = pd.Series([np.nan] * 20)\n\n        primitive_instance = self.primitive(1)\n        primitive_func = primitive_instance.get_function()\n        assert primitive_func(x) == 0\n\n    def test_negative_n(self):\n        with raises(ValueError):\n            self.primitive(-1)\n\n\nclass TestCountOutsideRange(PrimitiveTestBase):\n    primitive = CountOutsideRange\n\n    def test_integer_range(self):\n        # all integers from -100 to 100\n        x = pd.Series(np.arange(-100, 101, 1))\n        primitive_instance = CountOutsideRange(-100, 100)\n        primitive_func = primitive_instance.get_function()\n        assert 
primitive_func(x) == 0\n\n        primitive_instance = CountOutsideRange(-50, 50)\n        primitive_func = primitive_instance.get_function()\n        assert primitive_func(x) == 100\n\n        primitive_instance = CountOutsideRange(1, 1)\n        primitive_func = primitive_instance.get_function()\n        assert primitive_func(x) == len(x) - 1\n\n    def test_float_range(self):\n        x = pd.Series(np.linspace(-3, 3, 10))\n\n        primitive_instance = CountOutsideRange(-3, 3)\n        primitive_func = primitive_instance.get_function()\n        assert primitive_func(x) == 0\n\n        primitive_instance = CountOutsideRange(-0.34, 1.68)\n        primitive_func = primitive_instance.get_function()\n        assert primitive_func(x) == 6\n\n        primitive_instance = CountOutsideRange(-3, -3)\n        primitive_func = primitive_instance.get_function()\n        assert primitive_func(x) == 9\n\n    def test_nan(self):\n        x = pd.Series(np.linspace(-3, 3, 10))\n        x = pd.concat([x, pd.Series([np.nan] * 20)])\n        primitive_instance = CountOutsideRange(-0.34, 1.68)\n        primitive_func = primitive_instance.get_function()\n        assert primitive_func(x) == 6\n\n        primitive_instance = CountOutsideRange(-3, 3, False)\n        primitive_func = primitive_instance.get_function()\n        assert np.isnan(primitive_func(x))\n\n    def test_inf(self):\n        x = pd.Series(np.linspace(-3, 3, 10))\n        num_NINF = 20\n        x = pd.concat([x, pd.Series([np.NINF] * num_NINF)])\n        num_inf = 10\n        x = pd.concat([x, pd.Series([np.inf] * num_inf)])\n\n        primitive_instance = CountOutsideRange(-3, 3)\n        primitive_func = primitive_instance.get_function()\n        assert primitive_func(x) == num_inf + num_NINF\n\n        primitive_instance = CountOutsideRange(-0.34, 1.68)\n        primitive_func = primitive_instance.get_function()\n        assert primitive_func(x) == 6 + num_inf + num_NINF\n\n        primitive_instance = 
CountOutsideRange(np.NINF, 3)\n        primitive_func = primitive_instance.get_function()\n        assert primitive_func(x) == num_inf\n\n        primitive_instance = CountOutsideRange(-3, np.inf)\n        primitive_func = primitive_instance.get_function()\n        assert primitive_func(x) == num_NINF\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/aggregation_primitive_tests/test_max_consecutive.py",
    "content": "import numpy as np\nimport pandas as pd\nimport pytest\n\nfrom featuretools.primitives import (\n    MaxConsecutiveFalse,\n    MaxConsecutiveNegatives,\n    MaxConsecutivePositives,\n    MaxConsecutiveTrue,\n    MaxConsecutiveZeros,\n)\n\n\nclass TestMaxConsecutiveFalse:\n    def test_regular(self):\n        primitive_instance = MaxConsecutiveFalse()\n        primitive_func = primitive_instance.get_function()\n        array = pd.Series([False, False, False, True, True, False, True], dtype=\"bool\")\n        assert primitive_func(array) == 3\n\n    def test_all_true(self):\n        primitive_instance = MaxConsecutiveFalse()\n        primitive_func = primitive_instance.get_function()\n        array = pd.Series([True, True, True, True], dtype=\"bool\")\n        assert primitive_func(array) == 0\n\n    def test_all_false(self):\n        primitive_instance = MaxConsecutiveFalse()\n        primitive_func = primitive_instance.get_function()\n        array = pd.Series([False, False, False], dtype=\"bool\")\n        assert primitive_func(array) == 3\n\n\nclass TestMaxConsecutiveTrue:\n    def test_regular(self):\n        primitive_instance = MaxConsecutiveTrue()\n        primitive_func = primitive_instance.get_function()\n        array = pd.Series([True, False, True, True, True, False, True], dtype=\"bool\")\n        assert primitive_func(array) == 3\n\n    def test_all_true(self):\n        primitive_instance = MaxConsecutiveTrue()\n        primitive_func = primitive_instance.get_function()\n        array = pd.Series([True, True, True, True], dtype=\"bool\")\n        assert primitive_func(array) == 4\n\n    def test_all_false(self):\n        primitive_instance = MaxConsecutiveTrue()\n        primitive_func = primitive_instance.get_function()\n        array = pd.Series([False, False, False], dtype=\"bool\")\n        assert primitive_func(array) == 0\n\n\n@pytest.mark.parametrize(\"dtype\", [\"float64\", \"int64\"])\nclass TestMaxConsecutiveNegatives:\n    def 
test_regular(self, dtype):\n        if dtype == \"int64\":\n            pytest.skip(\"test array contains floats which are not supported int64\")\n        primitive_instance = MaxConsecutiveNegatives()\n        primitive_func = primitive_instance.get_function()\n        array = pd.Series([1.3, -3.4, -1, -4, 10, -1.7, -4.9], dtype=dtype)\n        assert primitive_func(array) == 3\n\n    def test_all_int(self, dtype):\n        primitive_instance = MaxConsecutiveNegatives()\n        primitive_func = primitive_instance.get_function()\n        array = pd.Series([1, -1, 2, 4, -5], dtype=dtype)\n        assert primitive_func(array) == 1\n\n    def test_all_float(self, dtype):\n        if dtype == \"int64\":\n            pytest.skip(\"test array contains floats which are not supported int64\")\n        primitive_instance = MaxConsecutiveNegatives()\n        primitive_func = primitive_instance.get_function()\n        array = pd.Series([1.0, -1.0, -2.0, 0.0, 5.0], dtype=dtype)\n        assert primitive_func(array) == 2\n\n    def test_with_nan(self, dtype):\n        if dtype == \"int64\":\n            pytest.skip(\"nans not supported in int64\")\n        primitive_instance = MaxConsecutiveNegatives()\n        primitive_func = primitive_instance.get_function()\n        array = pd.Series([1, np.nan, -2, -3], dtype=dtype)\n        assert primitive_func(array) == 2\n\n    def test_with_nan_skipna(self, dtype):\n        if dtype == \"int64\":\n            pytest.skip(\"nans not supported in int64\")\n        primitive_instance = MaxConsecutiveNegatives(skipna=False)\n        primitive_func = primitive_instance.get_function()\n        array = pd.Series([-1, np.nan, -2, -3], dtype=dtype)\n        assert primitive_func(array) == 2\n\n    def test_all_nan(self, dtype):\n        if dtype == \"int64\":\n            pytest.skip(\"nans not supported in int64\")\n        primitive_instance = MaxConsecutiveNegatives()\n        primitive_func = primitive_instance.get_function()\n        
array = pd.Series([np.nan, np.nan, np.nan, np.nan], dtype=dtype)\n        assert np.isnan(primitive_func(array))\n\n    def test_all_nan_skipna(self, dtype):\n        if dtype == \"int64\":\n            pytest.skip(\"nans not supported in int64\")\n        primitive_instance = MaxConsecutiveNegatives(skipna=True)\n        primitive_func = primitive_instance.get_function()\n        array = pd.Series([np.nan, np.nan, np.nan, np.nan], dtype=dtype)\n        assert np.isnan(primitive_func(array))\n\n\n@pytest.mark.parametrize(\"dtype\", [\"float64\", \"int64\"])\nclass TestMaxConsecutivePositives:\n    def test_regular(self, dtype):\n        if dtype == \"int64\":\n            pytest.skip(\"test array contains floats which are not supported int64\")\n        primitive_instance = MaxConsecutivePositives()\n        primitive_func = primitive_instance.get_function()\n        array = pd.Series([1.3, -3.4, 1, 4, 10, -1.7, -4.9], dtype=dtype)\n        assert primitive_func(array) == 3\n\n    def test_all_int(self, dtype):\n        primitive_instance = MaxConsecutivePositives()\n        primitive_func = primitive_instance.get_function()\n        array = pd.Series([1, -1, 2, 4, -5], dtype=dtype)\n        assert primitive_func(array) == 2\n\n    def test_all_float(self, dtype):\n        if dtype == \"int64\":\n            pytest.skip(\"test array contains floats which are not supported int64\")\n        primitive_instance = MaxConsecutivePositives()\n        primitive_func = primitive_instance.get_function()\n        array = pd.Series([1.0, -1.0, 2.0, 4.0, 5.0], dtype=dtype)\n        assert primitive_func(array) == 3\n\n    def test_with_nan(self, dtype):\n        if dtype == \"int64\":\n            pytest.skip(\"nans not supported in int64\")\n        primitive_instance = MaxConsecutivePositives()\n        primitive_func = primitive_instance.get_function()\n        array = pd.Series([1, np.nan, 2, -3], dtype=dtype)\n        assert primitive_func(array) == 2\n\n    def 
test_with_nan_skipna(self, dtype):\n        if dtype == \"int64\":\n            pytest.skip(\"nans not supported in int64\")\n        primitive_instance = MaxConsecutivePositives(skipna=False)\n        primitive_func = primitive_instance.get_function()\n        array = pd.Series([1, np.nan, 2, -3], dtype=dtype)\n        assert primitive_func(array) == 1\n\n    def test_all_nan(self, dtype):\n        if dtype == \"int64\":\n            pytest.skip(\"nans not supported in int64\")\n        primitive_instance = MaxConsecutivePositives()\n        primitive_func = primitive_instance.get_function()\n        array = pd.Series([np.nan, np.nan, np.nan, np.nan], dtype=dtype)\n        assert np.isnan(primitive_func(array))\n\n    def test_all_nan_skipna(self, dtype):\n        if dtype == \"int64\":\n            pytest.skip(\"nans not supported in int64\")\n        primitive_instance = MaxConsecutivePositives(skipna=True)\n        primitive_func = primitive_instance.get_function()\n        array = pd.Series([np.nan, np.nan, np.nan, np.nan], dtype=dtype)\n        assert np.isnan(primitive_func(array))\n\n\n@pytest.mark.parametrize(\"dtype\", [\"float64\", \"int64\"])\nclass TestMaxConsecutiveZeros:\n    def test_regular(self, dtype):\n        if dtype == \"int64\":\n            pytest.skip(\"test array contains floats which are not supported by int64\")\n        primitive_instance = MaxConsecutiveZeros()\n        primitive_func = primitive_instance.get_function()\n        array = pd.Series([1.3, -3.4, 0, 0, 0.0, 1.7, -4.9], dtype=dtype)\n        assert primitive_func(array) == 3\n\n    def test_all_int(self, dtype):\n        primitive_instance = MaxConsecutiveZeros()\n        primitive_func = primitive_instance.get_function()\n        array = pd.Series([1, -1, 0, 0, -5], dtype=dtype)\n        assert primitive_func(array) == 2\n\n    def test_all_float(self, dtype):\n        if dtype == \"int64\":\n            pytest.skip(\"test array contains floats which are not supported by 
int64\")\n        primitive_instance = MaxConsecutiveZeros()\n        primitive_func = primitive_instance.get_function()\n        array = pd.Series([1.0, 0.0, 0.0, 0.0, -5.3], dtype=dtype)\n        assert primitive_func(array) == 3\n\n    def test_with_nan(self, dtype):\n        if dtype == \"int64\":\n            pytest.skip(\"nans not supported in int64\")\n        primitive_instance = MaxConsecutiveZeros()\n        primitive_func = primitive_instance.get_function()\n        array = pd.Series([0, np.nan, 0, -3], dtype=dtype)\n        assert primitive_func(array) == 2\n\n    def test_with_nan_skipna(self, dtype):\n        if dtype == \"int64\":\n            pytest.skip(\"nans not supported in int64\")\n        primitive_instance = MaxConsecutiveZeros(skipna=False)\n        primitive_func = primitive_instance.get_function()\n        array = pd.Series([0, np.nan, 0, -3], dtype=dtype)\n        assert primitive_func(array) == 1\n\n    def test_all_nan(self, dtype):\n        if dtype == \"int64\":\n            pytest.skip(\"nans not supported in int64\")\n        primitive_instance = MaxConsecutiveZeros()\n        primitive_func = primitive_instance.get_function()\n        array = pd.Series([np.nan, np.nan, np.nan, np.nan], dtype=dtype)\n        assert np.isnan(primitive_func(array))\n\n    def test_all_nan_skipna(self, dtype):\n        if dtype == \"int64\":\n            pytest.skip(\"nans not supported in int64\")\n        primitive_instance = MaxConsecutiveZeros(skipna=True)\n        primitive_func = primitive_instance.get_function()\n        array = pd.Series([np.nan, np.nan, np.nan, np.nan], dtype=dtype)\n        assert np.isnan(primitive_func(array))\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/aggregation_primitive_tests/test_num_consecutive.py",
    "content": "import numpy as np\nimport pandas as pd\n\nfrom featuretools.primitives import NumConsecutiveGreaterMean, NumConsecutiveLessMean\n\n\nclass TestNumConsecutiveGreaterMean:\n    primitive = NumConsecutiveGreaterMean\n\n    def test_continuous_range(self):\n        x = pd.Series(range(10))\n        longest_sequence = [5, 6, 7, 8, 9]\n        primitive_instance = self.primitive()\n        primitive_func = primitive_instance.get_function()\n        assert primitive_func(x) == len(longest_sequence)\n\n    def test_subsequence_in_middle(self):\n        x = pd.Series(\n            [\n                0.6,\n                0.18,\n                1.11,\n                -0.19,\n                0.25,\n                -1.41,\n                0.54,\n                0.29,\n                -1.59,\n                1.67,\n                1.19,\n                0.44,\n                2.39,\n                -1.38,\n                0.15,\n                -1.16,\n                1.54,\n                -0.34,\n                -1.41,\n                0.58,\n            ],\n        )\n        longest_sequence = [1.67, 1.19, 0.44, 2.39]\n        primitive_instance = self.primitive()\n        primitive_func = primitive_instance.get_function()\n        assert primitive_func(x) == len(longest_sequence)\n\n    def test_subsequence_at_start(self):\n        x = pd.Series(\n            [\n                1.67,\n                1.19,\n                0.44,\n                2.39,\n                -0.19,\n                0.6,\n                0.18,\n                1.11,\n                0.25,\n                -1.41,\n                0.54,\n                0.29,\n                -1.59,\n                -1.38,\n                0.15,\n                -1.16,\n                1.54,\n                -0.34,\n                -1.41,\n                0.58,\n            ],\n        )\n        longest_sequence = [1.67, 1.19, 0.44, 2.39]\n        primitive_instance = self.primitive()\n        
primitive_func = primitive_instance.get_function()\n        assert primitive_func(x) == len(longest_sequence)\n\n    def test_subsequence_at_end(self):\n        x = pd.Series(\n            [\n                0.6,\n                0.18,\n                1.11,\n                -0.19,\n                0.25,\n                -1.41,\n                0.54,\n                0.29,\n                -1.59,\n                -1.38,\n                0.15,\n                -1.16,\n                1.54,\n                -0.34,\n                0.58,\n                -1.41,\n                1.67,\n                1.19,\n                0.44,\n                2.39,\n            ],\n        )\n        longest_sequence = [1.67, 1.19, 0.44, 2.39]\n        primitive_instance = self.primitive()\n        primitive_func = primitive_instance.get_function()\n        assert primitive_func(x) == len(longest_sequence)\n\n    def test_nan(self):\n        x = pd.Series(range(10))\n        x = pd.concat([x, pd.Series([np.nan] * 20)])\n        longest_sequence = [5, 6, 7, 8, 9]\n\n        # test ignoring NaN values\n        primitive_instance = self.primitive()\n        primitive_func = primitive_instance.get_function()\n        assert primitive_func(x) == len(longest_sequence)\n\n        # test skipna=False\n        primitive_instance = self.primitive(skipna=False)\n        primitive_func = primitive_instance.get_function()\n        assert np.isnan(primitive_func(x))\n\n    def test_inf(self):\n        primitive_instance = self.primitive()\n        primitive_func = primitive_instance.get_function()\n\n        x = pd.Series(range(10))\n        x = pd.concat([x, pd.Series([np.inf])])\n        assert primitive_func(x) == 0\n\n        x = pd.Series(range(10))\n        x = pd.concat([x, pd.Series([-np.inf])])\n        assert primitive_func(x) == 10\n\n        x = pd.Series(range(10))\n        x = pd.concat([x, pd.Series([-np.inf, np.inf, np.inf])])\n        assert 
np.isnan(primitive_func(x))\n\n\nclass TestNumConsecutiveLessMean:\n    primitive = NumConsecutiveLessMean\n\n    def test_continuous_range(self):\n        x = pd.Series(range(10))\n        longest_sequence = [0, 1, 2, 3, 4]\n        primitive_instance = self.primitive()\n        primitive_func = primitive_instance.get_function()\n        assert primitive_func(x) == len(longest_sequence)\n\n    def test_subsequence_in_middle(self):\n        x = pd.Series(\n            [\n                0.6,\n                0.18,\n                1.11,\n                -0.19,\n                0.25,\n                -1.41,\n                0.54,\n                0.29,\n                -1.59,\n                1.67,\n                1.19,\n                0.44,\n                2.39,\n                -1.38,\n                0.15,\n                -1.16,\n                1.54,\n                -0.34,\n                -1.41,\n                0.58,\n            ],\n        )\n        longest_sequence = [-1.38, 0.15, -1.16]\n        primitive_instance = self.primitive()\n        primitive_func = primitive_instance.get_function()\n        assert primitive_func(x) == len(longest_sequence)\n\n    def test_subsequence_at_start(self):\n        x = pd.Series(\n            [\n                -1.38,\n                0.15,\n                -1.16,\n                0.6,\n                0.18,\n                1.11,\n                -0.19,\n                0.25,\n                -1.41,\n                0.54,\n                0.29,\n                -1.59,\n                1.67,\n                1.19,\n                0.44,\n                2.39,\n                1.54,\n                -0.34,\n                -1.41,\n                0.58,\n            ],\n        )\n        longest_sequence = [-1.38, 0.15, -1.16]\n        primitive_instance = self.primitive()\n        primitive_func = primitive_instance.get_function()\n        assert primitive_func(x) == len(longest_sequence)\n\n    def 
test_subsequence_at_end(self):\n        x = pd.Series(\n            [\n                0.6,\n                0.18,\n                1.11,\n                -0.19,\n                0.25,\n                -1.41,\n                0.54,\n                0.29,\n                -1.59,\n                1.67,\n                1.19,\n                0.44,\n                2.39,\n                1.54,\n                -0.34,\n                -1.41,\n                0.58,\n                -1.38,\n                0.15,\n                -1.16,\n            ],\n        )\n        longest_sequence = [-1.38, 0.15, -1.16]\n        primitive_instance = self.primitive()\n        primitive_func = primitive_instance.get_function()\n        assert primitive_func(x) == len(longest_sequence)\n\n    def test_nan(self):\n        x = pd.Series(range(10))\n        x = pd.concat([x, pd.Series([np.nan] * 20)])\n        longest_sequence = [0, 1, 2, 3, 4]\n\n        # test ignoring NaN values\n        primitive_instance = self.primitive()\n        primitive_func = primitive_instance.get_function()\n        assert primitive_func(x) == len(longest_sequence)\n\n        # test skipna=False\n        primitive_instance = self.primitive(skipna=False)\n        primitive_func = primitive_instance.get_function()\n        assert np.isnan(primitive_func(x))\n\n    def test_inf(self):\n        primitive_instance = self.primitive()\n        primitive_func = primitive_instance.get_function()\n\n        x = pd.Series(range(10))\n        x = pd.concat([x, pd.Series([np.inf])])\n        assert primitive_func(x) == 10\n\n        x = pd.Series(range(10))\n        x = pd.concat([x, pd.Series([-np.inf])])\n        assert primitive_func(x) == 0\n\n        x = pd.Series(range(10))\n        x = pd.concat([x, pd.Series([-np.inf, np.inf, np.inf])])\n        assert np.isnan(primitive_func(x))\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/aggregation_primitive_tests/test_percent_true.py",
    "content": "import pandas as pd\nfrom woodwork.logical_types import BooleanNullable\n\nimport featuretools as ft\n\n\ndef test_percent_true_default_value_with_dfs():\n    es = ft.EntitySet(id=\"customer_data\")\n\n    customers_df = pd.DataFrame(data={\"customer_id\": [1, 2]})\n    transactions_df = pd.DataFrame(\n        data={\"tx_id\": [1], \"customer_id\": [1], \"is_foo\": [True]},\n    )\n\n    es.add_dataframe(\n        dataframe_name=\"customers_df\",\n        dataframe=customers_df,\n        index=\"customer_id\",\n    )\n    es.add_dataframe(\n        dataframe_name=\"transactions_df\",\n        dataframe=transactions_df,\n        index=\"tx_id\",\n        logical_types={\"is_foo\": BooleanNullable},\n    )\n\n    es = es.add_relationship(\n        \"customers_df\",\n        \"customer_id\",\n        \"transactions_df\",\n        \"customer_id\",\n    )\n\n    feature_matrix, _ = ft.dfs(\n        entityset=es,\n        target_dataframe_name=\"customers_df\",\n        agg_primitives=[\"percent_true\"],\n    )\n\n    assert pd.isna(feature_matrix[\"PERCENT_TRUE(transactions_df.is_foo)\"][2])\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/aggregation_primitive_tests/test_rolling_primitive.py",
    "content": "import numpy as np\nimport pandas as pd\nimport pytest\n\nfrom featuretools.primitives import (\n    RollingCount,\n    RollingMax,\n    RollingMean,\n    RollingMin,\n    RollingOutlierCount,\n    RollingSTD,\n    RollingTrend,\n)\nfrom featuretools.primitives.standard.transform.time_series.utils import (\n    apply_rolling_agg_to_series,\n)\nfrom featuretools.tests.primitive_tests.utils import get_number_from_offset\n\n\n@pytest.mark.parametrize(\n    \"window_length, gap\",\n    [\n        (5, 2),\n        (5, 0),\n        (\"5d\", \"7d\"),\n        (\"5d\", \"0d\"),\n    ],\n)\n@pytest.mark.parametrize(\"min_periods\", [1, 0, 2, 5])\ndef test_rolling_max(min_periods, window_length, gap, window_series):\n    gap_num = get_number_from_offset(gap)\n    window_length_num = get_number_from_offset(window_length)\n    # Since we're using a uniform series we can check correctness using numeric parameters\n    expected_vals = apply_rolling_agg_to_series(\n        window_series,\n        lambda x: x.max(),\n        window_length_num,\n        gap=gap_num,\n        min_periods=min_periods,\n    )\n\n    primitive_instance = RollingMax(\n        window_length=window_length,\n        gap=gap,\n        min_periods=min_periods,\n    )\n    primitive_func = primitive_instance.get_function()\n\n    actual_vals = pd.Series(\n        primitive_func(window_series.index, pd.Series(window_series.values)),\n    )\n\n    # Since min_periods of 0 is the same as min_periods of 1\n    num_nans_from_min_periods = min_periods or 1\n\n    assert actual_vals.isna().sum() == gap_num + num_nans_from_min_periods - 1\n    pd.testing.assert_series_equal(pd.Series(expected_vals), actual_vals)\n\n\n@pytest.mark.parametrize(\n    \"window_length, gap\",\n    [\n        (5, 2),\n        (5, 0),\n        (\"5d\", \"7d\"),\n        (\"5d\", \"0d\"),\n    ],\n)\n@pytest.mark.parametrize(\"min_periods\", [1, 0, 2, 5])\ndef test_rolling_min(min_periods, window_length, gap, 
window_series):\n    gap_num = get_number_from_offset(gap)\n    window_length_num = get_number_from_offset(window_length)\n\n    # Since we're using a uniform series we can check correctness using numeric parameters\n    expected_vals = apply_rolling_agg_to_series(\n        window_series,\n        lambda x: x.min(),\n        window_length_num,\n        gap=gap_num,\n        min_periods=min_periods,\n    )\n\n    primitive_instance = RollingMin(\n        window_length=window_length,\n        gap=gap,\n        min_periods=min_periods,\n    )\n    primitive_func = primitive_instance.get_function()\n\n    actual_vals = pd.Series(\n        primitive_func(window_series.index, pd.Series(window_series.values)),\n    )\n\n    # Since min_periods of 0 is the same as min_periods of 1\n    num_nans_from_min_periods = min_periods or 1\n\n    assert actual_vals.isna().sum() == gap_num + num_nans_from_min_periods - 1\n    pd.testing.assert_series_equal(pd.Series(expected_vals), actual_vals)\n\n\n@pytest.mark.parametrize(\n    \"window_length, gap\",\n    [\n        (5, 2),\n        (5, 0),\n        (\"5d\", \"7d\"),\n        (\"5d\", \"0d\"),\n    ],\n)\n@pytest.mark.parametrize(\"min_periods\", [1, 0, 2, 5])\ndef test_rolling_mean(min_periods, window_length, gap, window_series):\n    gap_num = get_number_from_offset(gap)\n    window_length_num = get_number_from_offset(window_length)\n\n    # Since we're using a uniform series we can check correctness using numeric parameters\n    expected_vals = apply_rolling_agg_to_series(\n        window_series,\n        np.mean,\n        window_length_num,\n        gap=gap_num,\n        min_periods=min_periods,\n    )\n\n    primitive_instance = RollingMean(\n        window_length=window_length,\n        gap=gap,\n        min_periods=min_periods,\n    )\n    primitive_func = primitive_instance.get_function()\n\n    actual_vals = pd.Series(\n        primitive_func(window_series.index, pd.Series(window_series.values)),\n    )\n\n    # Since 
min_periods of 0 is the same as min_periods of 1\n    num_nans_from_min_periods = min_periods or 1\n\n    assert actual_vals.isna().sum() == gap_num + num_nans_from_min_periods - 1\n    pd.testing.assert_series_equal(pd.Series(expected_vals), actual_vals)\n\n\n@pytest.mark.parametrize(\n    \"window_length, gap\",\n    [\n        (5, 2),\n        (5, 0),\n        (\"5d\", \"7d\"),\n        (\"5d\", \"0d\"),\n    ],\n)\n@pytest.mark.parametrize(\"min_periods\", [1, 0, 2, 5])\ndef test_rolling_std(min_periods, window_length, gap, window_series):\n    gap_num = get_number_from_offset(gap)\n    window_length_num = get_number_from_offset(window_length)\n\n    # Since we're using a uniform series we can check correctness using numeric parameters\n    expected_vals = apply_rolling_agg_to_series(\n        window_series,\n        lambda x: x.std(),\n        window_length_num,\n        gap=gap_num,\n        min_periods=min_periods,\n    )\n\n    primitive_instance = RollingSTD(\n        window_length=window_length,\n        gap=gap,\n        min_periods=min_periods,\n    )\n    primitive_func = primitive_instance.get_function()\n\n    actual_vals = pd.Series(\n        primitive_func(window_series.index, pd.Series(window_series.values)),\n    )\n\n    # Since min_periods of 0 is the same as min_periods of 1\n    num_nans_from_min_periods = min_periods or 2\n\n    if min_periods in [0, 1]:\n        # the additional nan is because std pandas function returns NaN if there's only one value\n        num_nans = gap_num + 1\n    else:\n        num_nans = gap_num + num_nans_from_min_periods - 1\n\n    # The extra 1 at the beginning is because the std pandas function returns NaN if there's only one value\n    assert actual_vals.isna().sum() == num_nans\n    pd.testing.assert_series_equal(pd.Series(expected_vals), actual_vals)\n\n\n@pytest.mark.parametrize(\n    \"window_length, gap\",\n    [\n        (5, 2),\n        (\"6d\", \"7d\"),\n    ],\n)\ndef test_rolling_count(window_length, 
gap, window_series):\n    gap_num = get_number_from_offset(gap)\n    window_length_num = get_number_from_offset(window_length)\n\n    expected_vals = apply_rolling_agg_to_series(\n        window_series,\n        lambda x: x.count(),\n        window_length_num,\n        gap=gap_num,\n    )\n\n    primitive_instance = RollingCount(\n        window_length=window_length,\n        gap=gap,\n        min_periods=window_length_num,\n    )\n    primitive_func = primitive_instance.get_function()\n\n    actual_vals = pd.Series(primitive_func(window_series.index))\n\n    num_nans = gap_num + window_length_num - 1\n    assert actual_vals.isna().sum() == num_nans\n    # RollingCount will not match the exact roll_series_with_gap call,\n    # because it handles the min_periods difference within the primitive\n    pd.testing.assert_series_equal(\n        pd.Series(expected_vals).iloc[num_nans:],\n        actual_vals.iloc[num_nans:],\n    )\n\n\n@pytest.mark.parametrize(\n    \"min_periods, expected_num_nans\",\n    [(0, 2), (1, 2), (3, 4), (5, 6)],  # 0 and 1 get treated the same\n)\n@pytest.mark.parametrize(\"window_length, gap\", [(\"5d\", \"2d\"), (5, 2)])\ndef test_rolling_count_primitive_min_periods_nans(\n    window_length,\n    gap,\n    min_periods,\n    expected_num_nans,\n    window_series,\n):\n    primitive_instance = RollingCount(\n        window_length=window_length,\n        gap=gap,\n        min_periods=min_periods,\n    )\n    primitive_func = primitive_instance.get_function()\n    vals = pd.Series(primitive_func(window_series.index))\n\n    assert vals.isna().sum() == expected_num_nans\n\n\n@pytest.mark.parametrize(\n    \"min_periods, expected_num_nans\",\n    [(0, 0), (1, 0), (3, 2), (5, 4)],  # 0 and 1 get treated the same\n)\n@pytest.mark.parametrize(\"window_length, gap\", [(\"5d\", \"0d\"), (5, 0)])\ndef test_rolling_count_with_no_gap(\n    window_length,\n    gap,\n    min_periods,\n    expected_num_nans,\n    window_series,\n):\n    primitive_instance = 
RollingCount(\n        window_length=window_length,\n        gap=gap,\n        min_periods=min_periods,\n    )\n    primitive_func = primitive_instance.get_function()\n    vals = pd.Series(primitive_func(window_series.index))\n\n    assert vals.isna().sum() == expected_num_nans\n\n\n@pytest.mark.parametrize(\n    \"window_length, gap, expected_vals\",\n    [\n        (3, 0, [np.nan, np.nan, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),\n        (\n            4,\n            1,\n            [np.nan, np.nan, np.nan, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],\n        ),\n        (\n            \"5d\",\n            \"7d\",\n            [\n                np.nan,\n                np.nan,\n                np.nan,\n                np.nan,\n                np.nan,\n                np.nan,\n                np.nan,\n                np.nan,\n                np.nan,\n                1,\n                1,\n                1,\n                1,\n                1,\n                1,\n                1,\n                1,\n                1,\n                1,\n                1,\n            ],\n        ),\n        (\n            \"5d\",\n            \"0d\",\n            [np.nan, np.nan, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],\n        ),\n    ],\n)\ndef test_rolling_trend(window_length, gap, expected_vals, window_series):\n    primitive_instance = RollingTrend(window_length=window_length, gap=gap)\n\n    actual_vals = primitive_instance(window_series.index, window_series.values)\n\n    pd.testing.assert_series_equal(pd.Series(expected_vals), pd.Series(actual_vals))\n\n\ndef test_rolling_trend_window_length_less_than_three(window_series):\n    primitive_instance = RollingTrend(window_length=2)\n\n    vals = primitive_instance(window_series.index, window_series.values)\n\n    for v in vals:\n        assert np.isnan(v)\n\n\n@pytest.mark.parametrize(\n    \"primitive\",\n    [\n        RollingCount,\n        RollingMax,\n        
RollingMin,\n        RollingMean,\n        RollingOutlierCount,\n    ],\n)\ndef test_rolling_primitives_non_uniform(primitive):\n    # When the data isn't uniform, this impacts the number of values in each rolling window\n    datetimes = (\n        list(pd.date_range(start=\"2017-01-01\", freq=\"1d\", periods=3))\n        + list(pd.date_range(start=\"2017-01-10\", freq=\"2d\", periods=4))\n        + list(pd.date_range(start=\"2017-01-22\", freq=\"1d\", periods=7))\n    )\n    no_freq_series = pd.Series(range(len(datetimes)), index=datetimes)\n\n    # Should match RollingCount exactly and have same nan values as other primitives\n    expected_series = pd.Series(\n        [None, 1, 2] + [None, 1, 1, 1] + [None, 1, 2, 3, 3, 3, 3],\n    )\n\n    primitive_instance = primitive(window_length=\"3d\", gap=\"1d\")\n    if isinstance(primitive_instance, RollingCount):\n        rolled_series = pd.Series(primitive_instance(no_freq_series.index))\n        pd.testing.assert_series_equal(rolled_series, expected_series)\n    else:\n        rolled_series = pd.Series(\n            primitive_instance(no_freq_series.index, pd.Series(no_freq_series.values)),\n        )\n        pd.testing.assert_series_equal(expected_series.isna(), rolled_series.isna())\n\n\ndef test_rolling_std_non_uniform():\n    # When the data isn't uniform, this impacts the number of values in each rolling window\n    datetimes = (\n        list(pd.date_range(start=\"2017-01-01\", freq=\"1d\", periods=3))\n        + list(pd.date_range(start=\"2017-01-10\", freq=\"2d\", periods=4))\n        + list(pd.date_range(start=\"2017-01-22\", freq=\"1d\", periods=7))\n    )\n    no_freq_series = pd.Series(range(len(datetimes)), index=datetimes)\n\n    # There will be at least two null values at the beginning of each range's rows, the first for the\n    # row skipped by the gap, and the second because pandas' std returns NaN if there's only one row\n    expected_series = pd.Series(\n        [None, None, 0.707107]\n        + 
[None, None, None, None]\n        + [  # Because the freq was 2 days, there will never be more than 1 observation\n            None,\n            None,\n            0.707107,\n            1.0,\n            1.0,\n            1.0,\n            1.0,\n        ],\n    )\n\n    primitive_instance = RollingSTD(window_length=\"3d\", gap=\"1d\")\n    rolled_series = pd.Series(\n        primitive_instance(no_freq_series.index, pd.Series(no_freq_series.values)),\n    )\n\n    pd.testing.assert_series_equal(rolled_series, expected_series)\n\n\ndef test_rolling_trend_non_uniform():\n    datetimes = (\n        list(pd.date_range(start=\"2017-01-01\", freq=\"1d\", periods=3))\n        + list(pd.date_range(start=\"2017-01-10\", freq=\"2d\", periods=4))\n        + list(pd.date_range(start=\"2017-01-22\", freq=\"1d\", periods=7))\n    )\n    no_freq_series = pd.Series(range(len(datetimes)), index=datetimes)\n    expected_series = pd.Series(\n        [None, None, None]\n        + [None, None, None, None]\n        + [\n            None,\n            None,\n            None,\n            1.0,\n            1.0,\n            1.0,\n            1.0,\n        ],\n    )\n    primitive_instance = RollingTrend(window_length=\"3d\", gap=\"1d\")\n    rolled_series = pd.Series(\n        primitive_instance(no_freq_series.index, pd.Series(no_freq_series.values)),\n    )\n    pd.testing.assert_series_equal(rolled_series, expected_series)\n\n\n@pytest.mark.parametrize(\n    \"window_length, gap\",\n    [\n        (5, 2),\n        (5, 0),\n        (\"5d\", \"7d\"),\n        (\"5d\", \"0d\"),\n    ],\n)\n@pytest.mark.parametrize(\n    \"min_periods\",\n    [1, 0, 2, 5],\n)\ndef test_rolling_outlier_count(\n    min_periods,\n    window_length,\n    gap,\n    rolling_outlier_series,\n):\n    primitive_instance = RollingOutlierCount(\n        window_length=window_length,\n        gap=gap,\n        min_periods=min_periods,\n    )\n\n    primitive_func = primitive_instance.get_function()\n\n    actual_vals 
= pd.Series(\n        primitive_func(\n            rolling_outlier_series.index,\n            pd.Series(rolling_outlier_series.values),\n        ),\n    )\n\n    expected_vals = apply_rolling_agg_to_series(\n        series=rolling_outlier_series,\n        agg_func=primitive_instance.get_outliers_count,\n        window_length=window_length,\n        gap=gap,\n        min_periods=min_periods,\n    )\n\n    # Since min_periods of 0 is the same as min_periods of 1\n    num_nans_from_min_periods = min_periods or 1\n    assert (\n        actual_vals.isna().sum()\n        == get_number_from_offset(gap) + num_nans_from_min_periods - 1\n    )\n    pd.testing.assert_series_equal(actual_vals, pd.Series(data=expected_vals))\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/aggregation_primitive_tests/test_time_since.py",
    "content": "from datetime import datetime\nfrom math import isnan\n\nimport numpy as np\nimport pandas as pd\n\nfrom featuretools.primitives import (\n    TimeSinceLastFalse,\n    TimeSinceLastMax,\n    TimeSinceLastMin,\n    TimeSinceLastTrue,\n)\n\n\nclass TestTimeSinceLastFalse:\n    primitive = TimeSinceLastFalse\n    cutoff_time = datetime(2011, 4, 9, 11, 31, 27)\n    times = pd.Series(\n        [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)]\n        + [datetime(2011, 4, 9, 10, 31, i * 9) for i in range(4)],\n    )\n    booleans = pd.Series([True] * 5 + [False] * 4)\n\n    def test_booleans(self):\n        primitive_func = self.primitive().get_function()\n        answer = self.cutoff_time - datetime(2011, 4, 9, 10, 31, 27)\n        assert (\n            primitive_func(\n                self.times,\n                self.booleans,\n                time=self.cutoff_time,\n            )\n            == answer.total_seconds()\n        )\n\n    def test_booleans_reversed(self):\n        primitive_func = self.primitive().get_function()\n        answer = self.cutoff_time - datetime(2011, 4, 9, 10, 30, 18)\n        reversed_booleans = pd.Series(self.booleans.values[::-1])\n        assert (\n            primitive_func(\n                self.times,\n                reversed_booleans,\n                time=self.cutoff_time,\n            )\n            == answer.total_seconds()\n        )\n\n    def test_no_false(self):\n        primitive_func = self.primitive().get_function()\n        times = pd.Series([datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)])\n        booleans = pd.Series([True] * 5)\n        assert isnan(primitive_func(times, booleans, time=self.cutoff_time))\n\n    def test_nans(self):\n        primitive_func = self.primitive().get_function()\n        times = pd.concat([self.times.copy(), pd.Series([np.nan, pd.NaT])])\n        booleans = pd.concat(\n            [self.booleans.copy(), pd.Series([np.nan], dtype=\"boolean\")],\n        )\n      
  times = times.reset_index(drop=True)\n        booleans = booleans.reset_index(drop=True)\n        answer = self.cutoff_time - datetime(2011, 4, 9, 10, 31, 27)\n        assert (\n            primitive_func(\n                times,\n                booleans,\n                time=self.cutoff_time,\n            )\n            == answer.total_seconds()\n        )\n\n    def test_empty(self):\n        primitive_func = self.primitive().get_function()\n        times = pd.Series([], dtype=\"datetime64[ns]\")\n        booleans = pd.Series([], dtype=\"boolean\")\n        times = times.reset_index(drop=True)\n        answer = primitive_func(\n            times,\n            booleans,\n            time=self.cutoff_time,\n        )\n        assert pd.isna(answer)\n\n\nclass TestTimeSinceLastMax:\n    primitive = TimeSinceLastMax\n    cutoff_time = datetime(2011, 4, 9, 11, 31, 27)\n    times = pd.Series(\n        [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)]\n        + [datetime(2011, 4, 9, 10, 31, i * 9) for i in range(4)],\n    )\n    numerics = pd.Series([0, 1, 2, 8, 2, 5, 1, 3, 7])\n    actual_time_since = cutoff_time - datetime(2011, 4, 9, 10, 30, 18)\n    actual_seconds = actual_time_since.total_seconds()\n\n    def test_primitive_func_1(self):\n        primitive_func = self.primitive().get_function()\n        assert (\n            primitive_func(\n                self.times,\n                self.numerics,\n                time=self.cutoff_time,\n            )\n            == self.actual_seconds\n        )\n\n    def test_no_max(self):\n        primitive_func = self.primitive().get_function()\n        times = pd.Series([datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)])\n        numerics = pd.Series([0] * 5)\n        actual_time_since = self.cutoff_time - datetime(2011, 4, 9, 10, 30, 0)\n        actual_seconds = actual_time_since.total_seconds()\n        assert primitive_func(times, numerics, time=self.cutoff_time) == actual_seconds\n\n    def 
test_nans(self):\n        primitive_func = self.primitive().get_function()\n        times = pd.concat([self.times.copy(), pd.Series([np.nan, pd.NaT])])\n        numerics = pd.concat(\n            [self.numerics.copy(), pd.Series([np.nan], dtype=\"float64\")],\n        )\n        times = times.reset_index(drop=True)\n        numerics = numerics.reset_index(drop=True)\n        assert (\n            primitive_func(\n                times,\n                numerics,\n                time=self.cutoff_time,\n            )\n            == self.actual_seconds\n        )\n\n\nclass TestTimeSinceLastMin:\n    primitive = TimeSinceLastMin\n    cutoff_time = datetime(2011, 4, 9, 11, 31, 27)\n    times = pd.Series(\n        [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)]\n        + [datetime(2011, 4, 9, 10, 31, i * 9) for i in range(4)],\n    )\n    numerics = pd.Series([1, 0, 2, 8, 2, 5, 1, 3, 7])\n    actual_time_since = cutoff_time - datetime(2011, 4, 9, 10, 30, 6)\n    actual_seconds = actual_time_since.total_seconds()\n\n    def test_primitive_func_1(self):\n        primitive_func = self.primitive().get_function()\n        assert (\n            primitive_func(\n                self.times,\n                self.numerics,\n                time=self.cutoff_time,\n            )\n            == self.actual_seconds\n        )\n\n    def test_no_min(self):\n        primitive_func = self.primitive().get_function()\n        times = pd.Series([datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)])\n        numerics = pd.Series([0] * 5)\n        actual_time_since = self.cutoff_time - datetime(2011, 4, 9, 10, 30, 0)\n        actual_seconds = actual_time_since.total_seconds()\n        assert primitive_func(times, numerics, time=self.cutoff_time) == actual_seconds\n\n    def test_nans(self):\n        primitive_func = self.primitive().get_function()\n        times = pd.concat(\n            [self.times.copy(), pd.Series([np.nan, pd.NaT], dtype=\"datetime64[ns]\")],\n        )\n   
     numerics = pd.concat(\n            [self.numerics.copy(), pd.Series([np.nan, np.nan], dtype=\"float64\")],\n        )\n        times = times.reset_index(drop=True)\n        numerics = numerics.reset_index(drop=True)\n        assert (\n            primitive_func(\n                times,\n                numerics,\n                time=self.cutoff_time,\n            )\n            == self.actual_seconds\n        )\n\n\nclass TestTimeSinceLastTrue:\n    primitive = TimeSinceLastTrue\n    cutoff_time = datetime(2011, 4, 9, 11, 31, 27)\n    times = pd.Series(\n        [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)]\n        + [datetime(2011, 4, 9, 10, 31, i * 9) for i in range(4)],\n    )\n    booleans = pd.Series([True] * 5 + [False] * 4)\n    actual_time_since = cutoff_time - datetime(2011, 4, 9, 10, 30, 24)\n    actual_seconds = actual_time_since.total_seconds()\n\n    def test_primitive_func_1(self):\n        primitive_func = self.primitive().get_function()\n        assert (\n            primitive_func(\n                self.times,\n                self.booleans,\n                time=self.cutoff_time,\n            )\n            == self.actual_seconds\n        )\n\n    def test_no_true(self):\n        primitive_func = self.primitive().get_function()\n        times = pd.Series([datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)])\n        booleans = pd.Series([False] * 5)\n        assert isnan(primitive_func(times, booleans, time=self.cutoff_time))\n\n    def test_nans(self):\n        primitive_func = self.primitive().get_function()\n        times = pd.concat(\n            [self.times.copy(), pd.Series([np.nan, pd.NaT], dtype=\"datetime64[ns]\")],\n        )\n        booleans = pd.concat(\n            [self.booleans.copy(), pd.Series([np.nan], dtype=\"boolean\")],\n        )\n        times = times.reset_index(drop=True)\n        booleans = booleans.reset_index(drop=True)\n        assert (\n            primitive_func(\n                times,\n         
       booleans,\n                time=self.cutoff_time,\n            )\n            == self.actual_seconds\n        )\n\n    def test_no_cutofftime(self):\n        primitive_func = self.primitive().get_function()\n        times = pd.Series([datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)])\n        booleans = pd.Series([False] * 5)\n        assert isnan(primitive_func(times, booleans))\n\n    def test_empty(self):\n        primitive_func = self.primitive().get_function()\n        times = pd.Series([], dtype=\"datetime64[ns]\")\n        booleans = pd.Series([], dtype=\"boolean\")\n        times = times.reset_index(drop=True)\n        answer = primitive_func(\n            times,\n            booleans,\n            time=self.cutoff_time,\n        )\n        assert pd.isna(answer)\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/bad_primitive_files/__init__.py",
    "content": ""
  },
  {
    "path": "featuretools/tests/primitive_tests/bad_primitive_files/multiple_primitives.py",
    "content": "from woodwork.column_schema import ColumnSchema\n\nfrom featuretools.primitives import AggregationPrimitive\n\n\nclass CustomMax(AggregationPrimitive):\n    name = \"custom_max\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n\n\nclass CustomSum(AggregationPrimitive):\n    name = \"custom_sum\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/bad_primitive_files/no_primitives.py",
    "content": ""
  },
  {
    "path": "featuretools/tests/primitive_tests/natural_language_primitives_tests/__init__.py",
    "content": ""
  },
  {
    "path": "featuretools/tests/primitive_tests/natural_language_primitives_tests/test_count_string.py",
    "content": "import numpy as np\nimport pandas as pd\n\nfrom featuretools.primitives import CountString\nfrom featuretools.tests.primitive_tests.utils import (\n    PrimitiveTestBase,\n    find_applicable_primitives,\n    valid_dfs,\n)\n\n\nclass TestCountString(PrimitiveTestBase):\n    primitive = CountString\n\n    def compare(self, primitive_initiated, test_cases, answers):\n        primitive_func = primitive_initiated.get_function()\n        primitive_answers = primitive_func(test_cases)\n        return np.testing.assert_array_equal(answers, primitive_answers)\n\n    test_cases = pd.Series(\n        [\n            # Ignore case\n            \"Hello other words hello hEllo HELLO\",\n            # ignore non alphanumeric\n            \"he\\\\{ll\\t\\n\\t.--?o othe/r words hello hello h.el./lo\",\n            # match whole word\n            \"hellohellohello other hello word go hello here 9hello hello9\",\n            # all combined\n            #   hello/ counts as hello being it's own word\n            #   since * and / are non word characters\n            #   but 9 is a \"word character\" so 9hello9\n            #   does not count as hello being its own word\n            \"helloHellohello 9Hello 9hello9 *hello/ test'hel..lo' 'hE.l.lO' \\\n         hello\",\n        ],\n    )\n\n    def test_non_regex_with_no_other_parameters(self):\n        primitive = self.primitive(\n            \"hello\",\n            ignore_case=False,\n            ignore_non_alphanumeric=False,\n            is_regex=False,\n            match_whole_words_only=False,\n        )\n        answers = [1, 2, 7, 5]\n        self.compare(primitive, self.test_cases, answers)\n\n    def test_non_regex_ignore_case(self):\n        primitive1 = self.primitive(\n            \"hello\",\n            ignore_case=True,\n            ignore_non_alphanumeric=False,\n            is_regex=False,\n            match_whole_words_only=False,\n        )\n\n        primitive2 = self.primitive(\n            
\"HeLLo\",\n            ignore_case=True,\n            ignore_non_alphanumeric=False,\n            is_regex=False,\n            match_whole_words_only=False,\n        )\n\n        answers = [4, 2, 7, 7]\n        self.compare(primitive1, self.test_cases, answers)\n        self.compare(primitive2, self.test_cases, answers)\n\n    def test_non_regex_ignore_non_alphanumeric(self):\n        primitive = self.primitive(\n            \"hello\",\n            ignore_case=False,\n            ignore_non_alphanumeric=True,\n            is_regex=False,\n            match_whole_words_only=False,\n        )\n        answers = [1, 4, 7, 6]\n        self.compare(primitive, self.test_cases, answers)\n\n    def test_non_regex_match_whole_words_only(self):\n        primitive = self.primitive(\n            \"hello\",\n            ignore_case=False,\n            ignore_non_alphanumeric=False,\n            is_regex=False,\n            match_whole_words_only=True,\n        )\n\n        answers = [1, 2, 2, 2]\n        self.compare(primitive, self.test_cases, answers)\n\n    def test_non_regex_with_all_others_parameters(self):\n        primitive = self.primitive(\n            \"hello\",\n            ignore_case=True,\n            ignore_non_alphanumeric=True,\n            is_regex=False,\n            match_whole_words_only=True,\n        )\n\n        answers = [4, 4, 2, 3]\n        self.compare(primitive, self.test_cases, answers)\n\n    def test_regex_with_no_other_parameters(self):\n        primitive = self.primitive(\n            \"h.l.o\",\n            ignore_case=False,\n            ignore_non_alphanumeric=False,\n            is_regex=True,\n            match_whole_words_only=False,\n        )\n\n        answers = [2, 2, 7, 5]\n        self.compare(primitive, self.test_cases, answers)\n\n    def test_regex_with_ignore_case(self):\n        primitive = self.primitive(\n            \"h.l.o\",\n            ignore_case=True,\n            ignore_non_alphanumeric=False,\n            
is_regex=True,\n            match_whole_words_only=False,\n        )\n\n        answers = [4, 2, 7, 7]\n        self.compare(primitive, self.test_cases, answers)\n\n    def test_regex_with_ignore_non_alphanumeric(self):\n        primitive = self.primitive(\n            \"h.l.o\",\n            ignore_case=False,\n            ignore_non_alphanumeric=True,\n            is_regex=True,\n            match_whole_words_only=False,\n        )\n\n        answers = [2, 4, 7, 6]\n        self.compare(primitive, self.test_cases, answers)\n\n    def test_regex_with_match_whole_words_only(self):\n        primitive = self.primitive(\n            \"h.l.o\",\n            ignore_case=False,\n            ignore_non_alphanumeric=False,\n            is_regex=True,\n            match_whole_words_only=True,\n        )\n\n        answers = [2, 2, 2, 2]\n        self.compare(primitive, self.test_cases, answers)\n\n    def test_regex_with_all_other_parameters(self):\n        primitive = self.primitive(\n            \"h.l.o\",\n            ignore_case=True,\n            ignore_non_alphanumeric=True,\n            is_regex=True,\n            match_whole_words_only=True,\n        )\n\n        answers = [4, 4, 2, 3]\n        self.compare(primitive, self.test_cases, answers)\n\n    def test_overlapping_regex(self):\n        primitive = self.primitive(\n            \"(?=(a.*a))\",\n            ignore_case=True,\n            ignore_non_alphanumeric=True,\n            is_regex=True,\n            match_whole_words_only=False,\n        )\n        test_cases = pd.Series([\"aaaaaaaaaa\", \"atesta aa aa a\"])\n        answers = [9, 6]\n        self.compare(primitive, test_cases, answers)\n\n    def test_the(self):\n        primitive = self.primitive(\n            \"the\",\n            ignore_case=True,\n            ignore_non_alphanumeric=False,\n            is_regex=False,\n            match_whole_words_only=False,\n        )\n        test_cases = pd.Series([\"The fox jumped over the cat\", \"The there 
then\"])\n\n        answers = [2, 3]\n        self.compare(primitive, test_cases, answers)\n\n    def test_nan(self):\n        primitive = self.primitive(\n            \"the\",\n            ignore_case=True,\n            ignore_non_alphanumeric=False,\n            is_regex=False,\n            match_whole_words_only=False,\n        )\n        test_cases = pd.Series(\n            [np.nan, None, pd.NA, \"The fox jumped over the cat\", \"The there then\"],\n        )\n        answers = [np.nan, np.nan, np.nan, 2, 3]\n        self.compare(primitive, test_cases, answers)\n\n    def test_with_featuretools(self, es):\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        primitive_instance = self.primitive(\n            \"the\",\n            ignore_case=True,\n            ignore_non_alphanumeric=False,\n            is_regex=False,\n            match_whole_words_only=False,\n        )\n        transform.append(primitive_instance)\n        valid_dfs(es, aggregation, transform, self.primitive)\n\n    def test_with_featuretools_nan(self, es):\n        log_df = es[\"log\"]\n        comments = log_df[\"comments\"]\n        comments[1] = pd.NA\n        comments[2] = np.nan\n        comments[3] = None\n        log_df[\"comments\"] = comments\n        es.replace_dataframe(dataframe_name=\"log\", df=log_df)\n\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        primitive_instance = self.primitive(\n            \"the\",\n            ignore_case=True,\n            ignore_non_alphanumeric=False,\n            is_regex=False,\n            match_whole_words_only=False,\n        )\n        transform.append(primitive_instance)\n        valid_dfs(es, aggregation, transform, self.primitive)\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/natural_language_primitives_tests/test_mean_characters_per_word.py",
    "content": "import numpy as np\nimport pandas as pd\nimport pytest\n\nfrom featuretools.primitives import MeanCharactersPerWord\nfrom featuretools.tests.primitive_tests.utils import (\n    PrimitiveTestBase,\n    find_applicable_primitives,\n    valid_dfs,\n)\n\n\nclass TestMeanCharactersPerWord(PrimitiveTestBase):\n    primitive = MeanCharactersPerWord\n\n    def test_sentences(self):\n        x = pd.Series(\n            [\n                \"This is a test file\",\n                \"This is second line\",\n                \"third line $1,000\",\n                \"and subsequent lines\",\n                \"and more\",\n            ],\n        )\n        primitive_func = self.primitive().get_function()\n        answers = pd.Series([3.0, 4.0, 5.0, 6.0, 3.5])\n        pd.testing.assert_series_equal(primitive_func(x), answers, check_names=False)\n\n    def test_punctuation(self):\n        x = pd.Series(\n            [\n                \"This: is a test file\",\n                \"This, is second line?\",\n                \"third/line $1,000;\",\n                \"and--subsequen't lines...\",\n                \"*and, more..\",\n            ],\n        )\n        primitive_func = self.primitive().get_function()\n        answers = pd.Series([3.0, 4.0, 8.0, 10.5, 4.0])\n        pd.testing.assert_series_equal(primitive_func(x), answers, check_names=False)\n\n    def test_multiline(self):\n        x = pd.Series(\n            [\n                \"This is a test file\",\n                \"This is second line\\nthird line $1000;\\nand subsequent lines\",\n                \"and more\",\n            ],\n        )\n        primitive_func = self.primitive().get_function()\n        answers = pd.Series([3.0, 4.8, 3.5])\n        pd.testing.assert_series_equal(primitive_func(x), answers, check_names=False)\n\n    @pytest.mark.parametrize(\n        \"na_value\",\n        [None, np.nan, pd.NA],\n    )\n    def test_nans(self, na_value):\n        x = pd.Series([na_value, \"\", \"third 
line\"])\n        primitive_func = self.primitive().get_function()\n        answers = pd.Series([np.nan, 0, 4.5])\n        pd.testing.assert_series_equal(primitive_func(x), answers, check_names=False)\n\n    @pytest.mark.parametrize(\n        \"na_value\",\n        [None, np.nan, pd.NA],\n    )\n    def test_all_nans(self, na_value):\n        x = pd.Series([na_value, na_value, na_value])\n        primitive_func = self.primitive().get_function()\n        answers = pd.Series([np.nan, np.nan, np.nan])\n        pd.testing.assert_series_equal(primitive_func(x), answers, check_names=False)\n\n    def test_with_featuretools(self, es):\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        primitive_instance = self.primitive()\n        transform.append(primitive_instance)\n        valid_dfs(es, aggregation, transform, self.primitive)\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/natural_language_primitives_tests/test_median_word_length.py",
    "content": "import numpy as np\nimport pandas as pd\n\nfrom featuretools.primitives import MedianWordLength\nfrom featuretools.tests.primitive_tests.utils import (\n    PrimitiveTestBase,\n    find_applicable_primitives,\n    valid_dfs,\n)\n\n\nclass TestMedianWordLength(PrimitiveTestBase):\n    primitive = MedianWordLength\n\n    def test_delimiter_override(self):\n        x = pd.Series(\n            [\"This is a test file.\", \"This,is,second,line?\", \"and;subsequent;lines...\"],\n        )\n\n        expected = pd.Series([4.0, 4.5, 8.0])\n        actual = self.primitive(\"[ ,;]\").get_function()(x)\n        pd.testing.assert_series_equal(actual, expected, check_names=False)\n\n    def test_multiline(self):\n        x = pd.Series(\n            [\n                \"This is a test file.\",\n                \"This is second line\\nthird line $1000;\\nand subsequent lines\",\n            ],\n        )\n\n        expected = pd.Series([4.0, 4.5])\n        actual = self.primitive().get_function()(x)\n        pd.testing.assert_series_equal(actual, expected, check_names=False)\n\n    def test_null(self):\n        x = pd.Series([np.nan, pd.NA, None, \"This is a test file.\"])\n\n        actual = self.primitive().get_function()(x)\n        expected = pd.Series([np.nan, np.nan, np.nan, 4.0])\n        pd.testing.assert_series_equal(actual, expected, check_names=False)\n\n    def test_with_featuretools(self, es):\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        primitive_instance = self.primitive()\n        transform.append(primitive_instance)\n        valid_dfs(es, aggregation, transform, self.primitive)\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/natural_language_primitives_tests/test_natural_language_primitives_terminate.py",
    "content": "import pandas as pd\nimport pytest\n\nfrom featuretools.primitives.utils import _get_natural_language_primitives\n\nTIMEOUT_THRESHOLD = 20\n\n\nclass TestNaturalLanguagePrimitivesTerminate:\n    # need to sort primitives to avoid pytest collection error\n    primitives = sorted(_get_natural_language_primitives().items())\n\n    @pytest.mark.timeout(TIMEOUT_THRESHOLD)\n    @pytest.mark.parametrize(\"primitive\", [prim for _, prim in primitives])\n    def test_natlang_primitive_does_not_timeout(\n        self,\n        strings_that_have_triggered_errors_before,\n        primitive,\n    ):\n        for text in strings_that_have_triggered_errors_before:\n            primitive().get_function()(pd.Series(text))\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/natural_language_primitives_tests/test_num_characters.py",
    "content": "import numpy as np\nimport pandas as pd\n\nfrom featuretools.primitives import NumCharacters\nfrom featuretools.tests.primitive_tests.utils import (\n    PrimitiveTestBase,\n    find_applicable_primitives,\n    valid_dfs,\n)\n\n\nclass TestNumCharacters(PrimitiveTestBase):\n    primitive = NumCharacters\n\n    def test_general(self):\n        x = pd.Series(\n            [\n                \"test test test test\",\n                \"test TEST test TEST,test test test\",\n                \"and subsequent lines...\",\n            ],\n        )\n        expected = pd.Series([19, 34, 23])\n        actual = self.primitive().get_function()(x)\n        pd.testing.assert_series_equal(actual, expected, check_names=False)\n\n    def test_special_characters_and_whitespace(self):\n        x = pd.Series([\"50% 50 50% \\t\\t\\t\\n\\n\", \"$5,3040 a test* test\"])\n        expected = pd.Series([16, 20])\n        actual = self.primitive().get_function()(x)\n        pd.testing.assert_series_equal(actual, expected, check_names=False)\n\n    def test_unicode_input(self):\n        x = pd.Series(\n            [\n                \"Ángel Angel Ángel ángel\",\n            ],\n        )\n        expected = pd.Series([23])\n        actual = self.primitive().get_function()(x)\n        pd.testing.assert_series_equal(actual, expected, check_names=False)\n\n    def test_null(self):\n        x = pd.Series([np.nan, pd.NA, None, \"This is a test file.\"])\n        actual = self.primitive().get_function()(x)\n        expected = pd.Series([pd.NA, pd.NA, pd.NA, 20])\n        pd.testing.assert_series_equal(\n            actual,\n            expected,\n            check_names=False,\n            check_dtype=False,\n        )\n\n    def test_with_featuretools(self, es):\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        primitive_instance = self.primitive()\n        transform.append(primitive_instance)\n        valid_dfs(es, aggregation, transform, 
self.primitive)\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/natural_language_primitives_tests/test_num_unique_separators.py",
    "content": "import numpy as np\nimport pandas as pd\n\nfrom featuretools.primitives import NumUniqueSeparators\nfrom featuretools.tests.primitive_tests.utils import (\n    PrimitiveTestBase,\n    find_applicable_primitives,\n    valid_dfs,\n)\n\n\nclass TestNumUniqueSeparators(PrimitiveTestBase):\n    primitive = NumUniqueSeparators\n\n    def test_punctuation(self):\n        x = pd.Series(\n            [\n                \"This: is a test file\",\n                \"This, is second line?\",\n                \"third/line $1,000;\",\n                \"and--subsequen't lines...\",\n                \"*and, more..\",\n            ],\n        )\n        primitive_func = self.primitive().get_function()\n        answers = pd.Series([1, 3, 3, 2, 3])\n        pd.testing.assert_series_equal(primitive_func(x), answers, check_names=False)\n\n    def test_other_delimeters(self):\n        x = pd.Series([\"@#$%^&*()<>/[]\\\\`~-_=+\"])\n        primitive_func = self.primitive().get_function()\n        answers = pd.Series([0])\n        pd.testing.assert_series_equal(primitive_func(x), answers, check_names=False)\n\n    def test_multiline(self):\n        x = pd.Series(\n            [\n                \"This is a test file\",\n                \"This is second line\\nthird line $1000;\\nand subsequent lines\",\n                \"and more!\",\n            ],\n        )\n        primitive_func = self.primitive().get_function()\n        answers = pd.Series([1, 3, 2])\n        pd.testing.assert_series_equal(primitive_func(x), answers, check_names=False)\n\n    def test_nans(self):\n        x = pd.Series([np.nan, \"\", \"third line.\"])\n        primitive_func = self.primitive().get_function()\n        answers = pd.Series([pd.NA, 0, 2])\n        pd.testing.assert_series_equal(primitive_func(x), answers, check_names=False)\n\n    def test_with_featuretools(self, es):\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        primitive_instance = 
self.primitive()\n        transform.append(primitive_instance)\n        valid_dfs(es, aggregation, transform, self.primitive)\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/natural_language_primitives_tests/test_num_words.py",
    "content": "import numpy as np\nimport pandas as pd\n\nfrom featuretools.primitives import NumWords\nfrom featuretools.tests.primitive_tests.utils import (\n    PrimitiveTestBase,\n    find_applicable_primitives,\n    valid_dfs,\n)\n\n\nclass TestNumWords(PrimitiveTestBase):\n    primitive = NumWords\n\n    def test_general(self):\n        x = pd.Series(\n            [\n                \"test test test test\",\n                \"test TEST test TEST,test test test\",\n                \"and subsequent lines...\",\n            ],\n        )\n        expected = pd.Series([4, 6, 3])\n        actual = self.primitive().get_function()(x)\n        pd.testing.assert_series_equal(actual, expected, check_names=False)\n\n    def test_special_characters_and_whitespace(self):\n        x = pd.Series([\"50% 50 50% \\t\\t\\t\\n\\n\", \"$5,3040 a test* test\"])\n        expected = pd.Series([3, 4])\n        actual = self.primitive().get_function()(x)\n        pd.testing.assert_series_equal(actual, expected, check_names=False)\n\n    def test_unicode_input(self):\n        x = pd.Series(\n            [\n                \"Ángel Angel Ángel ángel\",\n            ],\n        )\n        expected = pd.Series([4])\n        actual = self.primitive().get_function()(x)\n        pd.testing.assert_series_equal(actual, expected, check_names=False)\n\n    def test_contractions(self):\n        x = pd.Series(\n            [\n                \"can't won't don't can't aren't won't don't they'd there's\",\n            ],\n        )\n        expected = pd.Series([9])\n        actual = self.primitive().get_function()(x)\n        pd.testing.assert_series_equal(actual, expected, check_names=False)\n\n    def test_multiple_spaces(self):\n        x = pd.Series(\n            [\n                \"    word  word            word word     .\",\n                \"This is                      \\nthird line \\nthird line\",\n            ],\n        )\n        expected = pd.Series([4, 6])\n        actual = 
self.primitive().get_function()(x)\n        pd.testing.assert_series_equal(actual, expected, check_names=False)\n\n    def test_null(self):\n        x = pd.Series([np.nan, pd.NA, None, \"This is a test file.\"])\n        actual = self.primitive().get_function()(x)\n        expected = pd.Series([pd.NA, pd.NA, pd.NA, 5])\n        pd.testing.assert_series_equal(\n            actual,\n            expected,\n            check_names=False,\n            check_dtype=False,\n        )\n\n    def test_with_featuretools(self, es):\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        primitive_instance = self.primitive()\n        transform.append(primitive_instance)\n        valid_dfs(es, aggregation, transform, self.primitive)\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/natural_language_primitives_tests/test_number_of_common_words.py",
    "content": "import numpy as np\nimport pandas as pd\n\nfrom featuretools.primitives import NumberOfCommonWords\nfrom featuretools.tests.primitive_tests.utils import (\n    PrimitiveTestBase,\n    find_applicable_primitives,\n    valid_dfs,\n)\n\n\nclass TestNumberOfCommonWords(PrimitiveTestBase):\n    primitive = NumberOfCommonWords\n    test_word_bank = {\"and\", \"a\", \"is\"}\n\n    def test_delimiter_override(self):\n        x = pd.Series(\n            [\n                \"This is a test file.\",\n                \"This,is,second,line, and?\",\n                \"and;subsequent;lines...\",\n            ],\n        )\n\n        expected = pd.Series([2, 2, 1])\n        actual = self.primitive(\n            word_set=self.test_word_bank,\n            delimiters_regex=\"[ ,;]\",\n        ).get_function()(x)\n        pd.testing.assert_series_equal(actual, expected, check_names=False)\n\n    def test_multiline(self):\n        x = pd.Series(\n            [\n                \"This is a test file.\",\n                \"This is second line\\nthird line $1000;\\nand subsequent lines\",\n            ],\n        )\n\n        expected = pd.Series([2, 2])\n        actual = self.primitive(self.test_word_bank).get_function()(x)\n        pd.testing.assert_series_equal(actual, expected, check_names=False)\n\n    def test_null(self):\n        x = pd.Series([np.nan, pd.NA, None, \"This is a test file.\"])\n\n        actual = self.primitive(self.test_word_bank).get_function()(x)\n        expected = pd.Series([pd.NA, pd.NA, pd.NA, 2])\n        pd.testing.assert_series_equal(actual, expected, check_names=False)\n\n    def test_case_insensitive(self):\n        x = pd.Series([\"Is\", \"a\", \"AND\"])\n\n        actual = self.primitive(self.test_word_bank).get_function()(x)\n        expected = pd.Series([1, 1, 1])\n        pd.testing.assert_series_equal(actual, expected, check_names=False)\n\n    def test_with_featuretools(self, es):\n        transform, aggregation = 
find_applicable_primitives(self.primitive)\n        primitive_instance = self.primitive()\n        transform.append(primitive_instance)\n        valid_dfs(es, aggregation, transform, self.primitive)\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/natural_language_primitives_tests/test_number_of_hashtags.py",
    "content": "import numpy as np\nimport pandas as pd\n\nfrom featuretools.primitives import NumberOfHashtags\nfrom featuretools.tests.primitive_tests.utils import (\n    PrimitiveTestBase,\n    find_applicable_primitives,\n    valid_dfs,\n)\n\n\nclass TestNumberOfHashtags(PrimitiveTestBase):\n    primitive = NumberOfHashtags\n\n    def test_regular_input(self):\n        x = pd.Series(\n            [\n                \"#hello #hi #hello\",\n                \"#regular#expression#0or1#yes\",\n                \"andorandorand #32309\",\n            ],\n        )\n        expected = [3.0, 0.0, 0.0]\n        actual = self.primitive().get_function()(x)\n        np.testing.assert_array_equal(actual, expected)\n\n    def test_unicode_input(self):\n        x = pd.Series(\n            [\n                \"#Ángel #Æ #ĘÁÊÚ\",\n                \"#############Āndandandandand###\",\n                \"andorandorand #32309\",\n            ],\n        )\n        expected = [3.0, 0.0, 0.0]\n        actual = self.primitive().get_function()(x)\n        np.testing.assert_array_equal(actual, expected)\n\n    def test_multiline(self):\n        x = pd.Series(\n            [\n                \"#\\n\\t\\n\",\n                \"#hashtag\\n#hashtag2\\n#\\n\\n\",\n            ],\n        )\n\n        expected = [0.0, 2.0]\n        actual = self.primitive().get_function()(x)\n        np.testing.assert_array_equal(actual, expected)\n\n    def test_null(self):\n        x = pd.Series([np.nan, pd.NA, None, \"#test\"])\n\n        actual = self.primitive().get_function()(x)\n        expected = [np.nan, np.nan, np.nan, 1.0]\n        np.testing.assert_array_equal(actual, expected)\n\n    def test_alphanumeric_and_special(self):\n        x = pd.Series([\"#1or0\", \"#12\", \"#??!>@?@#>\"])\n\n        actual = self.primitive().get_function()(x)\n        expected = [1.0, 0.0, 0.0]\n        np.testing.assert_array_equal(actual, expected)\n\n    def test_underscore(self):\n        x = pd.Series([\"#no\", 
\"#__yes\", \"#??!>@?@#>\"])\n\n        actual = self.primitive().get_function()(x)\n        expected = [1.0, 1.0, 0.0]\n        np.testing.assert_array_equal(actual, expected)\n\n    def test_with_featuretools(self, es):\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        primitive_instance = self.primitive()\n        transform.append(primitive_instance)\n        valid_dfs(es, aggregation, transform, self.primitive)\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/natural_language_primitives_tests/test_number_of_mentions.py",
    "content": "import numpy as np\nimport pandas as pd\n\nfrom featuretools.primitives import NumberOfMentions\nfrom featuretools.tests.primitive_tests.utils import (\n    PrimitiveTestBase,\n    find_applicable_primitives,\n    valid_dfs,\n)\n\n\nclass TestNumberOfMentions(PrimitiveTestBase):\n    primitive = NumberOfMentions\n\n    def test_regular_input(self):\n        x = pd.Series(\n            [\n                \"@hello @hi @hello\",\n                \"@and@\",\n                \"andorandorand\",\n            ],\n        )\n        expected = [3.0, 0.0, 0.0]\n        actual = self.primitive().get_function()(x)\n        np.testing.assert_array_equal(actual, expected)\n\n    def test_unicode_input(self):\n        x = pd.Series(\n            [\n                \"@Ángel @Æ @ĘÁÊÚ\",\n                \"@@@@Āndandandandand@\",\n                \"andorandorand @32309\",\n                \"example@gmail.com\",\n                \"@example-20329\",\n            ],\n        )\n        expected = [3.0, 0.0, 1.0, 0.0, 1.0]\n        actual = self.primitive().get_function()(x)\n        np.testing.assert_array_equal(actual, expected)\n\n    def test_multiline(self):\n        x = pd.Series(\n            [\n                \"@\\n\\t\\n\",\n                \"@mention\\n @mention2\\n@\\n\\n\",\n            ],\n        )\n\n        expected = [0.0, 2.0]\n        actual = self.primitive().get_function()(x)\n        np.testing.assert_array_equal(actual, expected)\n\n    def test_null(self):\n        x = pd.Series([np.nan, pd.NA, None, \"@test\"])\n\n        actual = self.primitive().get_function()(x)\n        expected = [np.nan, np.nan, np.nan, 1.0]\n        np.testing.assert_array_equal(actual, expected)\n\n    def test_alphanumeric_and_special(self):\n        x = pd.Series([\"@1or0\", \"@12\", \"#??!>@?@#>\"])\n\n        actual = self.primitive().get_function()(x)\n        expected = [1.0, 1.0, 0.0]\n        np.testing.assert_array_equal(actual, expected)\n\n    def 
test_underscore(self):\n        x = pd.Series([\"@user1\", \"@__yes\", \"#??!>@?@#>\"])\n\n        actual = self.primitive().get_function()(x)\n        expected = [1.0, 1.0, 0.0]\n        np.testing.assert_array_equal(actual, expected)\n\n    def test_with_featuretools(self, es):\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        primitive_instance = self.primitive()\n        transform.append(primitive_instance)\n        valid_dfs(es, aggregation, transform, self.primitive)\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/natural_language_primitives_tests/test_number_of_unique_words.py",
    "content": "import numpy as np\nimport pandas as pd\n\nfrom featuretools.primitives import NumberOfUniqueWords\nfrom featuretools.tests.primitive_tests.utils import (\n    PrimitiveTestBase,\n    find_applicable_primitives,\n    valid_dfs,\n)\n\n\nclass TestNumberOfUniqueWords(PrimitiveTestBase):\n    primitive = NumberOfUniqueWords\n\n    def test_general(self):\n        x = pd.Series(\n            [\n                \"test test test test\",\n                \"test TEST test TEST\",\n                \"and subsequent lines...\",\n            ],\n        )\n\n        expected = pd.Series([1, 2, 3])\n        actual = self.primitive().get_function()(x)\n        pd.testing.assert_series_equal(actual, expected, check_names=False)\n\n    def test_special_characters_and_whitespace(self):\n        x = pd.Series([\"50% 50 50% \\t\\t\\t\\n\\n\", \"a test* test\"])\n\n        expected = pd.Series([1, 2])\n        actual = self.primitive().get_function()(x)\n        pd.testing.assert_series_equal(actual, expected, check_names=False)\n\n    def test_unicode_input(self):\n        x = pd.Series(\n            [\n                \"Ángel Angel Ángel ángel\",\n            ],\n        )\n\n        expected = pd.Series([3])\n        actual = self.primitive().get_function()(x)\n        pd.testing.assert_series_equal(actual, expected, check_names=False)\n\n    def test_contractions(self):\n        x = pd.Series(\n            [\n                \"can't won't don't can't aren't won't don't they'd there's\",\n            ],\n        )\n\n        expected = pd.Series([6])\n        actual = self.primitive().get_function()(x)\n        pd.testing.assert_series_equal(actual, expected, check_names=False)\n\n    def test_multiline(self):\n        x = pd.Series(\n            [\n                \"word word word word.\",\n                \"This is \\nthird line \\nthird line\",\n            ],\n        )\n\n        expected = pd.Series([1, 4])\n        actual = 
self.primitive().get_function()(x)\n        pd.testing.assert_series_equal(actual, expected, check_names=False)\n\n    def test_null(self):\n        x = pd.Series([np.nan, pd.NA, None, \"This is a test file.\"])\n\n        actual = self.primitive().get_function()(x)\n        expected = pd.Series([pd.NA, pd.NA, pd.NA, 5])\n        pd.testing.assert_series_equal(actual, expected, check_names=False)\n\n    def test_case_insensitive(self):\n        x = pd.Series([\"WORD word WORd WORd WOrD word\"])\n\n        actual = self.primitive(case_insensitive=True).get_function()(x)\n        expected = pd.Series([1])\n        pd.testing.assert_series_equal(actual, expected, check_names=False)\n\n    def test_with_featuretools(self, es):\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        primitive_instance = self.primitive()\n        transform.append(primitive_instance)\n        valid_dfs(es, aggregation, transform, self.primitive)\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/natural_language_primitives_tests/test_number_of_words_in_quotes.py",
    "content": "import numpy as np\nimport pandas as pd\nimport pytest\n\nfrom featuretools.primitives import NumberOfWordsInQuotes\nfrom featuretools.tests.primitive_tests.utils import (\n    PrimitiveTestBase,\n    find_applicable_primitives,\n    valid_dfs,\n)\n\n\nclass TestNumberOfWordsInQuotes(PrimitiveTestBase):\n    primitive = NumberOfWordsInQuotes\n\n    def test_regular_double_quotes_input(self):\n        x = pd.Series(\n            [\n                'Yes \"    \"',\n                '\"Hello this is a test\"',\n                '\"Yes\" \"   \"',\n                \"\",\n                '\"Python, java prolog\"',\n                '\"Python, java prolog\" three words here \"binary search algorithm\"',\n                '\"Diffie-Hellman key exchange\"',\n                '\"user@email.com\"',\n                '\"https://alteryx.com\"',\n                '\"100,000\"',\n                '\"This Borderlands game here\"\" is the perfect conclusion to the \"\"Borderlands 3\"\" line, which focuses on the fans \"\"favorite character and gives the players the opportunity to close for a long time some very important questions about\\'s character and the memorable scenery with which the players interact.',\n            ],\n        )\n        expected = pd.Series([0, 5, 1, 0, 3, 6, 3, 1, 1, 1, 6], dtype=\"Int64\")\n        actual = self.primitive(\"double\").get_function()(x)\n        pd.testing.assert_series_equal(actual, expected, check_names=False)\n\n    def test_captures_regular_single_quotes(self):\n        x = pd.Series(\n            [\n                \"'Hello this is a test'\",\n                \"'Python, Java Prolog'\",\n                \"'Python, Java Prolog' three words here 'three words here'\",\n                \"'Diffie-Hellman key exchange'\",\n                \"'user@email.com'\",\n                \"'https://alteryx.com'\",\n                \"'there's where's here's' word 'word'\",\n                \"'100,000'\",\n            ],\n        )\n        
expected = pd.Series([5, 3, 6, 3, 1, 1, 4, 1], dtype=\"Int64\")\n        actual = self.primitive(\"single\").get_function()(x)\n        pd.testing.assert_series_equal(actual, expected, check_names=False)\n\n    def test_captures_both_single_and_double_quotes(self):\n        x = pd.Series(\n            [\n                \"'test test test test' three words here \\\"test test test!\\\"\",\n            ],\n        )\n        expected = pd.Series([7], dtype=\"Int64\")\n        actual = self.primitive().get_function()(x)\n        pd.testing.assert_series_equal(actual, expected, check_names=False)\n\n    def test_unicode_input(self):\n        x = pd.Series(\n            [\n                '\"Ángel\"',\n                '\"Ángel\" word word',\n            ],\n        )\n        expected = pd.Series([1, 1], dtype=\"Int64\")\n        actual = self.primitive().get_function()(x)\n        pd.testing.assert_series_equal(actual, expected, check_names=False)\n\n    def test_multiline(self):\n        x = pd.Series(\n            [\n                \"'Yes\\n, this is me'\",\n            ],\n        )\n        expected = pd.Series([4], dtype=\"Int64\")\n        actual = self.primitive().get_function()(x)\n        pd.testing.assert_series_equal(actual, expected, check_names=False)\n\n    def test_raises_error_invalid_args(self):\n        error_msg = (\n            \"NULL is not a valid quote_type. 
Specify 'both', 'single', or 'double'\"\n        )\n        with pytest.raises(\n            ValueError,\n            match=error_msg,\n        ):\n            self.primitive(quote_type=\"NULL\")\n\n    def test_null(self):\n        x = pd.Series([np.nan, pd.NA, None, '\"test\"'])\n        actual = self.primitive().get_function()(x)\n        expected = pd.Series([pd.NA, pd.NA, pd.NA, 1.0], dtype=\"Int64\")\n        pd.testing.assert_series_equal(actual, expected, check_names=False)\n\n    def test_with_featuretools(self, es):\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        primitive_instance = self.primitive()\n        transform.append(primitive_instance)\n        valid_dfs(es, aggregation, transform, self.primitive)\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/natural_language_primitives_tests/test_punctuation_count.py",
    "content": "import numpy as np\nimport pandas as pd\n\nfrom featuretools.primitives import PunctuationCount\nfrom featuretools.tests.primitive_tests.utils import (\n    PrimitiveTestBase,\n    find_applicable_primitives,\n    valid_dfs,\n)\n\n\nclass TestPunctuationCount(PrimitiveTestBase):\n    primitive = PunctuationCount\n\n    def test_punctuation(self):\n        x = pd.Series(\n            [\n                \"This is a test file.\",\n                \"This, is second line?\",\n                \"third/line $1,000;\",\n                \"and--subsequen't lines...\",\n                \"*and, more..\",\n            ],\n        )\n        primitive_func = self.primitive().get_function()\n        answers = [1.0, 2.0, 4.0, 6.0, 4.0]\n        np.testing.assert_array_equal(primitive_func(x), answers)\n\n    def test_multiline(self):\n        x = pd.Series(\n            [\n                \"This is a test file.\",\n                \"This is second line\\nthird line $1000;\\nand subsequent lines\",\n            ],\n        )\n        primitive_func = self.primitive().get_function()\n        answers = [1.0, 2.0]\n        np.testing.assert_array_equal(primitive_func(x), answers)\n\n    def test_nan(self):\n        x = pd.Series([np.nan, \"\", \"This is a test file.\"])\n        primitive_func = self.primitive().get_function()\n        answers = [np.nan, 0.0, 1.0]\n        np.testing.assert_array_equal(primitive_func(x), answers)\n\n    def test_with_featuretools(self, es):\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        primitive_instance = self.primitive()\n        transform.append(primitive_instance)\n        valid_dfs(es, aggregation, transform, self.primitive)\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/natural_language_primitives_tests/test_title_word_count.py",
    "content": "import numpy as np\nimport pandas as pd\n\nfrom featuretools.primitives import TitleWordCount\nfrom featuretools.tests.primitive_tests.utils import (\n    PrimitiveTestBase,\n    find_applicable_primitives,\n    valid_dfs,\n)\n\n\nclass TestTitleWordCount(PrimitiveTestBase):\n    primitive = TitleWordCount\n\n    def test_strings(self):\n        x = pd.Series(\n            [\n                \"My favorite movie is Jaws.\",\n                \"this is a string\",\n                \"AAA\",\n                \"I bought a Yo-Yo\",\n            ],\n        )\n        primitive_func = self.primitive().get_function()\n        answers = [2.0, 0.0, 1.0, 2.0]\n        np.testing.assert_array_equal(answers, primitive_func(x))\n\n    def test_nan(self):\n        x = pd.Series([np.nan, \"\", \"My favorite movie is Jaws.\"])\n        primitive_func = self.primitive().get_function()\n        answers = [np.nan, 0.0, 2.0]\n        np.testing.assert_array_equal(answers, primitive_func(x))\n\n    def test_with_featuretools(self, es):\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        primitive_instance = self.primitive()\n        transform.append(primitive_instance)\n        valid_dfs(es, aggregation, transform, self.primitive)\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/natural_language_primitives_tests/test_total_word_length.py",
    "content": "import numpy as np\nimport pandas as pd\n\nfrom featuretools.primitives import TotalWordLength\nfrom featuretools.tests.primitive_tests.utils import (\n    PrimitiveTestBase,\n    find_applicable_primitives,\n    valid_dfs,\n)\n\n\nclass TestTotalWordLength(PrimitiveTestBase):\n    primitive = TotalWordLength\n\n    def test_delimiter_override(self):\n        x = pd.Series(\n            [\"This is a test file.\", \"This,is,second,line?\", \"and;subsequent;lines...\"],\n        )\n\n        expected = pd.Series([16, 17, 21])\n        actual = self.primitive(\"[ ,;]\").get_function()(x)\n        pd.testing.assert_series_equal(actual, expected, check_names=False)\n\n    def test_multiline(self):\n        x = pd.Series(\n            [\n                \"This is a test file.\",\n                \"This is second line\\nthird line $1000;\\nand subsequent lines\",\n            ],\n        )\n\n        expected = pd.Series([15, 47])\n        actual = self.primitive().get_function()(x)\n        pd.testing.assert_series_equal(actual, expected, check_names=False)\n\n    def test_null(self):\n        x = pd.Series([np.nan, pd.NA, None, \"This is a test file.\"])\n\n        expected = pd.Series([np.nan, np.nan, np.nan, 15])\n        actual = self.primitive().get_function()(x).astype(float)\n        pd.testing.assert_series_equal(actual, expected, check_names=False)\n\n    def test_with_featuretools(self, es):\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        primitive_instance = self.primitive()\n        transform.append(primitive_instance)\n        valid_dfs(es, aggregation, transform, self.primitive)\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/natural_language_primitives_tests/test_upper_case_count.py",
    "content": "import numpy as np\nimport pandas as pd\n\nfrom featuretools.primitives import UpperCaseCount\nfrom featuretools.tests.primitive_tests.utils import (\n    PrimitiveTestBase,\n    find_applicable_primitives,\n    valid_dfs,\n)\n\n\nclass TestUpperCaseCount(PrimitiveTestBase):\n    primitive = UpperCaseCount\n\n    def test_strings(self):\n        x = pd.Series(\n            [\"This IS a STRING.\", \"Testing AaA\", \"Testing AAA-BBB\", \"testing aaa\"],\n        )\n        primitive_func = self.primitive().get_function()\n        answers = [9.0, 3.0, 7.0, 0.0]\n        np.testing.assert_array_equal(primitive_func(x), answers)\n\n    def test_nan(self):\n        x = pd.Series([np.nan, \"\", \"This IS a STRING.\"])\n        primitive_func = self.primitive().get_function()\n        answers = [np.nan, 0.0, 9.0]\n        np.testing.assert_array_equal(primitive_func(x), answers)\n\n    def test_with_featuretools(self, es):\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        primitive_instance = self.primitive()\n        transform.append(primitive_instance)\n        valid_dfs(es, aggregation, transform, self.primitive)\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/natural_language_primitives_tests/test_upper_case_word_count.py",
    "content": "import numpy as np\nimport pandas as pd\n\nfrom featuretools.primitives import UpperCaseWordCount\n\n\nclass TestUpperCaseWordCount:\n    primitive = UpperCaseWordCount\n\n    def test_strings(self):\n        x = pd.Series(\n            [\n                \"This IS a STRING.\",\n                \"Testing AAA\",\n                \"Testing AAA BBB\",\n                \"Testing TEsTIng AA3 AA_33 HELLO\",\n                \"AAA $@()#$@@#$\",\n            ],\n            dtype=\"string\",\n        )\n        primitive_func = self.primitive().get_function()\n        answers = pd.Series([2, 1, 2, 3, 1], dtype=\"Int64\")\n        pd.testing.assert_series_equal(\n            primitive_func(x).astype(\"Int64\"),\n            answers,\n            check_names=False,\n        )\n\n    def test_nan(self):\n        x = pd.Series(\n            [\n                np.nan,\n                \"\",\n                \"This IS a STRING.\",\n            ],\n            dtype=\"string\",\n        )\n        primitive_func = self.primitive().get_function()\n        answers = pd.Series([pd.NA, 0, 2], dtype=\"Int64\")\n        pd.testing.assert_series_equal(\n            primitive_func(x).astype(\"Int64\"),\n            answers,\n            check_names=False,\n        )\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/natural_language_primitives_tests/test_whitespace_count.py",
    "content": "import numpy as np\nimport pandas as pd\n\nfrom featuretools.primitives import WhitespaceCount\nfrom featuretools.tests.primitive_tests.utils import (\n    PrimitiveTestBase,\n    find_applicable_primitives,\n    valid_dfs,\n)\n\n\nclass TestWhitespaceCount(PrimitiveTestBase):\n    primitive = WhitespaceCount\n\n    def compare(self, primitive_initiated, test_cases, answers):\n        primitive_func = primitive_initiated.get_function()\n        primitive_answers = primitive_func(test_cases)\n        return np.testing.assert_array_equal(answers, primitive_answers)\n\n    def test_strings(self):\n        x = pd.Series(\n            [\"\", \"hi im ethan!\", \"consecutive.    spaces.\", \" spaces-on-ends \"],\n        )\n        answers = [0, 2, 4, 2]\n        self.compare(self.primitive(), x, answers)\n\n    def test_nan(self):\n        x = pd.Series([np.nan, None, pd.NA, \"\", \"This IS a STRING.\"])\n        answers = [np.nan, np.nan, np.nan, 0, 3]\n        self.compare(self.primitive(), x, answers)\n\n    def test_with_featuretools(self, es):\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        primitive_instance = self.primitive()\n        transform.append(primitive_instance)\n        valid_dfs(es, aggregation, transform, self.primitive)\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/primitives_to_install/__init__.py",
    "content": ""
  },
  {
    "path": "featuretools/tests/primitive_tests/primitives_to_install/custom_max.py",
    "content": "from woodwork.column_schema import ColumnSchema\n\nfrom featuretools.primitives.base import AggregationPrimitive\n\n\nclass CustomMax(AggregationPrimitive):\n    name = \"custom_max\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/primitives_to_install/custom_mean.py",
    "content": "from woodwork.column_schema import ColumnSchema\n\nfrom featuretools.primitives.base import AggregationPrimitive\n\n\nclass CustomMean(AggregationPrimitive):\n    name = \"custom_mean\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/primitives_to_install/custom_sum.py",
    "content": "from woodwork.column_schema import ColumnSchema\n\nfrom featuretools.primitives.base import AggregationPrimitive\n\n\nclass CustomSum(AggregationPrimitive):\n    name = \"custom_sum\"\n    input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n    return_type = ColumnSchema(semantic_tags={\"numeric\"})\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/test_absolute_diff.py",
    "content": "import numpy as np\nimport pandas as pd\nimport pytest\n\nfrom featuretools.primitives import AbsoluteDiff\n\n\nclass TestAbsoluteDiff:\n    def test_nan(self):\n        data = pd.Series([np.nan, 5, 10, 20, np.nan, 10, np.nan])\n        answer = pd.Series([np.nan, np.nan, 5, 10, 0, 10, 0])\n        primitive_func = AbsoluteDiff().get_function()\n        given_answer = primitive_func(data)\n        np.testing.assert_array_equal(given_answer, answer)\n\n    def test_regular(self):\n        data = pd.Series([2, 5, 15, 3, 9, 4.5])\n        answer = pd.Series([np.nan, 3, 10, 12, 6, 4.5])\n        primitive_func = AbsoluteDiff().get_function()\n        given_answer = primitive_func(data)\n        np.testing.assert_array_equal(given_answer, answer)\n\n    def test_method(self):\n        data = pd.Series([2, np.nan, 15, 3, np.nan, 4.5])\n        answer = pd.Series([np.nan, 13, 0, 12, 1.5, 0])\n        primitive_func = AbsoluteDiff(method=\"backfill\").get_function()\n        given_answer = primitive_func(data)\n        np.testing.assert_array_equal(given_answer, answer)\n\n    def test_limit(self):\n        data = pd.Series([2, np.nan, np.nan, np.nan, 3.0, 4.5])\n        answer = pd.Series([np.nan, 0, 0, np.nan, np.nan, 1.5])\n        primitive_func = AbsoluteDiff(limit=2).get_function()\n        given_answer = primitive_func(data)\n        np.testing.assert_array_equal(given_answer, answer)\n\n    def test_zero(self):\n        data = pd.Series([2, 0, 0, 5, 0, -4])\n        answer = pd.Series([np.nan, 2, 0, 5, 5, 4])\n        primitive_func = AbsoluteDiff().get_function()\n        given_answer = primitive_func(data)\n        np.testing.assert_array_equal(given_answer, answer)\n\n    def test_empty(self):\n        data = pd.Series([], dtype=\"float64\")\n        answer = pd.Series([], dtype=\"float64\")\n        primitive_func = AbsoluteDiff().get_function()\n        given_answer = primitive_func(data)\n        np.testing.assert_array_equal(given_answer, 
answer)\n\n    def test_inf(self):\n        data = pd.Series([0, np.inf, 0, 5, -np.inf, np.inf, -np.inf])\n        answer = pd.Series([np.nan, np.inf, np.inf, 5, np.inf, np.inf, np.inf])\n        primitive_func = AbsoluteDiff().get_function()\n        given_answer = primitive_func(data)\n        np.testing.assert_array_equal(given_answer, answer)\n\n    def test_raises(self):\n        with pytest.raises(ValueError):\n            AbsoluteDiff(method=\"invalid\")\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/test_agg_feats.py",
    "content": "from datetime import datetime\nfrom inspect import isclass\nfrom math import isnan\n\nimport numpy as np\nimport pandas as pd\nimport pytest\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime\n\nfrom featuretools import (\n    AggregationFeature,\n    Feature,\n    IdentityFeature,\n    Timedelta,\n    calculate_feature_matrix,\n    dfs,\n    primitives,\n)\nfrom featuretools.entityset.relationship import RelationshipPath\nfrom featuretools.feature_base.cache import feature_cache\nfrom featuretools.primitives import (\n    Count,\n    Max,\n    Mean,\n    Median,\n    NMostCommon,\n    NumTrue,\n    NumUnique,\n    Sum,\n    TimeSinceFirst,\n    TimeSinceLast,\n    get_aggregation_primitives,\n)\nfrom featuretools.primitives.base import AggregationPrimitive\nfrom featuretools.synthesis.deep_feature_synthesis import DeepFeatureSynthesis, match\nfrom featuretools.tests.testing_utils import backward_path, feature_with_name\n\n\n@pytest.fixture(autouse=True)\ndef reset_dfs_cache():\n    feature_cache.enabled = False\n    feature_cache.clear_all()\n\n\ndef test_get_depth(es):\n    log_id_feat = IdentityFeature(es[\"log\"].ww[\"id\"])\n    customer_id_feat = IdentityFeature(es[\"customers\"].ww[\"id\"])\n    count_logs = Feature(log_id_feat, parent_dataframe_name=\"sessions\", primitive=Count)\n    sum_count_logs = Feature(\n        count_logs,\n        parent_dataframe_name=\"customers\",\n        primitive=Sum,\n    )\n    num_logs_greater_than_5 = sum_count_logs > 5\n    count_customers = Feature(\n        customer_id_feat,\n        parent_dataframe_name=\"régions\",\n        where=num_logs_greater_than_5,\n        primitive=Count,\n    )\n    num_customers_region = Feature(count_customers, dataframe_name=\"customers\")\n\n    depth = num_customers_region.get_depth()\n    assert depth == 5\n\n\ndef test_makes_count(es):\n    dfs = DeepFeatureSynthesis(\n        target_dataframe_name=\"sessions\",\n        
entityset=es,\n        agg_primitives=[Count],\n        trans_primitives=[],\n    )\n\n    features = dfs.build_features()\n    assert feature_with_name(features, \"device_type\")\n    assert feature_with_name(features, \"customer_id\")\n    assert feature_with_name(features, \"customers.région_id\")\n    assert feature_with_name(features, \"customers.age\")\n    assert feature_with_name(features, \"COUNT(log)\")\n    assert feature_with_name(features, \"customers.COUNT(sessions)\")\n    assert feature_with_name(features, \"customers.régions.language\")\n    assert feature_with_name(features, \"customers.COUNT(log)\")\n\n\ndef test_count_null(es):\n    class Count(AggregationPrimitive):\n        name = \"count\"\n        input_types = [[ColumnSchema(semantic_tags={\"foreign_key\"})], [ColumnSchema()]]\n        return_type = ColumnSchema(semantic_tags={\"numeric\"})\n        stack_on_self = False\n\n        def __init__(self, count_null=True):\n            self.count_null = count_null\n\n        def get_function(self):\n            def count_func(values):\n                if self.count_null:\n                    values = values.fillna(0)\n\n                return values.count()\n\n            return count_func\n\n        def generate_name(\n            self,\n            base_feature_names,\n            relationship_path_name,\n            parent_dataframe_name,\n            where_str,\n            use_prev_str,\n        ):\n            return \"COUNT(%s%s%s)\" % (relationship_path_name, where_str, use_prev_str)\n\n    count_null = Feature(\n        es[\"log\"].ww[\"value\"],\n        parent_dataframe_name=\"sessions\",\n        primitive=Count(count_null=True),\n    )\n    feature_matrix = calculate_feature_matrix([count_null], entityset=es)\n    values = [5, 4, 1, 2, 3, 2]\n    assert (values == feature_matrix[count_null.get_name()]).all()\n\n\ndef test_check_input_types(es):\n    count = Feature(\n        es[\"sessions\"].ww[\"id\"],\n        
parent_dataframe_name=\"customers\",\n        primitive=Count,\n    )\n    mean = Feature(count, parent_dataframe_name=\"régions\", primitive=Mean)\n    assert mean._check_input_types()\n\n    boolean = count > 3\n    mean = Feature(\n        count,\n        parent_dataframe_name=\"régions\",\n        where=boolean,\n        primitive=Mean,\n    )\n    assert mean._check_input_types()\n\n\ndef test_mean_nan(es):\n    array = pd.Series([5, 5, 5, 5, 5])\n    mean_func_nans_default = Mean().get_function()\n    mean_func_nans_false = Mean(skipna=False).get_function()\n    mean_func_nans_true = Mean(skipna=True).get_function()\n    assert mean_func_nans_default(array) == 5\n    assert mean_func_nans_false(array) == 5\n    assert mean_func_nans_true(array) == 5\n    array = pd.Series([5, np.nan, np.nan, np.nan, np.nan, 10])\n    assert mean_func_nans_default(array) == 7.5\n    assert isnan(mean_func_nans_false(array))\n    assert mean_func_nans_true(array) == 7.5\n    array_nans = pd.Series([np.nan, np.nan, np.nan, np.nan])\n    assert isnan(mean_func_nans_default(array_nans))\n    assert isnan(mean_func_nans_false(array_nans))\n    assert isnan(mean_func_nans_true(array_nans))\n\n    # test naming\n    default_feat = Feature(\n        es[\"log\"].ww[\"value\"],\n        parent_dataframe_name=\"customers\",\n        primitive=Mean,\n    )\n    assert default_feat.get_name() == \"MEAN(log.value)\"\n    ignore_nan_feat = Feature(\n        es[\"log\"].ww[\"value\"],\n        parent_dataframe_name=\"customers\",\n        primitive=Mean(skipna=True),\n    )\n    assert ignore_nan_feat.get_name() == \"MEAN(log.value)\"\n    include_nan_feat = Feature(\n        es[\"log\"].ww[\"value\"],\n        parent_dataframe_name=\"customers\",\n        primitive=Mean(skipna=False),\n    )\n    assert include_nan_feat.get_name() == \"MEAN(log.value, skipna=False)\"\n\n\ndef test_init_and_name(es):\n    log = es[\"log\"]\n\n    # Add a BooleanNullable column so primitives with that input 
type get tested\n    boolean_nullable = log.ww[\"purchased\"]\n    boolean_nullable = boolean_nullable.ww.set_logical_type(\"BooleanNullable\")\n    log.ww[\"boolean_nullable\"] = boolean_nullable\n\n    features = [Feature(es[\"log\"].ww[col]) for col in log.columns]\n\n    # check all primitives have name\n    for attribute_string in dir(primitives):\n        attr = getattr(primitives, attribute_string)\n        if isclass(attr):\n            if issubclass(attr, AggregationPrimitive) and attr != AggregationPrimitive:\n                assert getattr(attr, \"name\") is not None\n\n    agg_primitives = get_aggregation_primitives().values()\n\n    for agg_prim in agg_primitives:\n        input_types = agg_prim.input_types\n        if not isinstance(input_types[0], list):\n            input_types = [input_types]\n\n        # test each allowed input_types for this primitive\n        for it in input_types:\n            # use the input_types matching function from DFS\n            matching_types = match(it, features)\n            if len(matching_types) == 0:\n                raise Exception(\"Agg Primitive %s not tested\" % agg_prim.name)\n            for t in matching_types:\n                instance = Feature(\n                    t,\n                    parent_dataframe_name=\"sessions\",\n                    primitive=agg_prim,\n                )\n\n                # try to get name and calculate\n                instance.get_name()\n                calculate_feature_matrix([instance], entityset=es)\n\n\ndef test_invalid_init_args(diamond_es):\n    error_text = \"parent_dataframe must match first relationship in path\"\n    with pytest.raises(AssertionError, match=error_text):\n        path = backward_path(diamond_es, [\"stores\", \"transactions\"])\n        AggregationFeature(\n            IdentityFeature(diamond_es[\"transactions\"].ww[\"amount\"]),\n            \"customers\",\n            Mean,\n            relationship_path=path,\n        )\n\n    error_text = 
(\n        \"Base feature must be defined on the dataframe at the end of relationship_path\"\n    )\n    with pytest.raises(AssertionError, match=error_text):\n        path = backward_path(diamond_es, [\"regions\", \"stores\"])\n        AggregationFeature(\n            IdentityFeature(diamond_es[\"transactions\"].ww[\"amount\"]),\n            \"regions\",\n            Mean,\n            relationship_path=path,\n        )\n\n    error_text = \"All relationships in path must be backward\"\n    with pytest.raises(AssertionError, match=error_text):\n        backward = backward_path(diamond_es, [\"customers\", \"transactions\"])\n        forward = RelationshipPath([(True, r) for _, r in backward])\n        path = RelationshipPath(list(forward) + list(backward))\n        AggregationFeature(\n            IdentityFeature(diamond_es[\"transactions\"].ww[\"amount\"]),\n            \"transactions\",\n            Mean,\n            relationship_path=path,\n        )\n\n\ndef test_init_with_multiple_possible_paths(diamond_es):\n    error_text = (\n        \"There are multiple possible paths to the base dataframe. 
\"\n        \"You must specify a relationship path.\"\n    )\n    with pytest.raises(RuntimeError, match=error_text):\n        AggregationFeature(\n            IdentityFeature(diamond_es[\"transactions\"].ww[\"amount\"]),\n            \"regions\",\n            Mean,\n        )\n\n    # Does not raise if path specified.\n    path = backward_path(diamond_es, [\"regions\", \"customers\", \"transactions\"])\n    AggregationFeature(\n        IdentityFeature(diamond_es[\"transactions\"].ww[\"amount\"]),\n        \"regions\",\n        Mean,\n        relationship_path=path,\n    )\n\n\ndef test_init_with_single_possible_path(diamond_es):\n    # This uses diamond_es to test that there being a cycle somewhere in the\n    # graph doesn't cause an error.\n    feat = AggregationFeature(\n        IdentityFeature(diamond_es[\"transactions\"].ww[\"amount\"]),\n        \"customers\",\n        Mean,\n    )\n    expected_path = backward_path(diamond_es, [\"customers\", \"transactions\"])\n    assert feat.relationship_path == expected_path\n\n\ndef test_init_with_no_path(diamond_es):\n    error_text = 'No backward path from \"transactions\" to \"customers\" found.'\n    with pytest.raises(RuntimeError, match=error_text):\n        AggregationFeature(\n            IdentityFeature(diamond_es[\"customers\"].ww[\"name\"]),\n            \"transactions\",\n            Count,\n        )\n\n    error_text = 'No backward path from \"transactions\" to \"transactions\" found.'\n    with pytest.raises(RuntimeError, match=error_text):\n        AggregationFeature(\n            IdentityFeature(diamond_es[\"transactions\"].ww[\"amount\"]),\n            \"transactions\",\n            Mean,\n        )\n\n\ndef test_name_with_multiple_possible_paths(diamond_es):\n    path = backward_path(diamond_es, [\"regions\", \"customers\", \"transactions\"])\n    feat = AggregationFeature(\n        IdentityFeature(diamond_es[\"transactions\"].ww[\"amount\"]),\n        \"regions\",\n        Mean,\n        
relationship_path=path,\n    )\n\n    assert feat.get_name() == \"MEAN(customers.transactions.amount)\"\n    assert feat.relationship_path_name() == \"customers.transactions\"\n\n\ndef test_copy(games_es):\n    home_games = next(\n        r for r in games_es.relationships if r._child_column_name == \"home_team_id\"\n    )\n    path = RelationshipPath([(False, home_games)])\n    feat = AggregationFeature(\n        IdentityFeature(games_es[\"games\"].ww[\"home_team_score\"]),\n        \"teams\",\n        relationship_path=path,\n        primitive=Mean,\n    )\n    copied = feat.copy()\n    assert copied.dataframe_name == feat.dataframe_name\n    assert copied.base_features == feat.base_features\n    assert copied.relationship_path == feat.relationship_path\n    assert copied.primitive == feat.primitive\n\n\ndef test_serialization(es):\n    value = IdentityFeature(es[\"log\"].ww[\"value\"])\n    primitive = Max()\n    max1 = AggregationFeature(value, \"customers\", primitive)\n\n    path = next(es.find_backward_paths(\"customers\", \"log\"))\n    dictionary = {\n        \"name\": max1.get_name(),\n        \"base_features\": [value.unique_name()],\n        \"relationship_path\": [r.to_dictionary() for r in path],\n        \"primitive\": primitive,\n        \"where\": None,\n        \"use_previous\": None,\n    }\n\n    assert dictionary == max1.get_arguments()\n    deserialized = AggregationFeature.from_dictionary(\n        dictionary,\n        es,\n        {value.unique_name(): value},\n        primitive,\n    )\n    _assert_agg_feats_equal(max1, deserialized)\n\n    is_purchased = IdentityFeature(es[\"log\"].ww[\"purchased\"])\n    use_previous = Timedelta(3, \"d\")\n    max2 = AggregationFeature(\n        value,\n        \"customers\",\n        primitive,\n        where=is_purchased,\n        use_previous=use_previous,\n    )\n\n    dictionary = {\n        \"name\": max2.get_name(),\n        \"base_features\": [value.unique_name()],\n        \"relationship_path\": 
[r.to_dictionary() for r in path],\n        \"primitive\": primitive,\n        \"where\": is_purchased.unique_name(),\n        \"use_previous\": use_previous.get_arguments(),\n    }\n\n    assert dictionary == max2.get_arguments()\n    dependencies = {\n        value.unique_name(): value,\n        is_purchased.unique_name(): is_purchased,\n    }\n    deserialized = AggregationFeature.from_dictionary(\n        dictionary,\n        es,\n        dependencies,\n        primitive,\n    )\n    _assert_agg_feats_equal(max2, deserialized)\n\n\ndef test_time_since_last(es):\n    f = Feature(\n        es[\"log\"].ww[\"datetime\"],\n        parent_dataframe_name=\"customers\",\n        primitive=TimeSinceLast,\n    )\n    fm = calculate_feature_matrix(\n        [f],\n        entityset=es,\n        instance_ids=[0, 1, 2],\n        cutoff_time=datetime(2015, 6, 8),\n    )\n\n    correct = [131376000.0, 131289534.0, 131287797.0]\n    # note: must round to nearest second\n    assert all(fm[f.get_name()].round().values == correct)\n\n\ndef test_time_since_first(es):\n    f = Feature(\n        es[\"log\"].ww[\"datetime\"],\n        parent_dataframe_name=\"customers\",\n        primitive=TimeSinceFirst,\n    )\n    fm = calculate_feature_matrix(\n        [f],\n        entityset=es,\n        instance_ids=[0, 1, 2],\n        cutoff_time=datetime(2015, 6, 8),\n    )\n\n    correct = [131376600.0, 131289600.0, 131287800.0]\n    # note: must round to nearest second\n    assert all(fm[f.get_name()].round().values == correct)\n\n\ndef test_median(es):\n    f = Feature(\n        es[\"log\"].ww[\"value_many_nans\"],\n        parent_dataframe_name=\"customers\",\n        primitive=Median,\n    )\n    fm = calculate_feature_matrix(\n        [f],\n        entityset=es,\n        instance_ids=[0, 1, 2],\n        cutoff_time=datetime(2015, 6, 8),\n    )\n\n    correct = [1, 3, np.nan]\n    np.testing.assert_equal(fm[f.get_name()].values, correct)\n\n\ndef test_agg_same_method_name(es):\n    
    \"\"\"\n    Pandas relies on the function name when calculating aggregations. This means if two\n    primitives with the same function name are applied to the same column, pandas\n    can't differentiate them. We have a workaround for this based on the name property\n    that we test here.\n    \"\"\"\n\n    # test with normally defined functions\n    class Sum(AggregationPrimitive):\n        name = \"sum\"\n        input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n        return_type = ColumnSchema(semantic_tags={\"numeric\"})\n\n        def get_function(self):\n            def custom_primitive(x):\n                return x.sum()\n\n            return custom_primitive\n\n    class Max(AggregationPrimitive):\n        name = \"max\"\n        input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n        return_type = ColumnSchema(semantic_tags={\"numeric\"})\n\n        def get_function(self):\n            def custom_primitive(x):\n                return x.max()\n\n            return custom_primitive\n\n    f_sum = Feature(\n        es[\"log\"].ww[\"value\"],\n        parent_dataframe_name=\"customers\",\n        primitive=Sum,\n    )\n    f_max = Feature(\n        es[\"log\"].ww[\"value\"],\n        parent_dataframe_name=\"customers\",\n        primitive=Max,\n    )\n\n    fm = calculate_feature_matrix([f_sum, f_max], entityset=es)\n    assert fm.columns.tolist() == [f_sum.get_name(), f_max.get_name()]\n\n    # test with lambdas\n    class Sum(AggregationPrimitive):\n        name = \"sum\"\n        input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n        return_type = ColumnSchema(semantic_tags={\"numeric\"})\n\n        def get_function(self):\n            return lambda x: x.sum()\n\n    class Max(AggregationPrimitive):\n        name = \"max\"\n        input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n        return_type = ColumnSchema(semantic_tags={\"numeric\"})\n\n        def get_function(self):\n            return lambda x: 
x.max()\n\n    f_sum = Feature(\n        es[\"log\"].ww[\"value\"],\n        parent_dataframe_name=\"customers\",\n        primitive=Sum,\n    )\n    f_max = Feature(\n        es[\"log\"].ww[\"value\"],\n        parent_dataframe_name=\"customers\",\n        primitive=Max,\n    )\n    fm = calculate_feature_matrix([f_sum, f_max], entityset=es)\n    assert fm.columns.tolist() == [f_sum.get_name(), f_max.get_name()]\n\n\ndef test_time_since_last_custom(es):\n    class TimeSinceLast(AggregationPrimitive):\n        name = \"time_since_last\"\n        input_types = [\n            ColumnSchema(logical_type=Datetime, semantic_tags={\"time_index\"}),\n        ]\n        return_type = ColumnSchema(semantic_tags={\"numeric\"})\n        uses_calc_time = True\n\n        def get_function(self):\n            def time_since_last(values, time):\n                time_since = time - values.iloc[0]\n                return time_since.total_seconds()\n\n            return time_since_last\n\n    f = Feature(\n        es[\"log\"].ww[\"datetime\"],\n        parent_dataframe_name=\"customers\",\n        primitive=TimeSinceLast,\n    )\n    fm = calculate_feature_matrix(\n        [f],\n        entityset=es,\n        instance_ids=[0, 1, 2],\n        cutoff_time=datetime(2015, 6, 8),\n    )\n\n    correct = [131376600, 131289600, 131287800]\n    # note: must round to nearest second\n    assert all(fm[f.get_name()].round().values == correct)\n\n\ndef test_custom_primitive_multiple_inputs(es):\n    class MeanSunday(AggregationPrimitive):\n        name = \"mean_sunday\"\n        input_types = [\n            ColumnSchema(semantic_tags={\"numeric\"}),\n            ColumnSchema(logical_type=Datetime),\n        ]\n        return_type = ColumnSchema(semantic_tags={\"numeric\"})\n\n        def get_function(self):\n            def mean_sunday(numeric, datetime):\n                \"\"\"\n                Finds the mean of non-null values of a feature that occurred on Sundays\n                \"\"\"\n      
          days = pd.DatetimeIndex(datetime).weekday.values\n                df = pd.DataFrame({\"numeric\": numeric, \"time\": days})\n                return df[df[\"time\"] == 6][\"numeric\"].mean()\n\n            return mean_sunday\n\n    fm, features = dfs(\n        entityset=es,\n        target_dataframe_name=\"sessions\",\n        agg_primitives=[MeanSunday],\n        trans_primitives=[],\n    )\n    mean_sunday_value = pd.Series([None, None, None, 2.5, 7, None])\n    iterator = zip(fm[\"MEAN_SUNDAY(log.value, datetime)\"], mean_sunday_value)\n    for x, y in iterator:\n        assert (pd.isnull(x) and pd.isnull(y)) or (x == y)\n\n    es.add_interesting_values()\n    mean_sunday_value_priority_0 = pd.Series([None, None, None, 2.5, 0, None])\n    fm, features = dfs(\n        entityset=es,\n        target_dataframe_name=\"sessions\",\n        agg_primitives=[MeanSunday],\n        trans_primitives=[],\n        where_primitives=[MeanSunday],\n    )\n    where_feat = \"MEAN_SUNDAY(log.value, datetime WHERE priority_level = 0)\"\n    for x, y in zip(fm[where_feat], mean_sunday_value_priority_0):\n        assert (pd.isnull(x) and pd.isnull(y)) or (x == y)\n\n\ndef test_custom_primitive_default_kwargs(es):\n    class SumNTimes(AggregationPrimitive):\n        name = \"sum_n_times\"\n        input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n        return_type = ColumnSchema(semantic_tags={\"numeric\"})\n\n        def __init__(self, n=1):\n            self.n = n\n\n    sum_n_1_n = 1\n    sum_n_1_base_f = Feature(es[\"log\"].ww[\"value\"])\n    sum_n_1 = Feature(\n        [sum_n_1_base_f],\n        parent_dataframe_name=\"sessions\",\n        primitive=SumNTimes(n=sum_n_1_n),\n    )\n    sum_n_2_n = 2\n    sum_n_2_base_f = Feature(es[\"log\"].ww[\"value_2\"])\n    sum_n_2 = Feature(\n        [sum_n_2_base_f],\n        parent_dataframe_name=\"sessions\",\n        primitive=SumNTimes(n=sum_n_2_n),\n    )\n    assert sum_n_1_base_f == sum_n_1.base_features[0]\n    
assert sum_n_1_n == sum_n_1.primitive.n\n    assert sum_n_2_base_f == sum_n_2.base_features[0]\n    assert sum_n_2_n == sum_n_2.primitive.n\n\n\ndef test_makes_numtrue(es):\n    dfs = DeepFeatureSynthesis(\n        target_dataframe_name=\"sessions\",\n        entityset=es,\n        agg_primitives=[NumTrue],\n        trans_primitives=[],\n    )\n    features = dfs.build_features()\n    assert feature_with_name(features, \"customers.NUM_TRUE(log.purchased)\")\n    assert feature_with_name(features, \"NUM_TRUE(log.purchased)\")\n\n\ndef test_make_three_most_common(es):\n    class NMostCommoner(AggregationPrimitive):\n        name = \"pd_top3\"\n        input_types = ([ColumnSchema(semantic_tags={\"category\"})],)\n        return_type = None\n        number_output_features = 3\n\n        def get_function(self):\n            def pd_top3(x):\n                counts = x.value_counts()\n                counts = counts[counts > 0]\n                array = np.array(counts[:3].index)\n                if len(array) < 3:\n                    filler = np.full(3 - len(array), np.nan)\n                    array = np.append(array, filler)\n                return array\n\n            return pd_top3\n\n    fm, features = dfs(\n        entityset=es,\n        target_dataframe_name=\"customers\",\n        instance_ids=[0, 1, 2],\n        agg_primitives=[NMostCommoner],\n        trans_primitives=[],\n    )\n\n    df = fm[[\"PD_TOP3(log.product_id)[%s]\" % i for i in range(3)]]\n\n    assert set(df.iloc[0].values[:2]) == set(\n        [\"coke zero\", \"toothpaste\"],\n    )  # coke zero and toothpaste have same number of occurrences\n    assert df.iloc[0].values[2] in [\n        \"car\",\n        \"brown bag\",\n    ]  # so just check that the top two match\n\n    assert (\n        df.iloc[1]\n        .reset_index(drop=True)\n        .equals(pd.Series([\"coke zero\", \"Haribo sugar-free gummy bears\", np.nan]))\n    )\n    assert (\n        df.iloc[2]\n        .reset_index(drop=True)\n    
    .equals(pd.Series([\"taco clock\", np.nan, np.nan]))\n    )\n\n\ndef test_stacking_multi(es):\n    threecommon = NMostCommon(3)\n    tc = Feature(\n        es[\"log\"].ww[\"product_id\"],\n        parent_dataframe_name=\"sessions\",\n        primitive=threecommon,\n    )\n\n    stacked = []\n    for i in range(3):\n        stacked.append(\n            Feature(tc[i], parent_dataframe_name=\"customers\", primitive=NumUnique),\n        )\n\n    fm = calculate_feature_matrix(stacked, entityset=es, instance_ids=[0, 1, 2])\n\n    correct_vals = [[3, 2, 1], [2, 1, 0], [0, 0, 0]]\n    correct_vals1 = [[3, 1, 1], [2, 1, 0], [0, 0, 0]]\n    # either of the above can be correct, and the outcome depends on the sorting of\n    # two values in the initial n most common function, which changes arbitrarily.\n\n    for i in range(3):\n        f = \"NUM_UNIQUE(sessions.N_MOST_COMMON(log.product_id)[%d])\" % i\n        cols = fm.columns\n        assert f in cols\n        assert (\n            fm[cols[i]].tolist() == correct_vals[i]\n            or fm[cols[i]].tolist() == correct_vals1[i]\n        )\n\n\ndef test_use_previous_pd_dateoffset(es):\n    total_events_pd = Feature(\n        es[\"log\"].ww[\"id\"],\n        parent_dataframe_name=\"customers\",\n        use_previous=pd.DateOffset(hours=47, minutes=60),\n        primitive=Count,\n    )\n\n    feature_matrix = calculate_feature_matrix(\n        [total_events_pd],\n        es,\n        cutoff_time=pd.Timestamp(\"2011-04-11 10:31:30\"),\n        instance_ids=[0, 1, 2],\n    )\n    col_name = list(feature_matrix.head().keys())[0]\n    assert (feature_matrix[col_name] == [1, 5, 2]).all()\n\n\ndef _assert_agg_feats_equal(f1, f2):\n    assert f1.unique_name() == f2.unique_name()\n    assert f1.child_dataframe_name == f2.child_dataframe_name\n    assert f1.parent_dataframe_name == f2.parent_dataframe_name\n    assert f1.relationship_path == f2.relationship_path\n    assert f1.use_previous == f2.use_previous\n\n\ndef 
test_override_multi_feature_names(es):\n    def gen_custom_names(\n        primitive,\n        base_feature_names,\n        relationship_path_name,\n        parent_dataframe_name,\n        where_str,\n        use_prev_str,\n    ):\n        base_string = \"Custom_%s({}.{})\".format(\n            parent_dataframe_name,\n            base_feature_names,\n        )\n        return [base_string % i for i in range(primitive.number_output_features)]\n\n    class NMostCommoner(AggregationPrimitive):\n        name = \"pd_top3\"\n        input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n        return_type = ColumnSchema(semantic_tags={\"category\"})\n        number_output_features = 3\n\n        def generate_names(\n            self,\n            base_feature_names,\n            relationship_path_name,\n            parent_dataframe_name,\n            where_str,\n            use_prev_str,\n        ):\n            return gen_custom_names(\n                self,\n                base_feature_names,\n                relationship_path_name,\n                parent_dataframe_name,\n                where_str,\n                use_prev_str,\n            )\n\n    fm, features = dfs(\n        entityset=es,\n        target_dataframe_name=\"products\",\n        instance_ids=[0, 1, 2],\n        agg_primitives=[NMostCommoner],\n        trans_primitives=[],\n    )\n\n    expected_names = []\n    base_names = [[\"value\"], [\"value_2\"], [\"value_many_nans\"]]\n    for name in base_names:\n        expected_names += gen_custom_names(\n            NMostCommoner,\n            name,\n            None,\n            \"products\",\n            None,\n            None,\n        )\n\n    for name in expected_names:\n        assert name in fm.columns\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/test_all_primitive_docstrings.py",
    "content": "from featuretools.primitives import get_aggregation_primitives, get_transform_primitives\n\n\ndef docstring_is_uniform(primitive):\n    docstring = primitive.__doc__\n    valid_verbs = [\n        \"Calculates\",\n        \"Determines\",\n        \"Transforms\",\n        \"Computes\",\n        \"Counts\",\n        \"Negates\",\n        \"Adds\",\n        \"Subtracts\",\n        \"Multiplies\",\n        \"Divides\",\n        \"Performs\",\n        \"Returns\",\n        \"Shifts\",\n        \"Extracts\",\n        \"Applies\",\n    ]\n    return any(docstring.startswith(s) for s in valid_verbs)\n\n\ndef test_transform_primitive_docstrings():\n    for primitive in get_transform_primitives().values():\n        assert docstring_is_uniform(primitive)\n\n\ndef test_aggregation_primitive_docstrings():\n    for primitive in get_aggregation_primitives().values():\n        assert docstring_is_uniform(primitive)\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/test_direct_features.py",
    "content": "import numpy as np\nimport pandas as pd\nimport pytest\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime\n\nfrom featuretools.computational_backends.feature_set import FeatureSet\nfrom featuretools.computational_backends.feature_set_calculator import (\n    FeatureSetCalculator,\n)\nfrom featuretools.feature_base import DirectFeature, Feature, IdentityFeature\nfrom featuretools.primitives import (\n    AggregationPrimitive,\n    Day,\n    Hour,\n    Minute,\n    Month,\n    NMostCommon,\n    Second,\n    TransformPrimitive,\n    Year,\n)\nfrom featuretools.primitives.utils import PrimitivesDeserializer\nfrom featuretools.synthesis import dfs\n\n\ndef test_direct_from_identity(es):\n    device = Feature(es[\"sessions\"].ww[\"device_type\"])\n    d = DirectFeature(base_feature=device, child_dataframe_name=\"log\")\n\n    feature_set = FeatureSet([d])\n    calculator = FeatureSetCalculator(es, feature_set=feature_set, time_last=None)\n    df = calculator.run(np.array([0, 5]))\n    v = df[d.get_name()].tolist()\n    expected = [0, 1]\n    assert v == expected\n\n\ndef test_direct_from_column(es):\n    # should be same behavior as test_direct_from_identity\n    device = Feature(es[\"sessions\"].ww[\"device_type\"])\n    d = DirectFeature(base_feature=device, child_dataframe_name=\"log\")\n\n    feature_set = FeatureSet([d])\n    calculator = FeatureSetCalculator(es, feature_set=feature_set, time_last=None)\n    df = calculator.run(np.array([0, 5]))\n    v = df[d.get_name()].tolist()\n    expected = [0, 1]\n    assert v == expected\n\n\ndef test_direct_rename_multioutput(es):\n    n_common = Feature(\n        es[\"log\"].ww[\"product_id\"],\n        parent_dataframe_name=\"customers\",\n        primitive=NMostCommon(n=2),\n    )\n    feat = DirectFeature(n_common, \"sessions\")\n    copy_feat = feat.rename(\"session_test\")\n    assert feat.unique_name() != copy_feat.unique_name()\n    assert feat.get_name() != 
copy_feat.get_name()\n    assert (\n        feat.base_features[0].generate_name()\n        == copy_feat.base_features[0].generate_name()\n    )\n    assert feat.dataframe_name == copy_feat.dataframe_name\n\n\ndef test_direct_rename(es):\n    # should be same behavior as test_direct_from_identity\n    feat = DirectFeature(\n        base_feature=IdentityFeature(es[\"sessions\"].ww[\"device_type\"]),\n        child_dataframe_name=\"log\",\n    )\n    copy_feat = feat.rename(\"session_test\")\n    assert feat.unique_name() != copy_feat.unique_name()\n    assert feat.get_name() != copy_feat.get_name()\n    assert (\n        feat.base_features[0].generate_name()\n        == copy_feat.base_features[0].generate_name()\n    )\n    assert feat.dataframe_name == copy_feat.dataframe_name\n\n\ndef test_direct_copy(games_es):\n    home_team = next(\n        r for r in games_es.relationships if r._child_column_name == \"home_team_id\"\n    )\n    feat = DirectFeature(\n        IdentityFeature(games_es[\"teams\"].ww[\"name\"]),\n        \"games\",\n        relationship=home_team,\n    )\n    copied = feat.copy()\n    assert copied.dataframe_name == feat.dataframe_name\n    assert copied.base_features == feat.base_features\n    assert copied.relationship_path == feat.relationship_path\n\n\ndef test_direct_of_multi_output_transform_feat(es):\n    class TestTime(TransformPrimitive):\n        name = \"test_time\"\n        input_types = [ColumnSchema(logical_type=Datetime)]\n        return_type = ColumnSchema(semantic_tags={\"numeric\"})\n        number_output_features = 6\n\n        def get_function(self):\n            def test_f(x):\n                times = pd.Series(x)\n                units = [\"year\", \"month\", \"day\", \"hour\", \"minute\", \"second\"]\n                return [times.apply(lambda x: getattr(x, unit)) for unit in units]\n\n            return test_f\n\n    base_feature = IdentityFeature(es[\"customers\"].ww[\"signup_date\"])\n    join_time_split = 
Feature(base_feature, primitive=TestTime)\n    alt_features = [\n        Feature(base_feature, primitive=Year),\n        Feature(base_feature, primitive=Month),\n        Feature(base_feature, primitive=Day),\n        Feature(base_feature, primitive=Hour),\n        Feature(base_feature, primitive=Minute),\n        Feature(base_feature, primitive=Second),\n    ]\n    fm, fl = dfs(\n        entityset=es,\n        target_dataframe_name=\"sessions\",\n        trans_primitives=[TestTime, Year, Month, Day, Hour, Minute, Second],\n    )\n\n    # Get column names for the multi-output feature and the normal features\n    subnames = DirectFeature(join_time_split, \"sessions\").get_feature_names()\n    altnames = [DirectFeature(f, \"sessions\").get_name() for f in alt_features]\n\n    # Check that values are equal between the two sets of columns\n    for col1, col2 in zip(subnames, altnames):\n        assert (fm[col1] == fm[col2]).all()\n\n\ndef test_direct_features_of_multi_output_agg_primitives(es):\n    class ThreeMostCommonCat(AggregationPrimitive):\n        name = \"n_most_common_categorical\"\n        input_types = [ColumnSchema(semantic_tags={\"category\"})]\n        return_type = ColumnSchema(semantic_tags={\"category\"})\n        number_output_features = 3\n\n        def get_function(self):\n            def pd_top3(x):\n                counts = x.value_counts()\n                counts = counts[counts > 0]\n                array = np.array(counts.index[:3])\n                if len(array) < 3:\n                    filler = np.full(3 - len(array), np.nan)\n                    array = np.append(array, filler)\n                return array\n\n            return pd_top3\n\n    fm, fl = dfs(\n        entityset=es,\n        target_dataframe_name=\"log\",\n        agg_primitives=[ThreeMostCommonCat],\n        trans_primitives=[],\n        max_depth=3,\n    )\n\n    has_nmost_as_base = []\n    for feature in fl:\n        is_base = False\n        if len(feature.base_features) > 0 and isinstance(\n            
feature.base_features[0].primitive,\n            ThreeMostCommonCat,\n        ):\n            is_base = True\n        has_nmost_as_base.append(is_base)\n    assert any(has_nmost_as_base)\n\n    true_result_rows = []\n    session_data = {\n        0: [\"coke zero\", \"car\", np.nan],\n        1: [\"toothpaste\", \"brown bag\", np.nan],\n        2: [\"brown bag\", np.nan, np.nan],\n        3: set([\"Haribo sugar-free gummy bears\", \"coke zero\", np.nan]),\n        4: [\"coke zero\", np.nan, np.nan],\n        5: [\"taco clock\", np.nan, np.nan],\n    }\n    for i, count in enumerate([5, 4, 1, 2, 3, 2]):\n        while count > 0:\n            true_result_rows.append(session_data[i])\n            count -= 1\n\n    tempname = \"sessions.N_MOST_COMMON_CATEGORICAL(log.product_id)[%s]\"\n    for i, row in enumerate(true_result_rows):\n        for j in range(3):\n            value = fm[tempname % (j)][i]\n            if isinstance(row, set):\n                assert pd.isnull(value) or value in row\n            else:\n                assert (pd.isnull(value) and pd.isnull(row[j])) or value == row[j]\n\n\ndef test_direct_with_invalid_init_args(diamond_es):\n    customer_to_region = diamond_es.get_forward_relationships(\"customers\")[0]\n    error_text = \"child_dataframe must be the relationship child dataframe\"\n    with pytest.raises(AssertionError, match=error_text):\n        DirectFeature(\n            IdentityFeature(diamond_es[\"regions\"].ww[\"name\"]),\n            \"stores\",\n            relationship=customer_to_region,\n        )\n\n    transaction_relationships = diamond_es.get_forward_relationships(\"transactions\")\n    transaction_to_store = next(\n        r for r in transaction_relationships if r.parent_dataframe.ww.name == \"stores\"\n    )\n    error_text = \"Base feature must be defined on the relationship parent dataframe\"\n    with pytest.raises(AssertionError, match=error_text):\n        DirectFeature(\n            
IdentityFeature(diamond_es[\"regions\"].ww[\"name\"]),\n            \"transactions\",\n            relationship=transaction_to_store,\n        )\n\n\ndef test_direct_with_multiple_possible_paths(games_es):\n    error_text = (\n        \"There are multiple relationships to the base dataframe. \"\n        \"You must specify a relationship.\"\n    )\n    with pytest.raises(RuntimeError, match=error_text):\n        DirectFeature(IdentityFeature(games_es[\"teams\"].ww[\"name\"]), \"games\")\n\n    # Does not raise if a relationship is specified.\n    relationship = next(\n        r\n        for r in games_es.get_forward_relationships(\"games\")\n        if r._child_column_name == \"home_team_id\"\n    )\n    feat = DirectFeature(\n        IdentityFeature(games_es[\"teams\"].ww[\"name\"]),\n        \"games\",\n        relationship=relationship,\n    )\n    assert feat.relationship_path_name() == \"teams[home_team_id]\"\n    assert feat.get_name() == \"teams[home_team_id].name\"\n\n\ndef test_direct_with_single_possible_path(es):\n    feat = DirectFeature(IdentityFeature(es[\"customers\"].ww[\"age\"]), \"sessions\")\n    assert feat.relationship_path_name() == \"customers\"\n    assert feat.get_name() == \"customers.age\"\n\n\ndef test_direct_with_no_path(diamond_es):\n    error_text = 'No relationship from \"regions\" to \"customers\" found.'\n    with pytest.raises(RuntimeError, match=error_text):\n        DirectFeature(IdentityFeature(diamond_es[\"customers\"].ww[\"name\"]), \"regions\")\n\n    error_text = 'No relationship from \"customers\" to \"customers\" found.'\n    with pytest.raises(RuntimeError, match=error_text):\n        DirectFeature(IdentityFeature(diamond_es[\"customers\"].ww[\"name\"]), \"customers\")\n\n\ndef test_serialization(es):\n    value = IdentityFeature(es[\"products\"].ww[\"rating\"])\n    direct = DirectFeature(value, \"log\")\n\n    log_to_products = next(\n        r\n        for r in es.get_forward_relationships(\"log\")\n        if 
r.parent_dataframe.ww.name == \"products\"\n    )\n    dictionary = {\n        \"name\": direct.get_name(),\n        \"base_feature\": value.unique_name(),\n        \"relationship\": log_to_products.to_dictionary(),\n    }\n\n    assert dictionary == direct.get_arguments()\n    assert direct == DirectFeature.from_dictionary(\n        dictionary,\n        es,\n        {value.unique_name(): value},\n        PrimitivesDeserializer(),\n    )\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/test_feature_base.py",
    "content": "import os.path\nimport re\n\nimport pytest\nfrom pympler.asizeof import asizeof\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime, Integer\n\nfrom featuretools import Feature, config, feature_base\nfrom featuretools.feature_base import IdentityFeature\nfrom featuretools.primitives import (\n    Count,\n    Diff,\n    Last,\n    Mode,\n    Negate,\n    NMostCommon,\n    NumUnique,\n    Sum,\n    TransformPrimitive,\n)\nfrom featuretools.synthesis.deep_feature_synthesis import can_stack_primitive_on_inputs\nfrom featuretools.tests.testing_utils import check_rename\n\n\ndef test_copy_features_does_not_copy_entityset(es):\n    agg = Feature(\n        es[\"log\"].ww[\"value\"],\n        parent_dataframe_name=\"sessions\",\n        primitive=Sum,\n    )\n    agg_where = Feature(\n        es[\"log\"].ww[\"value\"],\n        parent_dataframe_name=\"sessions\",\n        where=IdentityFeature(es[\"log\"].ww[\"value\"]) == 2,\n        primitive=Sum,\n    )\n    agg_use_previous = Feature(\n        es[\"log\"].ww[\"value\"],\n        parent_dataframe_name=\"sessions\",\n        use_previous=\"4 days\",\n        primitive=Sum,\n    )\n    agg_use_previous_where = Feature(\n        es[\"log\"].ww[\"value\"],\n        parent_dataframe_name=\"sessions\",\n        where=IdentityFeature(es[\"log\"].ww[\"value\"]) == 2,\n        use_previous=\"4 days\",\n        primitive=Sum,\n    )\n    features = [agg, agg_where, agg_use_previous, agg_use_previous_where]\n    in_memory_size = asizeof(locals())\n    copied = [f.copy() for f in features]\n    new_in_memory_size = asizeof(locals())\n    assert new_in_memory_size < 2 * in_memory_size\n\n\ndef test_get_dependencies(es):\n    f = Feature(es[\"log\"].ww[\"value\"])\n    agg1 = Feature(f, parent_dataframe_name=\"sessions\", primitive=Sum)\n    agg2 = Feature(agg1, parent_dataframe_name=\"customers\", primitive=Sum)\n    d1 = Feature(agg2, \"sessions\")\n    shallow = 
d1.get_dependencies(deep=False, ignored=None)\n    deep = d1.get_dependencies(deep=True, ignored=None)\n    ignored = set([agg1.unique_name()])\n    deep_ignored = d1.get_dependencies(deep=True, ignored=ignored)\n    assert [s.unique_name() for s in shallow] == [agg2.unique_name()]\n    assert [d.unique_name() for d in deep] == [\n        agg2.unique_name(),\n        agg1.unique_name(),\n        f.unique_name(),\n    ]\n    assert [d.unique_name() for d in deep_ignored] == [agg2.unique_name()]\n\n\ndef test_get_depth(es):\n    f = Feature(es[\"log\"].ww[\"value\"])\n    g = Feature(es[\"log\"].ww[\"value\"])\n    agg1 = Feature(f, parent_dataframe_name=\"sessions\", primitive=Last)\n    agg2 = Feature(agg1, parent_dataframe_name=\"customers\", primitive=Last)\n    d1 = Feature(agg2, \"sessions\")\n    d2 = Feature(d1, \"log\")\n    assert d2.get_depth() == 4\n    # Make sure this works if we pass in two of the same\n    # feature. This came up when user supplied duplicates\n    # in the seed_features of DFS.\n    assert d2.get_depth(stop_at=[f, g]) == 4\n    assert d2.get_depth(stop_at=[f, g, agg1]) == 3\n    assert d2.get_depth(stop_at=[f, g, agg1]) == 3\n    assert d2.get_depth(stop_at=[f, g, agg2]) == 2\n    assert d2.get_depth(stop_at=[f, g, d1]) == 1\n    assert d2.get_depth(stop_at=[f, g, d2]) == 0\n\n\ndef test_squared(es):\n    feature = Feature(es[\"log\"].ww[\"value\"])\n    squared = feature * feature\n    assert len(squared.base_features) == 2\n    assert (\n        squared.base_features[0].unique_name() == squared.base_features[1].unique_name()\n    )\n\n\ndef test_return_type_inference(es):\n    mode = Feature(\n        es[\"log\"].ww[\"priority_level\"],\n        parent_dataframe_name=\"customers\",\n        primitive=Mode,\n    )\n    assert (\n        mode.column_schema\n        == IdentityFeature(es[\"log\"].ww[\"priority_level\"]).column_schema\n    )\n\n\ndef test_return_type_inference_direct_feature(es):\n    mode = Feature(\n        
es[\"log\"].ww[\"priority_level\"],\n        parent_dataframe_name=\"customers\",\n        primitive=Mode,\n    )\n    mode_session = Feature(mode, \"sessions\")\n    assert (\n        mode_session.column_schema\n        == IdentityFeature(es[\"log\"].ww[\"priority_level\"]).column_schema\n    )\n\n\ndef test_return_type_inference_index(es):\n    last = Feature(\n        es[\"log\"].ww[\"id\"],\n        parent_dataframe_name=\"customers\",\n        primitive=Last,\n    )\n    assert \"index\" not in last.column_schema.semantic_tags\n    assert isinstance(last.column_schema.logical_type, Integer)\n\n\ndef test_return_type_inference_datetime_time_index(es):\n    last = Feature(\n        es[\"log\"].ww[\"datetime\"],\n        parent_dataframe_name=\"customers\",\n        primitive=Last,\n    )\n    assert isinstance(last.column_schema.logical_type, Datetime)\n\n\ndef test_return_type_inference_numeric_time_index(int_es):\n    last = Feature(\n        int_es[\"log\"].ww[\"datetime\"],\n        parent_dataframe_name=\"customers\",\n        primitive=Last,\n    )\n    assert \"numeric\" in last.column_schema.semantic_tags\n\n\ndef test_return_type_inference_id(es):\n    # direct features should keep foreign key tag\n    direct_id_feature = Feature(es[\"sessions\"].ww[\"customer_id\"], \"log\")\n    assert \"foreign_key\" in direct_id_feature.column_schema.semantic_tags\n\n    # aggregations of foreign key types should get converted\n    last_feat = Feature(\n        es[\"log\"].ww[\"session_id\"],\n        parent_dataframe_name=\"customers\",\n        primitive=Last,\n    )\n    assert \"foreign_key\" not in last_feat.column_schema.semantic_tags\n    assert isinstance(last_feat.column_schema.logical_type, Integer)\n\n    # also test direct feature of aggregation\n    last_direct = Feature(last_feat, \"sessions\")\n    assert \"foreign_key\" not in last_direct.column_schema.semantic_tags\n    assert isinstance(last_direct.column_schema.logical_type, Integer)\n\n\ndef 
test_set_data_path(es):\n    key = \"primitive_data_folder\"\n\n    # Don't change orig_path\n    orig_path = config.get(key)\n    new_path = \"/example/new/directory\"\n    filename = \"test.csv\"\n\n    # Test that default path works\n    sum_prim = Sum()\n    assert sum_prim.get_filepath(filename) == os.path.join(orig_path, filename)\n\n    # Test that new path works\n    config.set({key: new_path})\n    assert sum_prim.get_filepath(filename) == os.path.join(new_path, filename)\n\n    # Test that new path with trailing / works\n    new_path += \"/\"\n    config.set({key: new_path})\n    assert sum_prim.get_filepath(filename) == os.path.join(new_path, filename)\n\n    # Test that the path is correct on newly defined feature\n    sum_prim2 = Sum()\n    assert sum_prim2.get_filepath(filename) == os.path.join(new_path, filename)\n\n    # Ensure path was reset\n    config.set({key: orig_path})\n    assert config.get(key) == orig_path\n\n\ndef test_to_dictionary_direct(es):\n    actual = Feature(\n        IdentityFeature(es[\"sessions\"].ww[\"customer_id\"]),\n        \"log\",\n    ).to_dictionary()\n\n    expected = {\n        \"type\": \"DirectFeature\",\n        \"dependencies\": [\"sessions: customer_id\"],\n        \"arguments\": {\n            \"name\": \"sessions.customer_id\",\n            \"base_feature\": \"sessions: customer_id\",\n            \"relationship\": {\n                \"parent_dataframe_name\": \"sessions\",\n                \"child_dataframe_name\": \"log\",\n                \"parent_column_name\": \"id\",\n                \"child_column_name\": \"session_id\",\n            },\n        },\n    }\n\n    assert expected == actual\n\n\ndef test_to_dictionary_identity(es):\n    actual = Feature(es[\"sessions\"].ww[\"customer_id\"]).to_dictionary()\n\n    expected = {\n        \"type\": \"IdentityFeature\",\n        \"dependencies\": [],\n        \"arguments\": {\n            \"name\": \"customer_id\",\n            \"column_name\": 
\"customer_id\",\n            \"dataframe_name\": \"sessions\",\n        },\n    }\n\n    assert expected == actual\n\n\ndef test_to_dictionary_agg(es):\n    primitive = Sum()\n    actual = Feature(\n        es[\"customers\"].ww[\"age\"],\n        primitive=primitive,\n        parent_dataframe_name=\"cohorts\",\n    ).to_dictionary()\n\n    expected = {\n        \"type\": \"AggregationFeature\",\n        \"dependencies\": [\"customers: age\"],\n        \"arguments\": {\n            \"name\": \"SUM(customers.age)\",\n            \"base_features\": [\"customers: age\"],\n            \"relationship_path\": [\n                {\n                    \"parent_dataframe_name\": \"cohorts\",\n                    \"child_dataframe_name\": \"customers\",\n                    \"parent_column_name\": \"cohort\",\n                    \"child_column_name\": \"cohort\",\n                },\n            ],\n            \"primitive\": primitive,\n            \"where\": None,\n            \"use_previous\": None,\n        },\n    }\n\n    assert expected == actual\n\n\ndef test_to_dictionary_where(es):\n    primitive = Sum()\n    actual = Feature(\n        es[\"log\"].ww[\"value\"],\n        parent_dataframe_name=\"sessions\",\n        where=IdentityFeature(es[\"log\"].ww[\"value\"]) == 2,\n        primitive=primitive,\n    ).to_dictionary()\n\n    expected = {\n        \"type\": \"AggregationFeature\",\n        \"dependencies\": [\"log: value\", \"log: value = 2\"],\n        \"arguments\": {\n            \"name\": \"SUM(log.value WHERE value = 2)\",\n            \"base_features\": [\"log: value\"],\n            \"relationship_path\": [\n                {\n                    \"parent_dataframe_name\": \"sessions\",\n                    \"child_dataframe_name\": \"log\",\n                    \"parent_column_name\": \"id\",\n                    \"child_column_name\": \"session_id\",\n                },\n            ],\n            \"primitive\": primitive,\n            \"where\": 
\"log: value = 2\",\n            \"use_previous\": None,\n        },\n    }\n\n    assert expected == actual\n\n\ndef test_to_dictionary_trans(es):\n    primitive = Negate()\n    trans_feature = Feature(es[\"customers\"].ww[\"age\"], primitive=primitive)\n\n    expected = {\n        \"type\": \"TransformFeature\",\n        \"dependencies\": [\"customers: age\"],\n        \"arguments\": {\n            \"name\": \"-(age)\",\n            \"base_features\": [\"customers: age\"],\n            \"primitive\": primitive,\n        },\n    }\n\n    assert expected == trans_feature.to_dictionary()\n\n\ndef test_to_dictionary_groupby_trans(es):\n    primitive = Negate()\n    id_feat = Feature(es[\"log\"].ww[\"product_id\"])\n    groupby_feature = Feature(\n        es[\"log\"].ww[\"value\"],\n        primitive=primitive,\n        groupby=id_feat,\n    )\n\n    expected = {\n        \"type\": \"GroupByTransformFeature\",\n        \"dependencies\": [\"log: value\", \"log: product_id\"],\n        \"arguments\": {\n            \"name\": \"-(value) by product_id\",\n            \"base_features\": [\"log: value\"],\n            \"primitive\": primitive,\n            \"groupby\": \"log: product_id\",\n        },\n    }\n\n    assert expected == groupby_feature.to_dictionary()\n\n\ndef test_to_dictionary_multi_slice(es):\n    slice_feature = Feature(\n        es[\"log\"].ww[\"product_id\"],\n        parent_dataframe_name=\"customers\",\n        primitive=NMostCommon(n=2),\n    )[0]\n\n    expected = {\n        \"type\": \"FeatureOutputSlice\",\n        \"dependencies\": [\"customers: N_MOST_COMMON(log.product_id, n=2)\"],\n        \"arguments\": {\n            \"name\": \"N_MOST_COMMON(log.product_id, n=2)[0]\",\n            \"base_feature\": \"customers: N_MOST_COMMON(log.product_id, n=2)\",\n            \"n\": 0,\n        },\n    }\n\n    assert expected == slice_feature.to_dictionary()\n\n\ndef test_multi_output_base_error_agg(es):\n    three_common = NMostCommon(3)\n    tc = 
Feature(\n        es[\"log\"].ww[\"product_id\"],\n        parent_dataframe_name=\"sessions\",\n        primitive=three_common,\n    )\n    error_text = \"Cannot stack on whole multi-output feature.\"\n    with pytest.raises(ValueError, match=error_text):\n        Feature(tc, parent_dataframe_name=\"customers\", primitive=NumUnique)\n\n\ndef test_multi_output_base_error_trans(es):\n    class TestTime(TransformPrimitive):\n        name = \"test_time\"\n        input_types = [ColumnSchema(logical_type=Datetime)]\n        return_type = ColumnSchema(semantic_tags={\"numeric\"})\n        number_output_features = 6\n\n    tc = Feature(es[\"customers\"].ww[\"birthday\"], primitive=TestTime)\n\n    error_text = \"Cannot stack on whole multi-output feature.\"\n    with pytest.raises(ValueError, match=error_text):\n        Feature(tc, primitive=Diff)\n\n\ndef test_multi_output_attributes(es):\n    tc = Feature(\n        es[\"log\"].ww[\"product_id\"],\n        parent_dataframe_name=\"sessions\",\n        primitive=NMostCommon,\n    )\n\n    assert tc.generate_name() == \"N_MOST_COMMON(log.product_id)\"\n    assert tc.number_output_features == 3\n    assert tc.base_features == [\"<Feature: product_id>\"]\n\n    assert tc[0].generate_name() == \"N_MOST_COMMON(log.product_id)[0]\"\n    assert tc[0].number_output_features == 1\n    assert tc[0].base_features == [tc]\n    assert tc.relationship_path == tc[0].relationship_path\n\n\ndef test_multi_output_index_error(es):\n    error_text = \"can only access slice of multi-output feature\"\n    three_common = Feature(\n        es[\"log\"].ww[\"product_id\"],\n        parent_dataframe_name=\"sessions\",\n        primitive=NMostCommon,\n    )\n\n    with pytest.raises(AssertionError, match=error_text):\n        single = Feature(\n            es[\"log\"].ww[\"product_id\"],\n            parent_dataframe_name=\"sessions\",\n            primitive=NumUnique,\n        )\n        single[0]\n\n    error_text = \"Cannot get item from slice of 
multi output feature\"\n    with pytest.raises(ValueError, match=error_text):\n        three_common[0][0]\n\n    error_text = \"index is higher than the number of outputs\"\n    with pytest.raises(AssertionError, match=error_text):\n        three_common[10]\n\n\ndef test_rename(es):\n    feat = Feature(\n        es[\"log\"].ww[\"id\"],\n        parent_dataframe_name=\"sessions\",\n        primitive=Count,\n    )\n    new_name = \"session_test\"\n    new_names = [\"session_test\"]\n    check_rename(feat, new_name, new_names)\n\n\ndef test_rename_multioutput(es):\n    feat = Feature(\n        es[\"log\"].ww[\"product_id\"],\n        parent_dataframe_name=\"customers\",\n        primitive=NMostCommon(n=2),\n    )\n    new_name = \"session_test\"\n    new_names = [\"session_test[0]\", \"session_test[1]\"]\n    check_rename(feat, new_name, new_names)\n\n\ndef test_rename_featureoutputslice(es):\n    multi_output_feat = Feature(\n        es[\"log\"].ww[\"product_id\"],\n        parent_dataframe_name=\"customers\",\n        primitive=NMostCommon(n=2),\n    )\n    feat = feature_base.FeatureOutputSlice(multi_output_feat, 0)\n    new_name = \"session_test\"\n    new_names = [\"session_test\"]\n    check_rename(feat, new_name, new_names)\n\n\ndef test_set_feature_names_wrong_number_of_names(es):\n    feat = Feature(\n        es[\"log\"].ww[\"product_id\"],\n        parent_dataframe_name=\"customers\",\n        primitive=NMostCommon(n=2),\n    )\n    new_names = [\"col1\"]\n    error_msg = re.escape(\n        \"Number of names provided must match the number of output features: 1 name(s) provided, 2 expected.\",\n    )\n    with pytest.raises(ValueError, match=error_msg):\n        feat.set_feature_names(new_names)\n\n\ndef test_set_feature_names_not_unique(es):\n    feat = Feature(\n        es[\"log\"].ww[\"product_id\"],\n        parent_dataframe_name=\"customers\",\n        primitive=NMostCommon(n=2),\n    )\n    new_names = [\"col1\", \"col1\"]\n    error_msg = \"Provided 
output feature names must be unique.\"\n    with pytest.raises(ValueError, match=error_msg):\n        feat.set_feature_names(new_names)\n\n\ndef test_set_feature_names_error_on_single_output_feature(es):\n    feat = Feature(es[\"sessions\"].ww[\"device_name\"], \"log\")\n    new_names = [\"sessions_device\"]\n    error_msg = \"The set_feature_names can only be used on features that have more than one output column.\"\n    with pytest.raises(ValueError, match=error_msg):\n        feat.set_feature_names(new_names)\n\n\ndef test_set_feature_names_transform_feature(es):\n    class MultiCumulative(TransformPrimitive):\n        name = \"multi_cum_sum\"\n        input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n        return_type = ColumnSchema(semantic_tags={\"numeric\"})\n        number_output_features = 3\n\n    feat = Feature(es[\"log\"].ww[\"value\"], primitive=MultiCumulative)\n    new_names = [\"cumulative_sum\", \"cumulative_max\", \"cumulative_min\"]\n    feat.set_feature_names(new_names)\n    assert feat.get_feature_names() == new_names\n\n\ndef test_set_feature_names_aggregation_feature(es):\n    feat = Feature(\n        es[\"log\"].ww[\"product_id\"],\n        parent_dataframe_name=\"customers\",\n        primitive=NMostCommon(n=2),\n    )\n    new_names = [\"agg_col_1\", \"second_agg_col\"]\n    feat.set_feature_names(new_names)\n    assert feat.get_feature_names() == new_names\n\n\ndef test_renaming_resets_feature_output_names_to_default(es):\n    feat = Feature(\n        es[\"log\"].ww[\"product_id\"],\n        parent_dataframe_name=\"customers\",\n        primitive=NMostCommon(n=2),\n    )\n    new_names = [\"renamed1\", \"renamed2\"]\n    feat.set_feature_names(new_names)\n    assert feat.get_feature_names() == new_names\n\n    feat = feat.rename(\"new_feature_name\")\n    assert feat.get_feature_names() == [\"new_feature_name[0]\", \"new_feature_name[1]\"]\n\n\ndef test_base_of_and_stack_on_heuristic(es, test_aggregation_primitive):\n    child 
= Feature(\n        es[\"sessions\"].ww[\"id\"],\n        parent_dataframe_name=\"customers\",\n        primitive=Count,\n    )\n    test_aggregation_primitive.stack_on = []\n    child.primitive.base_of = []\n    assert not can_stack_primitive_on_inputs(test_aggregation_primitive(), [child])\n\n    test_aggregation_primitive.stack_on = []\n    child.primitive.base_of = None\n    assert can_stack_primitive_on_inputs(test_aggregation_primitive(), [child])\n\n    test_aggregation_primitive.stack_on = []\n    child.primitive.base_of = [test_aggregation_primitive]\n    assert can_stack_primitive_on_inputs(test_aggregation_primitive(), [child])\n\n    test_aggregation_primitive.stack_on = None\n    child.primitive.base_of = []\n    assert can_stack_primitive_on_inputs(test_aggregation_primitive(), [child])\n\n    test_aggregation_primitive.stack_on = None\n    child.primitive.base_of = None\n    assert can_stack_primitive_on_inputs(test_aggregation_primitive(), [child])\n\n    test_aggregation_primitive.stack_on = None\n    child.primitive.base_of = [test_aggregation_primitive]\n    assert can_stack_primitive_on_inputs(test_aggregation_primitive(), [child])\n\n    test_aggregation_primitive.stack_on = [type(child.primitive)]\n    child.primitive.base_of = []\n    assert can_stack_primitive_on_inputs(test_aggregation_primitive(), [child])\n\n    test_aggregation_primitive.stack_on = [type(child.primitive)]\n    child.primitive.base_of = None\n    assert can_stack_primitive_on_inputs(test_aggregation_primitive(), [child])\n\n    test_aggregation_primitive.stack_on = [type(child.primitive)]\n    child.primitive.base_of = [test_aggregation_primitive]\n    assert can_stack_primitive_on_inputs(test_aggregation_primitive(), [child])\n\n    test_aggregation_primitive.stack_on = None\n    child.primitive.base_of = None\n    child.primitive.base_of_exclude = [test_aggregation_primitive]\n    assert not can_stack_primitive_on_inputs(test_aggregation_primitive(), [child])\n\n    
test_aggregation_primitive.stack_on_exclude = [Count]\n    assert not can_stack_primitive_on_inputs(test_aggregation_primitive(), [child])\n\n    child.primitive.number_output_features = 2\n    test_aggregation_primitive.stack_on_exclude = []\n    test_aggregation_primitive.stack_on = []\n    child.primitive.base_of = []\n    assert not can_stack_primitive_on_inputs(test_aggregation_primitive(), [child])\n\n\ndef test_stack_on_self(es, test_transform_primitive):\n    # test stacks on self\n    child = Feature(\n        es[\"log\"].ww[\"value\"],\n        primitive=test_transform_primitive,\n    )\n    test_transform_primitive.stack_on = []\n    child.primitive.base_of = []\n    test_transform_primitive.stack_on_self = False\n    child.primitive.stack_on_self = False\n    assert not can_stack_primitive_on_inputs(test_transform_primitive(), [child])\n\n    test_transform_primitive.stack_on_self = True\n    assert can_stack_primitive_on_inputs(test_transform_primitive(), [child])\n\n    test_transform_primitive.stack_on = None\n    test_transform_primitive.stack_on_self = False\n    assert not can_stack_primitive_on_inputs(test_transform_primitive(), [child])\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/test_feature_descriptions.py",
    "content": "import json\nimport os\n\nimport pytest\nfrom woodwork.column_schema import ColumnSchema\n\nfrom featuretools import describe_feature\nfrom featuretools.feature_base import (\n    AggregationFeature,\n    DirectFeature,\n    GroupByTransformFeature,\n    IdentityFeature,\n    TransformFeature,\n)\nfrom featuretools.primitives import (\n    Absolute,\n    AggregationPrimitive,\n    CumMean,\n    EqualScalar,\n    Mean,\n    Mode,\n    NMostCommon,\n    NumUnique,\n    PercentTrue,\n    Sum,\n    TransformPrimitive,\n)\n\n\ndef test_identity_description(es):\n    feature = IdentityFeature(es[\"log\"].ww[\"session_id\"])\n    description = 'The \"session_id\".'\n\n    assert describe_feature(feature) == description\n\n\ndef test_direct_description(es):\n    feature = DirectFeature(\n        IdentityFeature(es[\"customers\"].ww[\"loves_ice_cream\"]),\n        \"sessions\",\n    )\n    description = (\n        'The \"loves_ice_cream\" for the instance of \"customers\" associated '\n        'with this instance of \"sessions\".'\n    )\n    assert describe_feature(feature) == description\n\n    deep_direct = DirectFeature(feature, \"log\")\n    deep_description = (\n        'The \"loves_ice_cream\" for the instance of \"customers\" '\n        'associated with the instance of \"sessions\" associated with '\n        'this instance of \"log\".'\n    )\n    assert describe_feature(deep_direct) == deep_description\n\n    agg = AggregationFeature(\n        IdentityFeature(es[\"log\"].ww[\"purchased\"]),\n        \"sessions\",\n        PercentTrue,\n    )\n    complicated_direct = DirectFeature(agg, \"log\")\n    agg_on_direct = AggregationFeature(complicated_direct, \"products\", Mean)\n\n    complicated_description = (\n        \"The average of the percentage of true values in \"\n        'the \"purchased\" of all instances of \"log\" for each \"id\" in \"sessions\" for '\n        'the instance of \"sessions\" associated with this instance of \"log\" of all '\n 
       'instances of \"log\" for each \"id\" in \"products\".'\n    )\n    assert describe_feature(agg_on_direct) == complicated_description\n\n\ndef test_transform_description(es):\n    feature = TransformFeature(IdentityFeature(es[\"log\"].ww[\"value\"]), Absolute)\n    description = 'The absolute value of the \"value\".'\n    assert describe_feature(feature) == description\n\n\ndef test_groupby_transform_description(es):\n    feature = GroupByTransformFeature(\n        IdentityFeature(es[\"log\"].ww[\"value\"]),\n        CumMean,\n        IdentityFeature(es[\"log\"].ww[\"session_id\"]),\n    )\n    description = 'The cumulative mean of the \"value\" for each \"session_id\".'\n\n    assert describe_feature(feature) == description\n\n\ndef test_aggregation_description(es):\n    feature = AggregationFeature(\n        IdentityFeature(es[\"log\"].ww[\"value\"]),\n        \"sessions\",\n        Mean,\n    )\n    description = 'The average of the \"value\" of all instances of \"log\" for each \"id\" in \"sessions\".'\n    assert describe_feature(feature) == description\n\n    stacked_agg = AggregationFeature(feature, \"customers\", Sum)\n    stacked_description = (\n        'The sum of t{} of all instances of \"sessions\" for each \"id\" '\n        'in \"customers\".'.format(description[1:-1])\n    )\n    assert describe_feature(stacked_agg) == stacked_description\n\n\ndef test_aggregation_description_where(es):\n    where_feature = TransformFeature(\n        IdentityFeature(es[\"log\"].ww[\"countrycode\"]),\n        EqualScalar(\"US\"),\n    )\n    feature = AggregationFeature(\n        IdentityFeature(es[\"log\"].ww[\"value\"]),\n        \"sessions\",\n        Mean,\n        where=where_feature,\n    )\n    description = (\n        'The average of the \"value\" of all instances of \"log\" where the '\n        '\"countrycode\" is US for each \"id\" in \"sessions\".'\n    )\n\n    assert describe_feature(feature) == description\n\n\ndef 
test_aggregation_description_use_previous(es):\n    feature = AggregationFeature(\n        IdentityFeature(es[\"log\"].ww[\"value\"]),\n        \"sessions\",\n        Mean,\n        use_previous=\"5d\",\n    )\n    description = 'The average of the \"value\" of the previous 5 days of \"log\" for each \"id\" in \"sessions\".'\n\n    assert describe_feature(feature) == description\n\n\ndef test_multioutput_description(es):\n    n_most_common = NMostCommon(2)\n    n_most_common_feature = AggregationFeature(\n        IdentityFeature(es[\"log\"].ww[\"zipcode\"]),\n        \"sessions\",\n        n_most_common,\n    )\n    first_most_common_slice = n_most_common_feature[0]\n    second_most_common_slice = n_most_common_feature[1]\n\n    n_most_common_base = 'The 2 most common values of the \"zipcode\" of all instances of \"log\" for each \"id\" in \"sessions\".'\n    n_most_common_first = (\n        'The most common value of the \"zipcode\" of all instances of \"log\" '\n        'for each \"id\" in \"sessions\".'\n    )\n    n_most_common_second = (\n        'The 2nd most common value of the \"zipcode\" of all instances of '\n        '\"log\" for each \"id\" in \"sessions\".'\n    )\n\n    assert describe_feature(n_most_common_feature) == n_most_common_base\n    assert describe_feature(first_most_common_slice) == n_most_common_first\n    assert describe_feature(second_most_common_slice) == n_most_common_second\n\n    class CustomMultiOutput(TransformPrimitive):\n        name = \"custom_multioutput\"\n        input_types = [ColumnSchema(semantic_tags={\"category\"})]\n        return_type = ColumnSchema(semantic_tags={\"category\"})\n\n        number_output_features = 4\n\n    custom_feat = TransformFeature(\n        IdentityFeature(es[\"log\"].ww[\"zipcode\"]),\n        CustomMultiOutput,\n    )\n\n    generic_base = 'The result of applying CUSTOM_MULTIOUTPUT to the \"zipcode\".'\n    generic_first = 'The 1st output from applying CUSTOM_MULTIOUTPUT to the \"zipcode\".'\n    
generic_second = 'The 2nd output from applying CUSTOM_MULTIOUTPUT to the \"zipcode\".'\n\n    assert describe_feature(custom_feat) == generic_base\n    assert describe_feature(custom_feat[0]) == generic_first\n    assert describe_feature(custom_feat[1]) == generic_second\n\n    CustomMultiOutput.description_template = [\n        \"the multioutput of {}\",\n        \"the {nth_slice} multioutput part of {}\",\n    ]\n    template_base = 'The multioutput of the \"zipcode\".'\n    template_first_slice = 'The 1st multioutput part of the \"zipcode\".'\n    template_second_slice = 'The 2nd multioutput part of the \"zipcode\".'\n    template_third_slice = 'The 3rd multioutput part of the \"zipcode\".'\n    template_fourth_slice = 'The 4th multioutput part of the \"zipcode\".'\n    assert describe_feature(custom_feat) == template_base\n    assert describe_feature(custom_feat[0]) == template_first_slice\n    assert describe_feature(custom_feat[1]) == template_second_slice\n    assert describe_feature(custom_feat[2]) == template_third_slice\n    assert describe_feature(custom_feat[3]) == template_fourth_slice\n\n    CustomMultiOutput.description_template = [\n        \"the multioutput of {}\",\n        \"the primary multioutput part of {}\",\n        \"the secondary multioutput part of {}\",\n    ]\n    custom_base = 'The multioutput of the \"zipcode\".'\n    custom_first_slice = 'The primary multioutput part of the \"zipcode\".'\n    custom_second_slice = 'The secondary multioutput part of the \"zipcode\".'\n    bad_slice_error = \"Slice out of range of template\"\n    assert describe_feature(custom_feat) == custom_base\n    assert describe_feature(custom_feat[0]) == custom_first_slice\n    assert describe_feature(custom_feat[1]) == custom_second_slice\n    with pytest.raises(IndexError, match=bad_slice_error):\n        describe_feature(custom_feat[2])\n\n\ndef test_generic_description(es):\n    class NoName(TransformPrimitive):\n        input_types = 
[ColumnSchema(semantic_tags={\"category\"})]\n        output_type = ColumnSchema(semantic_tags={\"category\"})\n\n        def generate_name(self, base_feature_names):\n            return \"%s(%s%s)\" % (\n                \"NO_NAME\",\n                \", \".join(base_feature_names),\n                self.get_args_string(),\n            )\n\n    class CustomAgg(AggregationPrimitive):\n        name = \"custom_aggregation\"\n        input_types = [ColumnSchema(semantic_tags={\"category\"})]\n        output_type = ColumnSchema(semantic_tags={\"category\"})\n\n    class CustomTrans(TransformPrimitive):\n        name = \"custom_transform\"\n        input_types = [ColumnSchema(semantic_tags={\"category\"})]\n        output_type = ColumnSchema(semantic_tags={\"category\"})\n\n    no_name = TransformFeature(IdentityFeature(es[\"log\"].ww[\"zipcode\"]), NoName)\n    no_name_description = 'The result of applying NoName to the \"zipcode\".'\n    assert describe_feature(no_name) == no_name_description\n\n    custom_agg = AggregationFeature(\n        IdentityFeature(es[\"log\"].ww[\"zipcode\"]),\n        \"customers\",\n        CustomAgg,\n    )\n    custom_agg_description = 'The result of applying CUSTOM_AGGREGATION to the \"zipcode\" of all instances of \"log\" for each \"id\" in \"customers\".'\n    assert describe_feature(custom_agg) == custom_agg_description\n\n    custom_trans = TransformFeature(\n        IdentityFeature(es[\"log\"].ww[\"zipcode\"]),\n        CustomTrans,\n    )\n    custom_trans_description = (\n        'The result of applying CUSTOM_TRANSFORM to the \"zipcode\".'\n    )\n    assert describe_feature(custom_trans) == custom_trans_description\n\n\ndef test_column_description(es):\n    column_description = \"the name of the device used for each session\"\n    es[\"sessions\"].ww.columns[\"device_name\"].description = column_description\n    identity_feat = IdentityFeature(es[\"sessions\"].ww[\"device_name\"])\n    assert (\n        
describe_feature(identity_feat)\n        == column_description[0].upper() + column_description[1:] + \".\"\n    )\n\n\ndef test_metadata(es, tmp_path):\n    identity_feature_descriptions = {\n        \"sessions: device_name\": \"the name of the device used for each session\",\n        \"customers: id\": \"the customer's id\",\n    }\n    agg_feat = AggregationFeature(\n        IdentityFeature(es[\"sessions\"].ww[\"device_name\"]),\n        \"customers\",\n        NumUnique,\n    )\n    agg_description = (\n        \"The number of unique elements in the name of the device used for each \"\n        'session of all instances of \"sessions\" for each customer\\'s id.'\n    )\n    assert (\n        describe_feature(agg_feat, feature_descriptions=identity_feature_descriptions)\n        == agg_description\n    )\n\n    transform_feat = GroupByTransformFeature(\n        IdentityFeature(es[\"log\"].ww[\"value\"]),\n        CumMean,\n        IdentityFeature(es[\"log\"].ww[\"session_id\"]),\n    )\n    transform_description = 'The running average of the \"value\" for each \"session_id\".'\n    primitive_templates = {\"cum_mean\": \"the running average of {}\"}\n    assert (\n        describe_feature(transform_feat, primitive_templates=primitive_templates)\n        == transform_description\n    )\n\n    custom_agg = AggregationFeature(\n        IdentityFeature(es[\"log\"].ww[\"zipcode\"]),\n        \"sessions\",\n        Mode,\n    )\n    auto_description = 'The most frequently occurring value of the \"zipcode\" of all instances of \"log\" for each \"id\" in \"sessions\".'\n    custom_agg_description = \"the most frequently used zipcode\"\n    custom_feature_description = (\n        custom_agg_description[0].upper() + custom_agg_description[1:] + \".\"\n    )\n    feature_description_dict = {\"sessions: MODE(log.zipcode)\": custom_agg_description}\n    assert describe_feature(custom_agg) == auto_description\n    assert (\n        describe_feature(custom_agg, 
feature_descriptions=feature_description_dict)\n        == custom_feature_description\n    )\n\n    metadata = {\n        \"feature_descriptions\": {\n            **identity_feature_descriptions,\n            **feature_description_dict,\n        },\n        \"primitive_templates\": primitive_templates,\n    }\n    metadata_path = os.path.join(tmp_path, \"description_metadata.json\")\n    with open(metadata_path, \"w\") as f:\n        json.dump(metadata, f)\n    assert describe_feature(agg_feat, metadata_file=metadata_path) == agg_description\n    assert (\n        describe_feature(transform_feat, metadata_file=metadata_path)\n        == transform_description\n    )\n    assert (\n        describe_feature(custom_agg, metadata_file=metadata_path)\n        == custom_feature_description\n    )\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/test_feature_serialization.py",
    "content": "import os\n\nimport boto3\nimport pandas as pd\nimport pytest\nfrom pympler.asizeof import asizeof\nfrom smart_open import open\nfrom woodwork.column_schema import ColumnSchema\n\nfrom featuretools import (\n    AggregationFeature,\n    DirectFeature,\n    EntitySet,\n    Feature,\n    GroupByTransformFeature,\n    IdentityFeature,\n    TransformFeature,\n    dfs,\n    feature_base,\n    load_features,\n    primitives,\n    save_features,\n)\nfrom featuretools.feature_base import FeatureOutputSlice\nfrom featuretools.feature_base.cache import feature_cache\nfrom featuretools.feature_base.features_deserializer import FeaturesDeserializer\nfrom featuretools.feature_base.features_serializer import FeaturesSerializer\nfrom featuretools.primitives import (\n    Count,\n    CumSum,\n    Day,\n    DistanceToHoliday,\n    Haversine,\n    IsIn,\n    Max,\n    Mean,\n    Min,\n    Mode,\n    Month,\n    MultiplyNumericScalar,\n    Negate,\n    NMostCommon,\n    NumberOfCommonWords,\n    NumCharacters,\n    NumUnique,\n    NumWords,\n    PercentTrue,\n    Skew,\n    Std,\n    Sum,\n    TransformPrimitive,\n    Weekday,\n    Year,\n)\nfrom featuretools.primitives.base import AggregationPrimitive\nfrom featuretools.tests.testing_utils import check_names\nfrom featuretools.version import ENTITYSET_SCHEMA_VERSION, FEATURES_SCHEMA_VERSION\n\nBUCKET_NAME = \"test-bucket\"\nWRITE_KEY_NAME = \"test-key\"\nTEST_S3_URL = \"s3://{}/{}\".format(BUCKET_NAME, WRITE_KEY_NAME)\nTEST_FILE = \"test_feature_serialization_feature_schema_{}_entityset_schema_{}_2022_12_28.json\".format(\n    FEATURES_SCHEMA_VERSION,\n    ENTITYSET_SCHEMA_VERSION,\n)\nS3_URL = \"s3://featuretools-static/\" + TEST_FILE\nURL = \"https://featuretools-static.s3.amazonaws.com/\" + TEST_FILE\nTEST_CONFIG = \"CheckConfigPassesOn\"\nTEST_KEY = \"test_access_key_features\"\n\n\n@pytest.fixture(autouse=True)\ndef reset_dfs_cache():\n    feature_cache.enabled = False\n    feature_cache.clear_all()\n\n\ndef 
assert_features(original, deserialized):\n    for feat_1, feat_2 in zip(original, deserialized):\n        assert feat_1.unique_name() == feat_2.unique_name()\n        assert feat_1.entityset == feat_2.entityset\n\n\ndef pickle_features_test_helper(es_size, features_original, dir_path):\n    filepath = os.path.join(dir_path, \"test_feature\")\n\n    save_features(features_original, filepath)\n    features_deserializedA = load_features(filepath)\n    assert os.path.getsize(filepath) < es_size\n    os.remove(filepath)\n\n    with open(filepath, \"w\") as f:\n        save_features(features_original, f)\n    features_deserializedB = load_features(open(filepath))\n    assert os.path.getsize(filepath) < es_size\n    os.remove(filepath)\n\n    features = save_features(features_original)\n    features_deserializedC = load_features(features)\n    assert asizeof(features) < es_size\n\n    features_deserialized_options = [\n        features_deserializedA,\n        features_deserializedB,\n        features_deserializedC,\n    ]\n    for features_deserialized in features_deserialized_options:\n        assert_features(features_original, features_deserialized)\n\n\ndef test_pickle_features(es, tmp_path):\n    features_original = dfs(\n        target_dataframe_name=\"sessions\",\n        entityset=es,\n        features_only=True,\n    )\n    pickle_features_test_helper(asizeof(es), features_original, str(tmp_path))\n\n\ndef test_pickle_features_with_custom_primitive(es, tmp_path):\n    class NewMax(AggregationPrimitive):\n        name = \"new_max\"\n        input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n        return_type = ColumnSchema(semantic_tags={\"numeric\"})\n\n    features_original = dfs(\n        target_dataframe_name=\"sessions\",\n        entityset=es,\n        agg_primitives=[\"Last\", \"Mean\", NewMax],\n        features_only=True,\n    )\n\n    assert any([isinstance(feat.primitive, NewMax) for feat in features_original])\n    
pickle_features_test_helper(asizeof(es), features_original, str(tmp_path))\n\n\ndef test_serialized_renamed_features(es):\n    def serialize_name_unchanged(original):\n        new_name = \"MyFeature\"\n        original_names = original.get_feature_names()\n        renamed = original.rename(new_name)\n        new_names = (\n            [new_name]\n            if len(original_names) == 1\n            else [new_name + \"[{}]\".format(i) for i in range(len(original_names))]\n        )\n        check_names(renamed, new_name, new_names)\n\n        serializer = FeaturesSerializer([renamed])\n        serialized = serializer.to_dict()\n\n        deserializer = FeaturesDeserializer(serialized)\n        deserialized = deserializer.to_list()[0]\n        check_names(deserialized, new_name, new_names)\n\n    identity_original = IdentityFeature(es[\"log\"].ww[\"value\"])\n    assert identity_original.get_name() == \"value\"\n\n    value = IdentityFeature(es[\"log\"].ww[\"value\"])\n\n    primitive = primitives.Max()\n    agg_original = AggregationFeature(value, \"customers\", primitive)\n    assert agg_original.get_name() == \"MAX(log.value)\"\n\n    direct_original = DirectFeature(\n        IdentityFeature(es[\"customers\"].ww[\"age\"]),\n        \"sessions\",\n    )\n    assert direct_original.get_name() == \"customers.age\"\n\n    primitive = primitives.MultiplyNumericScalar(value=2)\n    transform_original = TransformFeature(value, primitive)\n    assert transform_original.get_name() == \"value * 2\"\n\n    zipcode = IdentityFeature(es[\"log\"].ww[\"zipcode\"])\n    primitive = CumSum()\n    groupby_original = feature_base.GroupByTransformFeature(value, primitive, zipcode)\n    assert groupby_original.get_name() == \"CUM_SUM(value) by zipcode\"\n\n    multioutput_original = Feature(\n        es[\"log\"].ww[\"product_id\"],\n        parent_dataframe_name=\"customers\",\n        primitive=NMostCommon(n=2),\n    )\n    assert multioutput_original.get_name() == 
\"N_MOST_COMMON(log.product_id, n=2)\"\n\n    featureslice_original = feature_base.FeatureOutputSlice(multioutput_original, 0)\n    assert featureslice_original.get_name() == \"N_MOST_COMMON(log.product_id, n=2)[0]\"\n\n    feature_type_list = [\n        identity_original,\n        agg_original,\n        direct_original,\n        transform_original,\n        groupby_original,\n        multioutput_original,\n        featureslice_original,\n    ]\n\n    for feature_type in feature_type_list:\n        serialize_name_unchanged(feature_type)\n\n\n@pytest.fixture\ndef s3_client():\n    _environ = os.environ.copy()\n    from moto import mock_aws\n\n    with mock_aws():\n        s3 = boto3.resource(\"s3\")\n        yield s3\n    os.environ.clear()\n    os.environ.update(_environ)\n\n\n@pytest.fixture\ndef s3_bucket(s3_client, region=\"us-east-2\"):\n    location = {\"LocationConstraint\": region}\n    s3_client.create_bucket(\n        Bucket=BUCKET_NAME,\n        ACL=\"public-read-write\",\n        CreateBucketConfiguration=location,\n    )\n    s3_bucket = s3_client.Bucket(BUCKET_NAME)\n    yield s3_bucket\n\n\ndef test_serialize_features_mock_s3(es, s3_client, s3_bucket):\n    features_original = dfs(\n        target_dataframe_name=\"sessions\",\n        entityset=es,\n        features_only=True,\n    )\n\n    save_features(features_original, TEST_S3_URL)\n\n    obj = list(s3_bucket.objects.all())[0].key\n    s3_client.ObjectAcl(BUCKET_NAME, obj).put(ACL=\"public-read-write\")\n\n    features_deserialized = load_features(TEST_S3_URL)\n    assert_features(features_original, features_deserialized)\n\n\ndef test_serialize_features_mock_anon_s3(es, s3_client, s3_bucket):\n    features_original = dfs(\n        target_dataframe_name=\"sessions\",\n        entityset=es,\n        features_only=True,\n    )\n\n    save_features(features_original, TEST_S3_URL, profile_name=False)\n\n    obj = list(s3_bucket.objects.all())[0].key\n    s3_client.ObjectAcl(BUCKET_NAME, 
obj).put(ACL=\"public-read-write\")\n\n    features_deserialized = load_features(TEST_S3_URL, profile_name=False)\n    assert_features(features_original, features_deserialized)\n\n\n@pytest.mark.parametrize(\"profile_name\", [\"test\", False])\ndef test_s3_test_profile(es, s3_client, s3_bucket, setup_test_profile, profile_name):\n    features_original = dfs(\n        target_dataframe_name=\"sessions\",\n        entityset=es,\n        features_only=True,\n    )\n\n    save_features(features_original, TEST_S3_URL, profile_name=\"test\")\n\n    obj = list(s3_bucket.objects.all())[0].key\n    s3_client.ObjectAcl(BUCKET_NAME, obj).put(ACL=\"public-read-write\")\n\n    features_deserialized = load_features(TEST_S3_URL, profile_name=profile_name)\n    assert_features(features_original, features_deserialized)\n\n\n@pytest.mark.parametrize(\"url,profile_name\", [(S3_URL, False), (URL, None)])\ndef test_deserialize_features_s3(es, url, profile_name):\n    agg_primitives = [\n        Sum,\n        Std,\n        Max,\n        Skew,\n        Min,\n        Mean,\n        Count,\n        PercentTrue,\n        NumUnique,\n        Mode,\n    ]\n\n    trans_primitives = [Day, Year, Month, Weekday, Haversine, NumWords, NumCharacters]\n\n    features_original = dfs(\n        target_dataframe_name=\"sessions\",\n        entityset=es,\n        features_only=True,\n        agg_primitives=agg_primitives,\n        trans_primitives=trans_primitives,\n    )\n\n    features_deserialized = load_features(url, profile_name=profile_name)\n    assert_features(features_original, features_deserialized)\n\n\ndef test_serialize_url(es):\n    features_original = dfs(\n        target_dataframe_name=\"sessions\",\n        entityset=es,\n        features_only=True,\n    )\n    error_text = \"Writing to URLs is not supported\"\n    with pytest.raises(ValueError, match=error_text):\n        save_features(features_original, URL)\n\n\ndef test_custom_feature_names_retained_during_serialization(es, 
tmp_path):\n    class MultiCumulative(TransformPrimitive):\n        name = \"multi_cum_sum\"\n        input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n        return_type = ColumnSchema(semantic_tags={\"numeric\"})\n        number_output_features = 3\n\n    multi_output_trans_feat = Feature(\n        es[\"log\"].ww[\"value\"],\n        primitive=MultiCumulative,\n    )\n    groupby_trans_feat = GroupByTransformFeature(\n        es[\"log\"].ww[\"value\"],\n        primitive=MultiCumulative,\n        groupby=es[\"log\"].ww[\"product_id\"],\n    )\n    multi_output_agg_feat = Feature(\n        es[\"log\"].ww[\"product_id\"],\n        parent_dataframe_name=\"customers\",\n        primitive=NMostCommon(n=2),\n    )\n    slice = FeatureOutputSlice(multi_output_trans_feat, 1)\n    stacked_feat = Feature(slice, primitive=Negate)\n\n    trans_names = [\"cumulative_sum\", \"cumulative_max\", \"cumulative_min\"]\n    multi_output_trans_feat.set_feature_names(trans_names)\n    groupby_trans_names = [\"grouped_sum\", \"grouped_max\", \"grouped_min\"]\n    groupby_trans_feat.set_feature_names(groupby_trans_names)\n    agg_names = [\"first_most_common\", \"second_most_common\"]\n    multi_output_agg_feat.set_feature_names(agg_names)\n\n    features = [\n        multi_output_trans_feat,\n        multi_output_agg_feat,\n        groupby_trans_feat,\n        stacked_feat,\n    ]\n    file = os.path.join(tmp_path, \"features.json\")\n    save_features(features, file)\n    deserialized_features = load_features(file)\n\n    new_trans, new_agg, new_groupby, new_stacked = deserialized_features\n    assert new_trans.get_feature_names() == trans_names\n    assert new_agg.get_feature_names() == agg_names\n    assert new_groupby.get_feature_names() == groupby_trans_names\n    assert new_stacked.get_feature_names() == [\"-(cumulative_max)\"]\n\n\ndef test_deserializer_uses_common_primitive_instances_no_args(es, tmp_path):\n    features = dfs(\n        entityset=es,\n        
target_dataframe_name=\"products\",\n        features_only=True,\n        agg_primitives=[\"sum\"],\n        trans_primitives=[\"is_null\"],\n    )\n\n    is_null_features = [f for f in features if f.primitive.name == \"is_null\"]\n    sum_features = [f for f in features if f.primitive.name == \"sum\"]\n\n    # Make sure we have multiple features of each type\n    assert len(is_null_features) > 1\n    assert len(sum_features) > 1\n\n    # DFS should use the same primitive instance for all features that share a primitive\n    is_null_primitive = is_null_features[0].primitive\n    sum_primitive = sum_features[0].primitive\n    assert all([f.primitive is is_null_primitive for f in is_null_features])\n    assert all([f.primitive is sum_primitive for f in sum_features])\n\n    file = os.path.join(tmp_path, \"features.json\")\n    save_features(features, file)\n    deserialized_features = load_features(file)\n    new_is_null_features = [\n        f for f in deserialized_features if f.primitive.name == \"is_null\"\n    ]\n    new_sum_features = [f for f in deserialized_features if f.primitive.name == \"sum\"]\n\n    # After deserialization all features that share a primitive should use the same primitive instance\n    new_is_null_primitive = new_is_null_features[0].primitive\n    new_sum_primitive = new_sum_features[0].primitive\n    assert all([f.primitive is new_is_null_primitive for f in new_is_null_features])\n    assert all([f.primitive is new_sum_primitive for f in new_sum_features])\n\n\ndef test_deserializer_uses_common_primitive_instances_with_args(es, tmp_path):\n    # Single argument\n    scalar1 = MultiplyNumericScalar(value=1)\n    scalar5 = MultiplyNumericScalar(value=5)\n    features = dfs(\n        entityset=es,\n        target_dataframe_name=\"products\",\n        features_only=True,\n        agg_primitives=[\"sum\"],\n        trans_primitives=[scalar1, scalar5],\n    )\n\n    scalar1_features = [\n        f\n        for f in features\n        if 
f.primitive.name == \"multiply_numeric_scalar\" and \" * 1\" in f.get_name()\n    ]\n    scalar5_features = [\n        f\n        for f in features\n        if f.primitive.name == \"multiply_numeric_scalar\" and \" * 5\" in f.get_name()\n    ]\n\n    # Make sure we have multiple features of each type\n    assert len(scalar1_features) > 1\n    assert len(scalar5_features) > 1\n\n    # DFS should use the passed-in primitive instance for all features\n    assert all([f.primitive is scalar1 for f in scalar1_features])\n    assert all([f.primitive is scalar5 for f in scalar5_features])\n\n    file = os.path.join(tmp_path, \"features.json\")\n    save_features(features, file)\n    deserialized_features = load_features(file)\n\n    new_scalar1_features = [\n        f\n        for f in deserialized_features\n        if f.primitive.name == \"multiply_numeric_scalar\" and \" * 1\" in f.get_name()\n    ]\n    new_scalar5_features = [\n        f\n        for f in deserialized_features\n        if f.primitive.name == \"multiply_numeric_scalar\" and \" * 5\" in f.get_name()\n    ]\n\n    # After deserialization all features that share a primitive should use the same primitive instance\n    new_scalar1_primitive = new_scalar1_features[0].primitive\n    new_scalar5_primitive = new_scalar5_features[0].primitive\n    assert all([f.primitive is new_scalar1_primitive for f in new_scalar1_features])\n    assert all([f.primitive is new_scalar5_primitive for f in new_scalar5_features])\n    assert new_scalar1_primitive.value == 1\n    assert new_scalar5_primitive.value == 5\n\n    # Test primitive with multiple args\n    distance_to_holiday = DistanceToHoliday(\n        holiday=\"Canada Day\",\n        country=\"Canada\",\n    )\n    features = dfs(\n        entityset=es,\n        target_dataframe_name=\"customers\",\n        features_only=True,\n        agg_primitives=[],\n        trans_primitives=[distance_to_holiday],\n    )\n\n    distance_features = [\n        f for f in 
features if f.primitive.name == \"distance_to_holiday\"\n    ]\n\n    assert len(distance_features) > 1\n\n    # DFS should use the passed-in primitive instance for all features\n    assert all([f.primitive is distance_to_holiday for f in distance_features])\n\n    file = os.path.join(tmp_path, \"distance_features.json\")\n    save_features(distance_features, file)\n    new_distance_features = load_features(file)\n\n    # After deserialization all features that share a primitive should use the same primitive instance\n    new_distance_primitive = new_distance_features[0].primitive\n    assert all(\n        [f.primitive is new_distance_primitive for f in new_distance_features],\n    )\n    assert new_distance_primitive.holiday == \"Canada Day\"\n    assert new_distance_primitive.country == \"Canada\"\n\n    # Test primitive with list arg\n    is_in = IsIn(list_of_outputs=[5, True, \"coke zero\"])\n    features = dfs(\n        entityset=es,\n        target_dataframe_name=\"customers\",\n        features_only=True,\n        agg_primitives=[],\n        trans_primitives=[is_in],\n    )\n\n    is_in_features = [f for f in features if f.primitive.name == \"isin\"]\n    assert len(is_in_features) > 1\n\n    # DFS should use the passed-in primitive instance for all features\n    assert all([f.primitive is is_in for f in is_in_features])\n\n    file = os.path.join(tmp_path, \"is_in_features.json\")\n    save_features(is_in_features, file)\n    new_is_in_features = load_features(file)\n\n    # After deserialization all features that share a primitive should use the same primitive instance\n    new_is_in_primitive = new_is_in_features[0].primitive\n    assert all([f.primitive is new_is_in_primitive for f in new_is_in_features])\n    assert new_is_in_primitive.list_of_outputs == [5, True, \"coke zero\"]\n\n\ndef test_can_serialize_word_set_for_number_of_common_words_feature(es):\n    # The word_set argument is passed in as a set, which is not JSON-serializable.\n    
# This test checks internal logic that converts the set to a list so it can be serialized\n    common_word_set = {\"hello\", \"my\"}\n    df = pd.DataFrame({\"text\": [\"hello my name is hi\"]})\n    es = EntitySet()\n    es.add_dataframe(dataframe_name=\"df\", index=\"idx\", dataframe=df, make_index=True)\n\n    num_common_words = NumberOfCommonWords(word_set=common_word_set)\n    fm, fd = dfs(\n        entityset=es,\n        target_dataframe_name=\"df\",\n        trans_primitives=[num_common_words],\n    )\n\n    feat = fd[-1]\n    save_features([feat])\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/test_feature_utils.py",
    "content": "from woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Double, Integer\n\nfrom featuretools.feature_base.utils import is_valid_input\n\n\ndef test_is_valid_input():\n    assert is_valid_input(candidate=ColumnSchema(), template=ColumnSchema())\n\n    assert is_valid_input(\n        candidate=ColumnSchema(logical_type=Integer, semantic_tags={\"index\"}),\n        template=ColumnSchema(logical_type=Integer, semantic_tags={\"index\"}),\n    )\n\n    assert is_valid_input(\n        candidate=ColumnSchema(\n            logical_type=Integer,\n            semantic_tags={\"index\", \"numeric\"},\n        ),\n        template=ColumnSchema(semantic_tags={\"index\"}),\n    )\n\n    assert is_valid_input(\n        candidate=ColumnSchema(semantic_tags={\"index\"}),\n        template=ColumnSchema(semantic_tags={\"index\"}),\n    )\n\n    assert is_valid_input(\n        candidate=ColumnSchema(logical_type=Integer, semantic_tags={\"index\"}),\n        template=ColumnSchema(),\n    )\n\n    assert is_valid_input(\n        candidate=ColumnSchema(logical_type=Integer),\n        template=ColumnSchema(logical_type=Integer),\n    )\n\n    assert is_valid_input(\n        candidate=ColumnSchema(logical_type=Integer, semantic_tags={\"numeric\"}),\n        template=ColumnSchema(logical_type=Integer),\n    )\n\n    assert not is_valid_input(\n        candidate=ColumnSchema(logical_type=Integer, semantic_tags={\"index\"}),\n        template=ColumnSchema(logical_type=Double, semantic_tags={\"index\"}),\n    )\n\n    assert not is_valid_input(\n        candidate=ColumnSchema(logical_type=Integer, semantic_tags={}),\n        template=ColumnSchema(logical_type=Integer, semantic_tags={\"index\"}),\n    )\n\n    assert not is_valid_input(\n        candidate=ColumnSchema(),\n        template=ColumnSchema(logical_type=Integer, semantic_tags={\"index\"}),\n    )\n\n    assert not is_valid_input(\n        candidate=ColumnSchema(),\n        
template=ColumnSchema(logical_type=Integer),\n    )\n\n    assert not is_valid_input(\n        candidate=ColumnSchema(),\n        template=ColumnSchema(semantic_tags={\"index\"}),\n    )\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/test_feature_visualizer.py",
    "content": "import json\nimport os\nimport re\n\nimport graphviz\nimport pytest\n\nfrom featuretools.feature_base import (\n    AggregationFeature,\n    DirectFeature,\n    FeatureOutputSlice,\n    GroupByTransformFeature,\n    IdentityFeature,\n    TransformFeature,\n    graph_feature,\n)\nfrom featuretools.primitives import Count, CumMax, Mode, NMostCommon, Year\n\n\n@pytest.fixture\ndef simple_feat(es):\n    return IdentityFeature(es[\"log\"].ww[\"id\"])\n\n\n@pytest.fixture\ndef trans_feat(es):\n    return TransformFeature(IdentityFeature(es[\"customers\"].ww[\"cancel_date\"]), Year)\n\n\ndef test_returns_digraph_object(simple_feat):\n    graph = graph_feature(simple_feat)\n    assert isinstance(graph, graphviz.Digraph)\n\n\ndef test_saving_png_file(simple_feat, tmp_path):\n    output_path = str(tmp_path.joinpath(\"test1.png\"))\n    graph_feature(simple_feat, to_file=output_path)\n    assert os.path.isfile(output_path)\n\n\ndef test_missing_file_extension(simple_feat):\n    output_path = \"test1\"\n    with pytest.raises(ValueError, match=\"Please use a file extension\"):\n        graph_feature(simple_feat, to_file=output_path)\n\n\ndef test_invalid_format(simple_feat):\n    output_path = \"test1.xyz\"\n    with pytest.raises(ValueError, match=\"Unknown format\"):\n        graph_feature(simple_feat, to_file=output_path)\n\n\ndef test_transform(es, trans_feat):\n    feat = trans_feat\n    graph = graph_feature(feat).source\n\n    feat_name = feat.get_name()\n    prim_node = \"0_{}_year\".format(feat_name)\n    dataframe_table = \"\\u2605 customers (target)\"\n    prim_edge = 'customers:cancel_date -> \"{}\"'.format(prim_node)\n    feat_edge = '\"{}\" -> customers:\"{}\"'.format(prim_node, feat_name)\n\n    graph_components = [feat_name, dataframe_table, prim_node, prim_edge, feat_edge]\n    for component in graph_components:\n        assert component in graph\n\n    matches = re.findall(r\"customers \\[label=<\\n<TABLE.*?</TABLE>>\", graph, re.DOTALL)\n    
assert len(matches) == 1\n    rows = re.findall(r\"<TR.*?</TR>\", matches[0], re.DOTALL)\n    assert len(rows) == 3\n    to_match = [\"customers\", \"cancel_date\", feat_name]\n    for match, row in zip(to_match, rows):\n        assert match in row\n\n\ndef test_html_symbols(es, tmp_path):\n    output_path_template = str(tmp_path.joinpath(\"test{}.png\"))\n    value = IdentityFeature(es[\"log\"].ww[\"value\"])\n    gt = value > 5\n    lt = value < 5\n    ge = value >= 5\n    le = value <= 5\n\n    for i, feat in enumerate([gt, lt, ge, le]):\n        output_path = output_path_template.format(i)\n        graph = graph_feature(feat, to_file=output_path).source\n        assert os.path.isfile(output_path)\n        assert feat.get_name() in graph\n\n\ndef test_groupby_transform(es):\n    feat = GroupByTransformFeature(\n        IdentityFeature(es[\"customers\"].ww[\"age\"]),\n        CumMax,\n        IdentityFeature(es[\"customers\"].ww[\"cohort\"]),\n    )\n    graph = graph_feature(feat).source\n\n    feat_name = feat.get_name()\n    prim_node = \"0_{}_cum_max\".format(feat_name)\n    groupby_node = \"{}_groupby_customers--cohort\".format(feat_name)\n    dataframe_table = \"\\u2605 customers (target)\"\n\n    groupby_edge = 'customers:cohort -> \"{}\"'.format(groupby_node)\n    groupby_input = 'customers:age -> \"{}\"'.format(groupby_node)\n    prim_input = '\"{}\" -> \"{}\"'.format(groupby_node, prim_node)\n    feat_edge = '\"{}\" -> customers:\"{}\"'.format(prim_node, feat_name)\n\n    graph_components = [\n        feat_name,\n        prim_node,\n        groupby_node,\n        dataframe_table,\n        groupby_edge,\n        groupby_input,\n        prim_input,\n        feat_edge,\n    ]\n    for component in graph_components:\n        assert component in graph\n\n    matches = re.findall(r\"customers \\[label=<\\n<TABLE.*?</TABLE>>\", graph, re.DOTALL)\n    assert len(matches) == 1\n    rows = re.findall(r\"<TR.*?</TR>\", matches[0], re.DOTALL)\n    assert len(rows) 
== 4\n    assert dataframe_table in rows[0]\n    assert feat_name in rows[-1]\n    assert (\"age\" in rows[1] and \"cohort\" in rows[2]) or (\n        \"age\" in rows[2] and \"cohort\" in rows[1]\n    )\n\n\ndef test_groupby_transform_direct_groupby(es):\n    groupby = DirectFeature(\n        IdentityFeature(es[\"cohorts\"].ww[\"cohort_name\"]),\n        \"customers\",\n    )\n    feat = GroupByTransformFeature(\n        IdentityFeature(es[\"customers\"].ww[\"age\"]),\n        CumMax,\n        groupby,\n    )\n    graph = graph_feature(feat).source\n\n    groupby_name = groupby.get_name()\n    feat_name = feat.get_name()\n    join_node = \"1_{}_join\".format(groupby_name)\n    prim_node = \"0_{}_cum_max\".format(feat_name)\n    groupby_node = \"{}_groupby_customers--{}\".format(feat_name, groupby_name)\n    customers_table = \"\\u2605 customers (target)\"\n    cohorts_table = \"cohorts\"\n\n    join_groupby = '\"{}\" -> customers:cohort'.format(join_node)\n    join_input = 'cohorts:cohort_name -> \"{}\"'.format(join_node)\n    join_out_edge = '\"{}\" -> customers:\"{}\"'.format(join_node, groupby_name)\n    groupby_edge = 'customers:\"{}\" -> \"{}\"'.format(groupby_name, groupby_node)\n    groupby_input = 'customers:age -> \"{}\"'.format(groupby_node)\n    prim_input = '\"{}\" -> \"{}\"'.format(groupby_node, prim_node)\n    feat_edge = '\"{}\" -> customers:\"{}\"'.format(prim_node, feat_name)\n\n    graph_components = [\n        groupby_name,\n        feat_name,\n        join_node,\n        prim_node,\n        groupby_node,\n        customers_table,\n        cohorts_table,\n        join_groupby,\n        join_input,\n        join_out_edge,\n        groupby_edge,\n        groupby_input,\n        prim_input,\n        feat_edge,\n    ]\n    for component in graph_components:\n        assert component in graph\n\n    dataframes = {\n        \"cohorts\": [cohorts_table, \"cohort_name\"],\n        \"customers\": [customers_table, \"cohort\", \"age\", groupby_name, 
feat_name],\n    }\n    for dataframe in dataframes:\n        regex = r\"{} \\[label=<\\n<TABLE.*?</TABLE>>\".format(dataframe)\n        matches = re.findall(regex, graph, re.DOTALL)\n        assert len(matches) == 1\n\n        rows = re.findall(r\"<TR.*?</TR>\", matches[0], re.DOTALL)\n        assert len(rows) == len(dataframes[dataframe])\n\n        for row in rows:\n            matched = False\n            for i in dataframes[dataframe]:\n                if i in row:\n                    matched = True\n                    dataframes[dataframe].remove(i)\n                    break\n            assert matched\n\n\ndef test_aggregation(es):\n    feat = AggregationFeature(IdentityFeature(es[\"log\"].ww[\"id\"]), \"sessions\", Count)\n    graph = graph_feature(feat).source\n\n    feat_name = feat.get_name()\n    prim_node = \"0_{}_count\".format(feat_name)\n    groupby_node = \"{}_groupby_log--session_id\".format(feat_name)\n\n    sessions_table = \"\\u2605 sessions (target)\"\n    log_table = \"log\"\n    groupby_edge = 'log:session_id -> \"{}\"'.format(groupby_node)\n    groupby_input = 'log:id -> \"{}\"'.format(groupby_node)\n    prim_input = '\"{}\" -> \"{}\"'.format(groupby_node, prim_node)\n    feat_edge = '\"{}\" -> sessions:\"{}\"'.format(prim_node, feat_name)\n\n    graph_components = [\n        feat_name,\n        prim_node,\n        groupby_node,\n        sessions_table,\n        log_table,\n        groupby_edge,\n        groupby_input,\n        prim_input,\n        feat_edge,\n    ]\n\n    for component in graph_components:\n        assert component in graph\n\n    dataframes = {\n        \"log\": [log_table, \"id\", \"session_id\"],\n        \"sessions\": [sessions_table, feat_name],\n    }\n    for dataframe in dataframes:\n        regex = r\"{} \\[label=<\\n<TABLE.*?</TABLE>>\".format(dataframe)\n        matches = re.findall(regex, graph, re.DOTALL)\n        assert len(matches) == 1\n\n        rows = re.findall(r\"<TR.*?</TR>\", matches[0], 
re.DOTALL)\n        assert len(rows) == len(dataframes[dataframe])\n        for row in rows:\n            matched = False\n            for i in dataframes[dataframe]:\n                if i in row:\n                    matched = True\n                    dataframes[dataframe].remove(i)\n                    break\n            assert matched\n\n\ndef test_multioutput(es):\n    multioutput = AggregationFeature(\n        IdentityFeature(es[\"log\"].ww[\"zipcode\"]),\n        \"sessions\",\n        NMostCommon,\n    )\n    feat = FeatureOutputSlice(multioutput, 0)\n    graph = graph_feature(feat).source\n\n    feat_name = feat.get_name()\n    prim_node = \"0_{}_n_most_common\".format(multioutput.get_name())\n    groupby_node = \"{}_groupby_log--session_id\".format(multioutput.get_name())\n\n    sessions_table = \"\\u2605 sessions (target)\"\n    log_table = \"log\"\n    groupby_edge = 'log:session_id -> \"{}\"'.format(groupby_node)\n    groupby_input = 'log:zipcode -> \"{}\"'.format(groupby_node)\n    prim_input = '\"{}\" -> \"{}\"'.format(groupby_node, prim_node)\n    feat_edge = '\"{}\" -> sessions:\"{}\"'.format(prim_node, feat_name)\n\n    graph_components = [\n        feat_name,\n        prim_node,\n        groupby_node,\n        sessions_table,\n        log_table,\n        groupby_edge,\n        groupby_input,\n        prim_input,\n        feat_edge,\n    ]\n\n    for component in graph_components:\n        assert component in graph\n\n    dataframes = {\n        \"log\": [log_table, \"zipcode\", \"session_id\"],\n        \"sessions\": [sessions_table, feat_name],\n    }\n    for dataframe in dataframes:\n        regex = r\"{} \\[label=<\\n<TABLE.*?</TABLE>>\".format(dataframe)\n        matches = re.findall(regex, graph, re.DOTALL)\n        assert len(matches) == 1\n\n        rows = re.findall(r\"<TR.*?</TR>\", matches[0], re.DOTALL)\n        assert len(rows) == len(dataframes[dataframe])\n        for row in rows:\n            matched = False\n            for i in 
dataframes[dataframe]:\n                if i in row:\n                    matched = True\n                    dataframes[dataframe].remove(i)\n                    break\n            assert matched\n\n\ndef test_direct(es):\n    d1 = DirectFeature(\n        IdentityFeature(es[\"customers\"].ww[\"engagement_level\"]),\n        \"sessions\",\n    )\n    d2 = DirectFeature(d1, \"log\")\n    graph = graph_feature(d2).source\n\n    d1_name = d1.get_name()\n    d2_name = d2.get_name()\n    prim_node1 = \"1_{}_join\".format(d1_name)\n    prim_node2 = \"0_{}_join\".format(d2_name)\n\n    log_table = \"\\u2605 log (target)\"\n    sessions_table = \"sessions\"\n    customers_table = \"customers\"\n    groupby_edge1 = '\"{}\" -> sessions:customer_id'.format(prim_node1)\n    groupby_edge2 = '\"{}\" -> log:session_id'.format(prim_node2)\n    groupby_input1 = 'customers:engagement_level -> \"{}\"'.format(prim_node1)\n    groupby_input2 = 'sessions:\"{}\" -> \"{}\"'.format(d1_name, prim_node2)\n    d1_edge = '\"{}\" -> sessions:\"{}\"'.format(prim_node1, d1_name)\n    d2_edge = '\"{}\" -> log:\"{}\"'.format(prim_node2, d2_name)\n\n    graph_components = [\n        d1_name,\n        d2_name,\n        prim_node1,\n        prim_node2,\n        log_table,\n        sessions_table,\n        customers_table,\n        groupby_edge1,\n        groupby_edge2,\n        groupby_input1,\n        groupby_input2,\n        d1_edge,\n        d2_edge,\n    ]\n    for component in graph_components:\n        assert component in graph\n\n    dataframes = {\n        \"customers\": [customers_table, \"engagement_level\"],\n        \"sessions\": [sessions_table, \"customer_id\", d1_name],\n        \"log\": [log_table, \"session_id\", d2_name],\n    }\n\n    for dataframe in dataframes:\n        regex = r\"{} \\[label=<\\n<TABLE.*?</TABLE>>\".format(dataframe)\n        matches = re.findall(regex, graph, re.DOTALL)\n        assert len(matches) == 1\n\n        rows = re.findall(r\"<TR.*?</TR>\", matches[0], 
re.DOTALL)\n        assert len(rows) == len(dataframes[dataframe])\n        for row in rows:\n            matched = False\n            for i in dataframes[dataframe]:\n                if i in row:\n                    matched = True\n                    dataframes[dataframe].remove(i)\n                    break\n            assert matched\n\n\ndef test_stacked(es, trans_feat):\n    stacked = AggregationFeature(trans_feat, \"cohorts\", Mode)\n    graph = graph_feature(stacked).source\n\n    feat_name = stacked.get_name()\n    intermediate_name = trans_feat.get_name()\n    agg_primitive = \"0_{}_mode\".format(feat_name)\n    trans_primitive = \"1_{}_year\".format(intermediate_name)\n    groupby_node = \"{}_groupby_customers--cohort\".format(feat_name)\n\n    trans_prim_edge = 'customers:cancel_date -> \"{}\"'.format(trans_primitive)\n    intermediate_edge = '\"{}\" -> customers:\"{}\"'.format(\n        trans_primitive,\n        intermediate_name,\n    )\n    groupby_edge = 'customers:cohort -> \"{}\"'.format(groupby_node)\n    groupby_input = 'customers:\"{}\" -> \"{}\"'.format(intermediate_name, groupby_node)\n    agg_input = '\"{}\" -> \"{}\"'.format(groupby_node, agg_primitive)\n    feat_edge = '\"{}\" -> cohorts:\"{}\"'.format(agg_primitive, feat_name)\n\n    graph_components = [\n        feat_name,\n        intermediate_name,\n        agg_primitive,\n        trans_primitive,\n        groupby_node,\n        trans_prim_edge,\n        intermediate_edge,\n        groupby_edge,\n        groupby_input,\n        agg_input,\n        feat_edge,\n    ]\n    for component in graph_components:\n        assert component in graph\n\n    agg_primitive = agg_primitive.replace(\"(\", \"\\\\(\").replace(\")\", \"\\\\)\")\n    agg_node = re.findall('\"{}\" \\\\[label.*'.format(agg_primitive), graph)\n    assert len(agg_node) == 1\n    assert \"Step 2\" in agg_node[0]\n\n    trans_primitive = trans_primitive.replace(\"(\", \"\\\\(\").replace(\")\", \"\\\\)\")\n    trans_node = 
re.findall('\"{}\" \\\\[label.*'.format(trans_primitive), graph)\n    assert len(trans_node) == 1\n    assert \"Step 1\" in trans_node[0]\n\n\ndef test_description_auto_caption(trans_feat):\n    default_graph = graph_feature(trans_feat, description=True).source\n    default_label = 'label=\"The year of the \\\\\"cancel_date\\\\\".\"'\n    assert default_label in default_graph\n\n\ndef test_description_auto_caption_metadata(trans_feat, tmp_path):\n    feature_descriptions = {\"customers: cancel_date\": \"the date the customer cancelled\"}\n    primitive_templates = {\"year\": \"the year that {} occurred\"}\n    metadata_graph = graph_feature(\n        trans_feat,\n        description=True,\n        feature_descriptions=feature_descriptions,\n        primitive_templates=primitive_templates,\n    ).source\n\n    metadata_label = 'label=\"The year that the date the customer cancelled occurred.\"'\n    assert metadata_label in metadata_graph\n\n    metadata = {\n        \"feature_descriptions\": feature_descriptions,\n        \"primitive_templates\": primitive_templates,\n    }\n    metadata_path = os.path.join(tmp_path, \"description_metadata.json\")\n    with open(metadata_path, \"w\") as f:\n        json.dump(metadata, f)\n    json_metadata_graph = graph_feature(\n        trans_feat,\n        description=True,\n        metadata_file=metadata_path,\n    ).source\n    assert metadata_label in json_metadata_graph\n\n\ndef test_description_custom_caption(trans_feat):\n    custom_description = \"A custom feature description\"\n    custom_description_graph = graph_feature(\n        trans_feat,\n        description=custom_description,\n    ).source\n    custom_description_label = 'label=\"A custom feature description\"'\n    assert custom_description_label in custom_description_graph\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/test_features_deserializer.py",
    "content": "import logging\nfrom unittest.mock import patch\n\nimport pandas as pd\nimport pytest\n\nfrom featuretools import (\n    AggregationFeature,\n    Feature,\n    IdentityFeature,\n    TransformFeature,\n    __version__,\n)\nfrom featuretools.feature_base.features_deserializer import FeaturesDeserializer\nfrom featuretools.primitives import (\n    Count,\n    Max,\n    MultiplyNumericScalar,\n    NMostCommon,\n    NumberOfCommonWords,\n    NumUnique,\n)\nfrom featuretools.primitives.utils import serialize_primitive\nfrom featuretools.utils.schema_utils import FEATURES_SCHEMA_VERSION\n\n\ndef test_single_feature(es):\n    feature = IdentityFeature(es[\"log\"].ww[\"value\"])\n    dictionary = {\n        \"ft_version\": __version__,\n        \"schema_version\": FEATURES_SCHEMA_VERSION,\n        \"entityset\": es.to_dictionary(),\n        \"feature_list\": [feature.unique_name()],\n        \"feature_definitions\": {feature.unique_name(): feature.to_dictionary()},\n        \"primitive_definitions\": {},\n    }\n    deserializer = FeaturesDeserializer(dictionary)\n\n    expected = [feature]\n    assert expected == deserializer.to_list()\n\n\ndef test_multioutput_feature(es):\n    value = IdentityFeature(es[\"log\"].ww[\"product_id\"])\n    threecommon = NMostCommon()\n    num_unique = NumUnique()\n    tc = Feature(value, parent_dataframe_name=\"sessions\", primitive=threecommon)\n\n    features = [tc, value]\n    for i in range(3):\n        features.append(\n            Feature(\n                tc[i],\n                parent_dataframe_name=\"customers\",\n                primitive=num_unique,\n            ),\n        )\n        features.append(tc[i])\n\n    flist = [feat.unique_name() for feat in features]\n    fd = [feat.to_dictionary() for feat in features]\n    fdict = dict(zip(flist, fd))\n\n    dictionary = {\n        \"ft_version\": __version__,\n        \"schema_version\": FEATURES_SCHEMA_VERSION,\n        \"entityset\": es.to_dictionary(),\n        
\"feature_list\": flist,\n        \"feature_definitions\": fdict,\n    }\n    dictionary[\"primitive_definitions\"] = {\n        \"0\": serialize_primitive(threecommon),\n        \"1\": serialize_primitive(num_unique),\n    }\n\n    dictionary[\"feature_definitions\"][flist[0]][\"arguments\"][\"primitive\"] = \"0\"\n    dictionary[\"feature_definitions\"][flist[2]][\"arguments\"][\"primitive\"] = \"1\"\n    dictionary[\"feature_definitions\"][flist[4]][\"arguments\"][\"primitive\"] = \"1\"\n    dictionary[\"feature_definitions\"][flist[6]][\"arguments\"][\"primitive\"] = \"1\"\n    deserializer = FeaturesDeserializer(dictionary).to_list()\n\n    for i in range(len(features)):\n        assert features[i].unique_name() == deserializer[i].unique_name()\n\n\ndef test_base_features_in_list(es):\n    max_primitive = Max()\n    value = IdentityFeature(es[\"log\"].ww[\"value\"])\n    max_feat = AggregationFeature(value, \"sessions\", max_primitive)\n    dictionary = {\n        \"ft_version\": __version__,\n        \"schema_version\": FEATURES_SCHEMA_VERSION,\n        \"entityset\": es.to_dictionary(),\n        \"feature_list\": [max_feat.unique_name(), value.unique_name()],\n        \"feature_definitions\": {\n            max_feat.unique_name(): max_feat.to_dictionary(),\n            value.unique_name(): value.to_dictionary(),\n        },\n    }\n    dictionary[\"primitive_definitions\"] = {\"0\": serialize_primitive(max_primitive)}\n    dictionary[\"feature_definitions\"][max_feat.unique_name()][\"arguments\"][\n        \"primitive\"\n    ] = \"0\"\n    deserializer = FeaturesDeserializer(dictionary)\n\n    expected = [max_feat, value]\n    assert expected == deserializer.to_list()\n\n\ndef test_base_features_not_in_list(es):\n    max_primitive = Max()\n    mult_primitive = MultiplyNumericScalar(value=2)\n    value = IdentityFeature(es[\"log\"].ww[\"value\"])\n    value_x2 = TransformFeature(value, mult_primitive)\n    max_feat = AggregationFeature(value_x2, \"sessions\", 
max_primitive)\n    dictionary = {\n        \"ft_version\": __version__,\n        \"schema_version\": FEATURES_SCHEMA_VERSION,\n        \"entityset\": es.to_dictionary(),\n        \"feature_list\": [max_feat.unique_name()],\n        \"feature_definitions\": {\n            max_feat.unique_name(): max_feat.to_dictionary(),\n            value_x2.unique_name(): value_x2.to_dictionary(),\n            value.unique_name(): value.to_dictionary(),\n        },\n    }\n    dictionary[\"primitive_definitions\"] = {\n        \"0\": serialize_primitive(max_primitive),\n        \"1\": serialize_primitive(mult_primitive),\n    }\n    dictionary[\"feature_definitions\"][max_feat.unique_name()][\"arguments\"][\n        \"primitive\"\n    ] = \"0\"\n    dictionary[\"feature_definitions\"][value_x2.unique_name()][\"arguments\"][\n        \"primitive\"\n    ] = \"1\"\n    deserializer = FeaturesDeserializer(dictionary)\n\n    expected = [max_feat]\n    assert expected == deserializer.to_list()\n\n\n@patch(\"featuretools.utils.schema_utils.FEATURES_SCHEMA_VERSION\", \"1.1.1\")\n@pytest.mark.parametrize(\n    \"hardcoded_schema_version, warns\",\n    [(\"2.1.1\", True), (\"1.2.1\", True), (\"1.1.2\", True), (\"1.0.2\", False)],\n)\ndef test_later_schema_version(es, caplog, hardcoded_schema_version, warns):\n    def test_version(version, warns):\n        if warns:\n            warning_text = (\n                \"The schema version of the saved features\"\n                \"(%s) is greater than the latest supported (%s). \"\n                \"You may need to upgrade featuretools. 
Attempting to load features ...\"\n                % (version, \"1.1.1\")\n            )\n        else:\n            warning_text = None\n\n        _check_schema_version(version, es, warning_text, caplog, \"warn\")\n\n    test_version(hardcoded_schema_version, warns)\n\n\n@patch(\"featuretools.utils.schema_utils.FEATURES_SCHEMA_VERSION\", \"1.1.1\")\n@pytest.mark.parametrize(\n    \"hardcoded_schema_version, warns\",\n    [(\"0.1.1\", True), (\"1.0.1\", False), (\"1.1.0\", False)],\n)\ndef test_earlier_schema_version(es, caplog, hardcoded_schema_version, warns):\n    def test_version(version, warns):\n        if warns:\n            warning_text = (\n                \"The schema version of the saved features\"\n                \"(%s) is no longer supported by this version \"\n                \"of featuretools. Attempting to load features ...\" % version\n            )\n        else:\n            warning_text = None\n\n        _check_schema_version(version, es, warning_text, caplog, \"log\")\n\n    test_version(hardcoded_schema_version, warns)\n\n\ndef test_unknown_feature_type(es):\n    dictionary = {\n        \"ft_version\": __version__,\n        \"schema_version\": FEATURES_SCHEMA_VERSION,\n        \"entityset\": es.to_dictionary(),\n        \"feature_list\": [\"feature_1\"],\n        \"feature_definitions\": {\n            \"feature_1\": {\"type\": \"FakeFeature\", \"dependencies\": [], \"arguments\": {}},\n        },\n        \"primitive_definitions\": {},\n    }\n\n    deserializer = FeaturesDeserializer(dictionary)\n\n    with pytest.raises(RuntimeError, match='Unrecognized feature type \"FakeFeature\"'):\n        deserializer.to_list()\n\n\ndef test_unknown_primitive_type(es):\n    value = IdentityFeature(es[\"log\"].ww[\"value\"])\n    max_feat = AggregationFeature(value, \"sessions\", Max)\n    primitive_dict = serialize_primitive(Max())\n    primitive_dict[\"type\"] = \"FakePrimitive\"\n    dictionary = {\n        \"ft_version\": __version__,\n        
\"schema_version\": FEATURES_SCHEMA_VERSION,\n        \"entityset\": es.to_dictionary(),\n        \"feature_list\": [max_feat.unique_name(), value.unique_name()],\n        \"feature_definitions\": {\n            max_feat.unique_name(): max_feat.to_dictionary(),\n            value.unique_name(): value.to_dictionary(),\n        },\n        \"primitive_definitions\": {\"0\": primitive_dict},\n    }\n\n    with pytest.raises(RuntimeError) as excinfo:\n        FeaturesDeserializer(dictionary)\n\n    error_text = 'Primitive \"FakePrimitive\" in module \"%s\" not found' % Max.__module__\n    assert error_text == str(excinfo.value)\n\n\ndef test_unknown_primitive_module(es):\n    value = IdentityFeature(es[\"log\"].ww[\"value\"])\n    max_feat = AggregationFeature(value, \"sessions\", Max)\n    primitive_dict = serialize_primitive(Max())\n    primitive_dict[\"module\"] = \"fake.module\"\n    dictionary = {\n        \"ft_version\": __version__,\n        \"schema_version\": FEATURES_SCHEMA_VERSION,\n        \"entityset\": es.to_dictionary(),\n        \"feature_list\": [max_feat.unique_name(), value.unique_name()],\n        \"feature_definitions\": {\n            max_feat.unique_name(): max_feat.to_dictionary(),\n            value.unique_name(): value.to_dictionary(),\n        },\n        \"primitive_definitions\": {\"0\": primitive_dict},\n    }\n\n    with pytest.raises(RuntimeError) as excinfo:\n        FeaturesDeserializer(dictionary)\n\n    error_text = 'Primitive \"Max\" in module \"fake.module\" not found'\n    assert error_text == str(excinfo.value)\n\n\ndef test_feature_use_previous_pd_timedelta(es):\n    value = IdentityFeature(es[\"log\"].ww[\"id\"])\n    td = pd.Timedelta(12, \"W\")\n    count_primitive = Count()\n    count_feature = AggregationFeature(\n        value,\n        \"customers\",\n        count_primitive,\n        use_previous=td,\n    )\n    dictionary = {\n        \"ft_version\": __version__,\n        \"schema_version\": FEATURES_SCHEMA_VERSION,\n   
     \"entityset\": es.to_dictionary(),\n        \"feature_list\": [count_feature.unique_name(), value.unique_name()],\n        \"feature_definitions\": {\n            count_feature.unique_name(): count_feature.to_dictionary(),\n            value.unique_name(): value.to_dictionary(),\n        },\n    }\n    dictionary[\"primitive_definitions\"] = {\"0\": serialize_primitive(count_primitive)}\n    dictionary[\"feature_definitions\"][count_feature.unique_name()][\"arguments\"][\n        \"primitive\"\n    ] = \"0\"\n    deserializer = FeaturesDeserializer(dictionary)\n\n    expected = [count_feature, value]\n    assert expected == deserializer.to_list()\n\n\ndef test_feature_use_previous_pd_dateoffset(es):\n    value = IdentityFeature(es[\"log\"].ww[\"id\"])\n    do = pd.DateOffset(months=3)\n    count_primitive = Count()\n    count_feature = AggregationFeature(\n        value,\n        \"customers\",\n        count_primitive,\n        use_previous=do,\n    )\n    dictionary = {\n        \"ft_version\": __version__,\n        \"schema_version\": FEATURES_SCHEMA_VERSION,\n        \"entityset\": es.to_dictionary(),\n        \"feature_list\": [count_feature.unique_name(), value.unique_name()],\n        \"feature_definitions\": {\n            count_feature.unique_name(): count_feature.to_dictionary(),\n            value.unique_name(): value.to_dictionary(),\n        },\n    }\n    dictionary[\"primitive_definitions\"] = {\"0\": serialize_primitive(count_primitive)}\n    dictionary[\"feature_definitions\"][count_feature.unique_name()][\"arguments\"][\n        \"primitive\"\n    ] = \"0\"\n    deserializer = FeaturesDeserializer(dictionary)\n\n    expected = [count_feature, value]\n    assert expected == deserializer.to_list()\n\n    value = IdentityFeature(es[\"log\"].ww[\"id\"])\n    do = pd.DateOffset(months=3, days=2, minutes=30)\n    count_feature = AggregationFeature(\n        value,\n        \"customers\",\n        count_primitive,\n        use_previous=do,\n    )\n  
  dictionary = {\n        \"ft_version\": __version__,\n        \"schema_version\": FEATURES_SCHEMA_VERSION,\n        \"entityset\": es.to_dictionary(),\n        \"feature_list\": [count_feature.unique_name(), value.unique_name()],\n        \"feature_definitions\": {\n            count_feature.unique_name(): count_feature.to_dictionary(),\n            value.unique_name(): value.to_dictionary(),\n        },\n    }\n    dictionary[\"primitive_definitions\"] = {\"0\": serialize_primitive(count_primitive)}\n    dictionary[\"feature_definitions\"][count_feature.unique_name()][\"arguments\"][\n        \"primitive\"\n    ] = \"0\"\n    deserializer = FeaturesDeserializer(dictionary)\n\n    expected = [count_feature, value]\n    assert expected == deserializer.to_list()\n\n\ndef test_word_set_in_number_of_common_words_is_deserialized_back_into_a_set(es):\n    id_feat = IdentityFeature(es[\"log\"].ww[\"comments\"])\n    number_of_common_words = NumberOfCommonWords(word_set={\"hello\", \"my\"})\n    transform_feat = TransformFeature(id_feat, number_of_common_words)\n    dictionary = {\n        \"ft_version\": __version__,\n        \"schema_version\": FEATURES_SCHEMA_VERSION,\n        \"entityset\": es.to_dictionary(),\n        \"feature_list\": [id_feat.unique_name(), transform_feat.unique_name()],\n        \"feature_definitions\": {\n            id_feat.unique_name(): id_feat.to_dictionary(),\n            transform_feat.unique_name(): transform_feat.to_dictionary(),\n        },\n        \"primitive_definitions\": {\"0\": serialize_primitive(number_of_common_words)},\n    }\n    dictionary[\"feature_definitions\"][transform_feat.unique_name()][\"arguments\"][\n        \"primitive\"\n    ] = \"0\"\n    deserializer = FeaturesDeserializer(dictionary)\n    assert isinstance(\n        deserializer.features_dict[\"primitive_definitions\"][\"0\"][\"arguments\"][\n            \"word_set\"\n        ],\n        set,\n    )\n\n\ndef _check_schema_version(version, es, warning_text, 
caplog, warning_type=None):\n    dictionary = {\n        \"ft_version\": __version__,\n        \"schema_version\": version,\n        \"entityset\": es.to_dictionary(),\n        \"feature_list\": [],\n        \"feature_definitions\": {},\n        \"primitive_definitions\": {},\n    }\n\n    if warning_type == \"warn\" and warning_text:\n        with pytest.warns(UserWarning) as record:\n            FeaturesDeserializer(dictionary)\n        assert record[0].message.args[0] == warning_text\n    elif warning_type == \"log\":\n        logger = logging.getLogger(\"featuretools\")\n        logger.propagate = True\n        FeaturesDeserializer(dictionary)\n        if warning_text:\n            assert warning_text in caplog.text\n        else:\n            assert not len(caplog.text)\n        logger.propagate = False\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/test_features_serializer.py",
    "content": "import pandas as pd\n\nfrom featuretools import (\n    AggregationFeature,\n    Feature,\n    IdentityFeature,\n    TransformFeature,\n    __version__,\n)\nfrom featuretools.entityset.deserialize import description_to_entityset\nfrom featuretools.feature_base.features_serializer import FeaturesSerializer\nfrom featuretools.primitives import (\n    Count,\n    Max,\n    MultiplyNumericScalar,\n    NMostCommon,\n    NumUnique,\n)\nfrom featuretools.primitives.utils import serialize_primitive\nfrom featuretools.version import FEATURES_SCHEMA_VERSION\n\n\ndef test_single_feature(es):\n    feature = IdentityFeature(es[\"log\"].ww[\"value\"])\n    serializer = FeaturesSerializer([feature])\n\n    expected = {\n        \"ft_version\": __version__,\n        \"schema_version\": FEATURES_SCHEMA_VERSION,\n        \"entityset\": es.to_dictionary(),\n        \"feature_list\": [feature.unique_name()],\n        \"feature_definitions\": {feature.unique_name(): feature.to_dictionary()},\n        \"primitive_definitions\": {},\n    }\n\n    _compare_feature_dicts(expected, serializer.to_dict())\n\n\ndef test_base_features_in_list(es):\n    value = IdentityFeature(es[\"log\"].ww[\"value\"])\n    max_feature = AggregationFeature(value, \"sessions\", Max)\n    features = [max_feature, value]\n    serializer = FeaturesSerializer(features)\n\n    expected = {\n        \"ft_version\": __version__,\n        \"schema_version\": FEATURES_SCHEMA_VERSION,\n        \"entityset\": es.to_dictionary(),\n        \"feature_list\": [max_feature.unique_name(), value.unique_name()],\n        \"feature_definitions\": {\n            max_feature.unique_name(): max_feature.to_dictionary(),\n            value.unique_name(): value.to_dictionary(),\n        },\n    }\n    expected[\"primitive_definitions\"] = {\n        \"0\": serialize_primitive(max_feature.primitive),\n    }\n    expected[\"feature_definitions\"][max_feature.unique_name()][\"arguments\"][\n        \"primitive\"\n    ] = 
\"0\"\n\n    actual = serializer.to_dict()\n    _compare_feature_dicts(expected, actual)\n\n\ndef test_multi_output_features(es):\n    product_id = IdentityFeature(es[\"log\"].ww[\"product_id\"])\n    threecommon = NMostCommon()\n    num_unique = NumUnique()\n    tc = Feature(product_id, parent_dataframe_name=\"sessions\", primitive=threecommon)\n\n    features = [tc, product_id]\n    for i in range(3):\n        features.append(\n            Feature(\n                tc[i],\n                parent_dataframe_name=\"customers\",\n                primitive=num_unique,\n            ),\n        )\n        features.append(tc[i])\n\n    serializer = FeaturesSerializer(features)\n\n    flist = [feat.unique_name() for feat in features]\n    fd = [feat.to_dictionary() for feat in features]\n    fdict = dict(zip(flist, fd))\n\n    expected = {\n        \"ft_version\": __version__,\n        \"schema_version\": FEATURES_SCHEMA_VERSION,\n        \"entityset\": es.to_dictionary(),\n        \"feature_list\": flist,\n        \"feature_definitions\": fdict,\n    }\n    expected[\"primitive_definitions\"] = {\n        \"0\": serialize_primitive(tc.primitive),\n        \"1\": serialize_primitive(features[2].primitive),\n    }\n\n    expected[\"feature_definitions\"][flist[0]][\"arguments\"][\"primitive\"] = \"0\"\n    expected[\"feature_definitions\"][flist[2]][\"arguments\"][\"primitive\"] = \"1\"\n    expected[\"feature_definitions\"][flist[4]][\"arguments\"][\"primitive\"] = \"1\"\n    expected[\"feature_definitions\"][flist[6]][\"arguments\"][\"primitive\"] = \"1\"\n\n    actual = serializer.to_dict()\n    _compare_feature_dicts(expected, actual)\n\n\ndef test_base_features_not_in_list(es):\n    max_primitive = Max()\n    mult_primitive = MultiplyNumericScalar(value=2)\n    value = IdentityFeature(es[\"log\"].ww[\"value\"])\n    value_x2 = TransformFeature(value, mult_primitive)\n    max_feature = AggregationFeature(value_x2, \"sessions\", max_primitive)\n    features = 
[max_feature]\n    serializer = FeaturesSerializer(features)\n\n    expected = {\n        \"ft_version\": __version__,\n        \"schema_version\": FEATURES_SCHEMA_VERSION,\n        \"entityset\": es.to_dictionary(),\n        \"feature_list\": [max_feature.unique_name()],\n        \"feature_definitions\": {\n            max_feature.unique_name(): max_feature.to_dictionary(),\n            value_x2.unique_name(): value_x2.to_dictionary(),\n            value.unique_name(): value.to_dictionary(),\n        },\n    }\n    expected[\"primitive_definitions\"] = {\n        \"0\": serialize_primitive(max_feature.primitive),\n        \"1\": serialize_primitive(value_x2.primitive),\n    }\n    expected[\"feature_definitions\"][max_feature.unique_name()][\"arguments\"][\n        \"primitive\"\n    ] = \"0\"\n    expected[\"feature_definitions\"][value_x2.unique_name()][\"arguments\"][\n        \"primitive\"\n    ] = \"1\"\n\n    actual = serializer.to_dict()\n    _compare_feature_dicts(expected, actual)\n\n\ndef test_where_feature_dependency(es):\n    max_primitive = Max()\n    value = IdentityFeature(es[\"log\"].ww[\"value\"])\n    is_purchased = IdentityFeature(es[\"log\"].ww[\"purchased\"])\n    max_feature = AggregationFeature(\n        value,\n        \"sessions\",\n        max_primitive,\n        where=is_purchased,\n    )\n    features = [max_feature]\n    serializer = FeaturesSerializer(features)\n\n    expected = {\n        \"ft_version\": __version__,\n        \"schema_version\": FEATURES_SCHEMA_VERSION,\n        \"entityset\": es.to_dictionary(),\n        \"feature_list\": [max_feature.unique_name()],\n        \"feature_definitions\": {\n            max_feature.unique_name(): max_feature.to_dictionary(),\n            value.unique_name(): value.to_dictionary(),\n            is_purchased.unique_name(): is_purchased.to_dictionary(),\n        },\n    }\n    expected[\"primitive_definitions\"] = {\n        \"0\": serialize_primitive(max_feature.primitive),\n    }\n    
expected[\"feature_definitions\"][max_feature.unique_name()][\"arguments\"][\n        \"primitive\"\n    ] = \"0\"\n\n    actual = serializer.to_dict()\n    _compare_feature_dicts(expected, actual)\n\n\ndef test_feature_use_previous_pd_timedelta(es):\n    value = IdentityFeature(es[\"log\"].ww[\"id\"])\n    td = pd.Timedelta(12, \"W\")\n    count_primitive = Count()\n    count_feature = AggregationFeature(\n        value,\n        \"customers\",\n        count_primitive,\n        use_previous=td,\n    )\n    features = [count_feature, value]\n    serializer = FeaturesSerializer(features)\n\n    expected = {\n        \"ft_version\": __version__,\n        \"schema_version\": FEATURES_SCHEMA_VERSION,\n        \"entityset\": es.to_dictionary(),\n        \"feature_list\": [count_feature.unique_name(), value.unique_name()],\n        \"feature_definitions\": {\n            count_feature.unique_name(): count_feature.to_dictionary(),\n            value.unique_name(): value.to_dictionary(),\n        },\n    }\n    expected[\"primitive_definitions\"] = {\n        \"0\": serialize_primitive(count_feature.primitive),\n    }\n    expected[\"feature_definitions\"][count_feature.unique_name()][\"arguments\"][\n        \"primitive\"\n    ] = \"0\"\n\n    actual = serializer.to_dict()\n    _compare_feature_dicts(expected, actual)\n\n\ndef test_feature_use_previous_pd_dateoffset(es):\n    value = IdentityFeature(es[\"log\"].ww[\"id\"])\n    do = pd.DateOffset(months=3)\n    count_primitive = Count()\n    count_feature = AggregationFeature(\n        value,\n        \"customers\",\n        count_primitive,\n        use_previous=do,\n    )\n    features = [count_feature, value]\n    serializer = FeaturesSerializer(features)\n\n    expected = {\n        \"ft_version\": __version__,\n        \"schema_version\": FEATURES_SCHEMA_VERSION,\n        \"entityset\": es.to_dictionary(),\n        \"feature_list\": [count_feature.unique_name(), value.unique_name()],\n        
\"feature_definitions\": {\n            count_feature.unique_name(): count_feature.to_dictionary(),\n            value.unique_name(): value.to_dictionary(),\n        },\n    }\n    expected[\"primitive_definitions\"] = {\n        \"0\": serialize_primitive(count_feature.primitive),\n    }\n    expected[\"feature_definitions\"][count_feature.unique_name()][\"arguments\"][\n        \"primitive\"\n    ] = \"0\"\n\n    actual = serializer.to_dict()\n    _compare_feature_dicts(expected, actual)\n\n    value = IdentityFeature(es[\"log\"].ww[\"id\"])\n    do = pd.DateOffset(months=3, days=2, minutes=30)\n    count_feature = AggregationFeature(\n        value,\n        \"customers\",\n        count_primitive,\n        use_previous=do,\n    )\n    features = [count_feature, value]\n    serializer = FeaturesSerializer(features)\n\n    expected = {\n        \"ft_version\": __version__,\n        \"schema_version\": FEATURES_SCHEMA_VERSION,\n        \"entityset\": es.to_dictionary(),\n        \"feature_list\": [count_feature.unique_name(), value.unique_name()],\n        \"feature_definitions\": {\n            count_feature.unique_name(): count_feature.to_dictionary(),\n            value.unique_name(): value.to_dictionary(),\n        },\n    }\n    expected[\"primitive_definitions\"] = {\n        \"0\": serialize_primitive(count_feature.primitive),\n    }\n    expected[\"feature_definitions\"][count_feature.unique_name()][\"arguments\"][\n        \"primitive\"\n    ] = \"0\"\n    actual = serializer.to_dict()\n    _compare_feature_dicts(expected, actual)\n\n\ndef _compare_feature_dicts(a_dict, b_dict):\n    # We can't compare entityset dictionaries because column lists are not\n    # guaranteed to be in the same order.\n    es_a = description_to_entityset(a_dict.pop(\"entityset\"))\n    es_b = description_to_entityset(b_dict.pop(\"entityset\"))\n    assert es_a == es_b\n\n    assert a_dict == b_dict\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/test_groupby_transform_primitives.py",
    "content": "import numpy as np\nimport pandas as pd\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime\n\nfrom featuretools import (\n    Feature,\n    GroupByTransformFeature,\n    IdentityFeature,\n    calculate_feature_matrix,\n    feature_base,\n)\nfrom featuretools.computational_backends.feature_set import FeatureSet\nfrom featuretools.computational_backends.feature_set_calculator import (\n    FeatureSetCalculator,\n)\nfrom featuretools.primitives import CumCount, CumMax, CumMean, CumMin, CumSum, Last\nfrom featuretools.primitives.base import TransformPrimitive\nfrom featuretools.synthesis import dfs\nfrom featuretools.tests.testing_utils import feature_with_name\n\n\nclass TestCumCount:\n    primitive = CumCount\n\n    def test_order(self):\n        g = pd.Series([\"a\", \"b\", \"a\"])\n\n        answers = ([1, 2], [1])\n\n        function = self.primitive().get_function()\n        for (_, group), answer in zip(g.groupby(g), answers):\n            np.testing.assert_array_equal(function(group), answer)\n\n    def test_regular(self):\n        g = pd.Series([\"a\", \"b\", \"a\", \"c\", \"d\", \"b\"])\n        answers = ([1, 2], [1, 2], [1], [1])\n\n        function = self.primitive().get_function()\n        for (_, group), answer in zip(g.groupby(g), answers):\n            np.testing.assert_array_equal(function(group), answer)\n\n    def test_discrete(self):\n        g = pd.Series([\"a\", \"b\", \"a\", \"c\", \"d\", \"b\"])\n        answers = ([1, 2], [1, 2], [1], [1])\n\n        function = self.primitive().get_function()\n        for (_, group), answer in zip(g.groupby(g), answers):\n            np.testing.assert_array_equal(function(group), answer)\n\n\nclass TestCumSum:\n    primitive = CumSum\n\n    def test_order(self):\n        v = pd.Series([1, 2, 2])\n        g = pd.Series([\"a\", \"b\", \"a\"])\n\n        answers = ([1, 3], [2])\n\n        function = self.primitive().get_function()\n        for (_, group), 
answer in zip(v.groupby(g), answers):\n            np.testing.assert_array_equal(function(group), answer)\n\n    def test_regular(self):\n        v = pd.Series([101, 102, 103, 104, 105, 106])\n        g = pd.Series([\"a\", \"b\", \"a\", \"c\", \"d\", \"b\"])\n        answers = ([101, 204], [102, 208], [104], [105])\n\n        function = self.primitive().get_function()\n        for (_, group), answer in zip(v.groupby(g), answers):\n            np.testing.assert_array_equal(function(group), answer)\n\n\nclass TestCumMean:\n    primitive = CumMean\n\n    def test_order(self):\n        v = pd.Series([1, 2, 2])\n        g = pd.Series([\"a\", \"b\", \"a\"])\n\n        answers = ([1, 1.5], [2])\n\n        function = self.primitive().get_function()\n        for (_, group), answer in zip(v.groupby(g), answers):\n            np.testing.assert_array_equal(function(group), answer)\n\n    def test_regular(self):\n        v = pd.Series([101, 102, 103, 104, 105, 106])\n        g = pd.Series([\"a\", \"b\", \"a\", \"c\", \"d\", \"b\"])\n        answers = ([101, 102], [102, 104], [104], [105])\n\n        function = self.primitive().get_function()\n        for (_, group), answer in zip(v.groupby(g), answers):\n            np.testing.assert_array_equal(function(group), answer)\n\n\nclass TestCumMax:\n    primitive = CumMax\n\n    def test_order(self):\n        v = pd.Series([1, 2, 2])\n        g = pd.Series([\"a\", \"b\", \"a\"])\n\n        answers = ([1, 2], [2])\n\n        function = self.primitive().get_function()\n        for (_, group), answer in zip(v.groupby(g), answers):\n            np.testing.assert_array_equal(function(group), answer)\n\n    def test_regular(self):\n        v = pd.Series([101, 102, 103, 104, 105, 106])\n        g = pd.Series([\"a\", \"b\", \"a\", \"c\", \"d\", \"b\"])\n        answers = ([101, 103], [102, 106], [104], [105])\n\n        function = self.primitive().get_function()\n        for (_, group), answer in zip(v.groupby(g), answers):\n            
np.testing.assert_array_equal(function(group), answer)\n\n\nclass TestCumMin:\n    primitive = CumMin\n\n    def test_order(self):\n        v = pd.Series([1, 2, 2])\n        g = pd.Series([\"a\", \"b\", \"a\"])\n\n        answers = ([1, 1], [2])\n\n        function = self.primitive().get_function()\n        for (_, group), answer in zip(v.groupby(g), answers):\n            np.testing.assert_array_equal(function(group), answer)\n\n    def test_regular(self):\n        v = pd.Series([101, 102, 103, 104, 105, 106, 100])\n        g = pd.Series([\"a\", \"b\", \"a\", \"c\", \"d\", \"b\", \"a\"])\n        answers = ([101, 101, 100], [102, 102], [104], [105])\n\n        function = self.primitive().get_function()\n        for (_, group), answer in zip(v.groupby(g), answers):\n            np.testing.assert_array_equal(function(group), answer)\n\n\ndef test_cum_sum(es):\n    log_value_feat = IdentityFeature(es[\"log\"].ww[\"value\"])\n    dfeat = Feature(\n        IdentityFeature(es[\"sessions\"].ww[\"device_type\"]),\n        dataframe_name=\"log\",\n    )\n    cum_sum = Feature(log_value_feat, groupby=dfeat, primitive=CumSum)\n    features = [cum_sum]\n    df = calculate_feature_matrix(\n        entityset=es,\n        features=features,\n        instance_ids=range(15),\n    )\n    cvalues = df[cum_sum.get_name()].values\n    assert len(cvalues) == 15\n    cum_sum_values = [0, 5, 15, 30, 50, 0, 1, 3, 6, 6, 50, 55, 55, 62, 76]\n    for i, v in enumerate(cum_sum_values):\n        assert v == cvalues[i]\n\n\ndef test_cum_min(es):\n    log_value_feat = IdentityFeature(es[\"log\"].ww[\"value\"])\n    cum_min = Feature(\n        log_value_feat,\n        groupby=IdentityFeature(es[\"log\"].ww[\"session_id\"]),\n        primitive=CumMin,\n    )\n    features = [cum_min]\n    df = calculate_feature_matrix(\n        entityset=es,\n        features=features,\n        instance_ids=range(15),\n    )\n    cvalues = df[cum_min.get_name()].values\n    assert len(cvalues) == 15\n    
cum_min_values = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]\n    for i, v in enumerate(cum_min_values):\n        assert v == cvalues[i]\n\n\ndef test_cum_max(es):\n    log_value_feat = IdentityFeature(es[\"log\"].ww[\"value\"])\n    cum_max = Feature(\n        log_value_feat,\n        groupby=IdentityFeature(es[\"log\"].ww[\"session_id\"]),\n        primitive=CumMax,\n    )\n    features = [cum_max]\n    df = calculate_feature_matrix(\n        entityset=es,\n        features=features,\n        instance_ids=range(15),\n    )\n    cvalues = df[cum_max.get_name()].values\n    assert len(cvalues) == 15\n    cum_max_values = [0, 5, 10, 15, 20, 0, 1, 2, 3, 0, 0, 5, 0, 7, 14]\n    for i, v in enumerate(cum_max_values):\n        assert v == cvalues[i]\n\n\ndef test_cum_sum_group_on_nan(es):\n    log_value_feat = IdentityFeature(es[\"log\"].ww[\"value\"])\n    es[\"log\"][\"product_id\"] = (\n        [\"coke zero\"] * 3\n        + [\"car\"] * 2\n        + [\"toothpaste\"] * 3\n        + [\"brown bag\"] * 2\n        + [\"shoes\"]\n        + [np.nan] * 4\n        + [\"coke_zero\"] * 2\n    )\n    es[\"log\"][\"value\"][16] = 10\n    cum_sum = Feature(\n        log_value_feat,\n        groupby=IdentityFeature(es[\"log\"].ww[\"product_id\"]),\n        primitive=CumSum,\n    )\n    features = [cum_sum]\n    df = calculate_feature_matrix(\n        entityset=es,\n        features=features,\n        instance_ids=range(17),\n    )\n    cvalues = df[cum_sum.get_name()].values\n    assert len(cvalues) == 17\n    cum_sum_values = [\n        0,\n        5,\n        15,\n        15,\n        35,\n        0,\n        1,\n        3,\n        3,\n        3,\n        0,\n        np.nan,\n        np.nan,\n        np.nan,\n        np.nan,\n        np.nan,\n        10,\n    ]\n\n    assert len(cvalues) == len(cum_sum_values)\n    for i, v in enumerate(cum_sum_values):\n        if np.isnan(v):\n            assert np.isnan(cvalues[i])\n        else:\n            assert v == cvalues[i]\n\n\ndef 
test_cum_sum_numpy_group_on_nan(es):\n    class CumSumNumpy(TransformPrimitive):\n        \"\"\"Returns the cumulative sum after grouping\"\"\"\n\n        name = \"cum_sum\"\n        input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n        return_type = ColumnSchema(semantic_tags={\"numeric\"})\n        uses_full_dataframe = True\n\n        def get_function(self):\n            def cum_sum(values):\n                return values.cumsum().values\n\n            return cum_sum\n\n    log_value_feat = IdentityFeature(es[\"log\"].ww[\"value\"])\n    es[\"log\"][\"product_id\"] = (\n        [\"coke zero\"] * 3\n        + [\"car\"] * 2\n        + [\"toothpaste\"] * 3\n        + [\"brown bag\"] * 2\n        + [\"shoes\"]\n        + [np.nan] * 4\n        + [\"coke_zero\"] * 2\n    )\n    es[\"log\"][\"value\"][16] = 10\n    cum_sum = Feature(\n        log_value_feat,\n        groupby=IdentityFeature(es[\"log\"].ww[\"product_id\"]),\n        primitive=CumSumNumpy,\n    )\n    assert cum_sum.get_name() == \"CUM_SUM(value) by product_id\"\n    features = [cum_sum]\n    df = calculate_feature_matrix(\n        entityset=es,\n        features=features,\n        instance_ids=range(17),\n    )\n    cvalues = df[cum_sum.get_name()].values\n    assert len(cvalues) == 17\n    cum_sum_values = [\n        0,\n        5,\n        15,\n        15,\n        35,\n        0,\n        1,\n        3,\n        3,\n        3,\n        0,\n        np.nan,\n        np.nan,\n        np.nan,\n        np.nan,\n        np.nan,\n        10,\n    ]\n\n    assert len(cvalues) == len(cum_sum_values)\n    for i, v in enumerate(cum_sum_values):\n        if np.isnan(v):\n            assert np.isnan(cvalues[i])\n        else:\n            assert v == cvalues[i]\n\n\ndef test_cum_handles_uses_full_dataframe(es):\n    def check(feature):\n        feature_set = FeatureSet([feature])\n        calculator = FeatureSetCalculator(\n            es,\n            feature_set=feature_set,\n            
time_last=None,\n        )\n        df_1 = calculator.run(np.array([0, 1, 2]))\n        df_2 = calculator.run(np.array([2, 4]))\n\n        # check that the value for instance id 2 matches\n        assert (df_2.loc[2] == df_1.loc[2]).all()\n\n    for primitive in [CumSum, CumMean, CumMax, CumMin]:\n        check(\n            Feature(\n                es[\"log\"].ww[\"value\"],\n                groupby=IdentityFeature(es[\"log\"].ww[\"session_id\"]),\n                primitive=primitive,\n            ),\n        )\n\n    check(\n        Feature(\n            es[\"log\"].ww[\"product_id\"],\n            groupby=Feature(es[\"log\"].ww[\"product_id\"]),\n            primitive=CumCount,\n        ),\n    )\n\n\ndef test_cum_mean(es):\n    log_value_feat = IdentityFeature(es[\"log\"].ww[\"value\"])\n    cum_mean = Feature(\n        log_value_feat,\n        groupby=IdentityFeature(es[\"log\"].ww[\"session_id\"]),\n        primitive=CumMean,\n    )\n    features = [cum_mean]\n    df = calculate_feature_matrix(\n        entityset=es,\n        features=features,\n        instance_ids=range(15),\n    )\n    cvalues = df[cum_mean.get_name()].values\n    assert len(cvalues) == 15\n    cum_mean_values = [0, 2.5, 5, 7.5, 10, 0, 0.5, 1, 1.5, 0, 0, 2.5, 0, 3.5, 7]\n    for i, v in enumerate(cum_mean_values):\n        assert v == cvalues[i]\n\n\ndef test_cum_count(es):\n    cum_count = Feature(\n        IdentityFeature(es[\"log\"].ww[\"product_id\"]),\n        groupby=IdentityFeature(es[\"log\"].ww[\"product_id\"]),\n        primitive=CumCount,\n    )\n    features = [cum_count]\n    df = calculate_feature_matrix(\n        entityset=es,\n        features=features,\n        instance_ids=range(15),\n    )\n    cvalues = df[cum_count.get_name()].values\n    assert len(cvalues) == 15\n    cum_count_values = [1, 2, 3, 1, 2, 1, 2, 3, 1, 2, 1, 4, 5, 6, 7]\n    for i, v in enumerate(cum_count_values):\n        assert v == cvalues[i]\n\n\ndef test_rename(es):\n    cum_count = Feature(\n       
 IdentityFeature(es[\"log\"].ww[\"product_id\"]),\n        groupby=IdentityFeature(es[\"log\"].ww[\"product_id\"]),\n        primitive=CumCount,\n    )\n    copy_feat = cum_count.rename(\"rename_test\")\n    assert cum_count.unique_name() != copy_feat.unique_name()\n    assert cum_count.get_name() != copy_feat.get_name()\n    assert all(\n        [\n            x.generate_name() == y.generate_name()\n            for x, y in zip(cum_count.base_features, copy_feat.base_features)\n        ],\n    )\n    assert cum_count.dataframe_name == copy_feat.dataframe_name\n\n\ndef test_groupby_no_data(es):\n    cum_count = Feature(\n        IdentityFeature(es[\"log\"].ww[\"product_id\"]),\n        groupby=IdentityFeature(es[\"log\"].ww[\"product_id\"]),\n        primitive=CumCount,\n    )\n    last_feat = Feature(cum_count, parent_dataframe_name=\"customers\", primitive=Last)\n    df = calculate_feature_matrix(\n        entityset=es,\n        features=[last_feat],\n        cutoff_time=pd.Timestamp(\"2011-04-08\"),\n    )\n    cvalues = df[last_feat.get_name()].values\n    assert len(cvalues) == 2\n    assert all([pd.isnull(value) for value in cvalues])\n\n\ndef test_groupby_uses_calc_time(es):\n    def projected_amount_left(amount, timestamp, time=None):\n        # cumulative sum of amount, with timedelta *  constant subtracted\n        delta = time - timestamp\n        delta_seconds = delta / np.timedelta64(1, \"s\")\n        return amount.cumsum() - (delta_seconds)\n\n    class ProjectedAmountRemaining(TransformPrimitive):\n        name = \"projected_amount_remaining\"\n        uses_calc_time = True\n        input_types = [\n            ColumnSchema(semantic_tags={\"numeric\"}),\n            ColumnSchema(logical_type=Datetime, semantic_tags={\"time_index\"}),\n        ]\n        return_type = ColumnSchema(semantic_tags={\"numeric\"})\n        uses_full_dataframe = True\n\n        def get_function(self):\n            return projected_amount_left\n\n    time_since_product = 
GroupByTransformFeature(\n        [\n            IdentityFeature(es[\"log\"].ww[\"value\"]),\n            IdentityFeature(es[\"log\"].ww[\"datetime\"]),\n        ],\n        groupby=IdentityFeature(es[\"log\"].ww[\"product_id\"]),\n        primitive=ProjectedAmountRemaining,\n    )\n    df = calculate_feature_matrix(\n        entityset=es,\n        features=[time_since_product],\n        cutoff_time=pd.Timestamp(\"2011-04-10 11:10:30\"),\n    )\n    answers = [\n        -88830,\n        -88819,\n        -88803,\n        -88797,\n        -88771,\n        -88770,\n        -88760,\n        -88749,\n        -88740,\n        -88227,\n        -1830,\n        -1809,\n        -1750,\n        -1740,\n        -1723,\n        np.nan,\n        np.nan,\n    ]\n\n    for x, y in zip(df[time_since_product.get_name()], answers):\n        assert (pd.isnull(x) and pd.isnull(y)) or x == y\n\n\ndef test_groupby_multi_output_stacking(es):\n    class TestTime(TransformPrimitive):\n        name = \"test_time\"\n        input_types = [ColumnSchema(logical_type=Datetime)]\n        return_type = ColumnSchema(semantic_tags={\"numeric\"})\n        number_output_features = 6\n\n    fl = dfs(\n        entityset=es,\n        target_dataframe_name=\"sessions\",\n        agg_primitives=[\"sum\"],\n        groupby_trans_primitives=[TestTime],\n        features_only=True,\n        max_depth=4,\n    )\n\n    for i in range(6):\n        f = \"SUM(log.TEST_TIME(datetime)[%d] by product_id)\" % i\n        assert feature_with_name(fl, f)\n        assert (\"customers.SUM(log.TEST_TIME(datetime)[%d] by session_id)\" % i) in fl\n\n\ndef test_serialization(es):\n    value = IdentityFeature(es[\"log\"].ww[\"value\"])\n    zipcode = IdentityFeature(es[\"log\"].ww[\"zipcode\"])\n    primitive = CumSum()\n    groupby = feature_base.GroupByTransformFeature(value, primitive, zipcode)\n\n    dictionary = {\n        \"name\": \"CUM_SUM(value) by zipcode\",\n        \"base_features\": [value.unique_name()],\n        
\"primitive\": primitive,\n        \"groupby\": zipcode.unique_name(),\n    }\n\n    assert dictionary == groupby.get_arguments()\n    dependencies = {\n        value.unique_name(): value,\n        zipcode.unique_name(): zipcode,\n    }\n    assert groupby == feature_base.GroupByTransformFeature.from_dictionary(\n        dictionary,\n        es,\n        dependencies,\n        primitive,\n    )\n\n\ndef test_groupby_with_multioutput_primitive(es):\n    class MultiCumSum(TransformPrimitive):\n        name = \"multi_cum_sum\"\n        input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n        return_type = ColumnSchema(semantic_tags={\"numeric\"})\n        number_output_features = 3\n\n        def get_function(self):\n            def multi_cum_sum(x):\n                return x.cumsum(), x.cummax(), x.cummin()\n\n            return multi_cum_sum\n\n    fm, _ = dfs(\n        entityset=es,\n        target_dataframe_name=\"customers\",\n        trans_primitives=[],\n        agg_primitives=[],\n        groupby_trans_primitives=[MultiCumSum, CumSum, CumMax, CumMin],\n    )\n\n    # Calculate output in a separate DFS call to make sure the multi-output code\n    # does not alter any values\n    fm2, _ = dfs(\n        entityset=es,\n        target_dataframe_name=\"customers\",\n        trans_primitives=[],\n        agg_primitives=[],\n        groupby_trans_primitives=[CumSum, CumMax, CumMin],\n    )\n\n    answer_cols = [\n        [\"CUM_SUM(age) by cohort\", \"CUM_SUM(age) by région_id\"],\n        [\"CUM_MAX(age) by cohort\", \"CUM_MAX(age) by région_id\"],\n        [\"CUM_MIN(age) by cohort\", \"CUM_MIN(age) by région_id\"],\n    ]\n\n    for i in range(3):\n        # Check that multi-output gives correct answers\n        f = \"MULTI_CUM_SUM(age)[%d] by cohort\" % i\n        assert f in fm.columns\n        for x, y in zip(fm[f].values, fm[answer_cols[i][0]].values):\n            assert x == y\n        f = \"MULTI_CUM_SUM(age)[%d] by région_id\" % i\n        assert 
f in fm.columns\n        for x, y in zip(fm[f].values, fm[answer_cols[i][1]].values):\n            assert x == y\n        # Verify single output results are unchanged by inclusion of\n        # multi-output primitive\n        for x, y in zip(fm[answer_cols[i][0]], fm2[answer_cols[i][0]]):\n            assert x == y\n        for x, y in zip(fm[answer_cols[i][1]], fm2[answer_cols[i][1]]):\n            assert x == y\n\n\ndef test_groupby_with_multioutput_primitive_custom_names(es):\n    class MultiCumSum(TransformPrimitive):\n        name = \"multi_cum_sum\"\n        input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n        return_type = ColumnSchema(semantic_tags={\"numeric\"})\n        number_output_features = 3\n\n        def get_function(self):\n            def multi_cum_sum(x):\n                return x.cumsum(), x.cummax(), x.cummin()\n\n            return multi_cum_sum\n\n        def generate_names(primitive, base_feature_names):\n            return [\"CUSTOM_SUM\", \"CUSTOM_MAX\", \"CUSTOM_MIN\"]\n\n    fm, _ = dfs(\n        entityset=es,\n        target_dataframe_name=\"customers\",\n        trans_primitives=[],\n        agg_primitives=[],\n        groupby_trans_primitives=[MultiCumSum, CumSum, CumMax, CumMin],\n    )\n\n    answer_cols = [\n        [\"CUM_SUM(age) by cohort\", \"CUM_SUM(age) by région_id\"],\n        [\"CUM_MAX(age) by cohort\", \"CUM_MAX(age) by région_id\"],\n        [\"CUM_MIN(age) by cohort\", \"CUM_MIN(age) by région_id\"],\n    ]\n\n    expected_names = [\n        [\"CUSTOM_SUM by cohort\", \"CUSTOM_SUM by région_id\"],\n        [\"CUSTOM_MAX by cohort\", \"CUSTOM_MAX by région_id\"],\n        [\"CUSTOM_MIN by cohort\", \"CUSTOM_MIN by région_id\"],\n    ]\n\n    for i in range(3):\n        f = expected_names[i][0]\n        assert f in fm.columns\n        for x, y in zip(fm[f].values, fm[answer_cols[i][0]].values):\n            assert x == y\n        f = expected_names[i][1]\n        assert f in fm.columns\n        for x, y in 
zip(fm[f].values, fm[answer_cols[i][1]].values):\n            assert x == y\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/test_identity_features.py",
    "content": "from featuretools import IdentityFeature\nfrom featuretools.primitives.utils import PrimitivesDeserializer\n\n\ndef test_relationship_path(es):\n    value = IdentityFeature(es[\"log\"].ww[\"value\"])\n    assert len(value.relationship_path) == 0\n\n\ndef test_serialization(es):\n    value = IdentityFeature(es[\"log\"].ww[\"value\"])\n\n    dictionary = {\n        \"name\": \"value\",\n        \"column_name\": \"value\",\n        \"dataframe_name\": \"log\",\n    }\n\n    assert dictionary == value.get_arguments()\n    assert value == IdentityFeature.from_dictionary(\n        dictionary,\n        es,\n        {},\n        PrimitivesDeserializer,\n    )\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/test_overrides.py",
    "content": "from featuretools import Feature, calculate_feature_matrix\nfrom featuretools.primitives import (\n    AddNumeric,\n    AddNumericScalar,\n    Count,\n    DivideByFeature,\n    DivideNumeric,\n    DivideNumericScalar,\n    Equal,\n    EqualScalar,\n    GreaterThan,\n    GreaterThanEqualTo,\n    GreaterThanEqualToScalar,\n    GreaterThanScalar,\n    LessThan,\n    LessThanEqualTo,\n    LessThanEqualToScalar,\n    LessThanScalar,\n    ModuloByFeature,\n    ModuloNumeric,\n    ModuloNumericScalar,\n    MultiplyNumeric,\n    MultiplyNumericScalar,\n    Negate,\n    NotEqual,\n    NotEqualScalar,\n    ScalarSubtractNumericFeature,\n    SubtractNumeric,\n    SubtractNumericScalar,\n    Sum,\n)\n\n\ndef test_overrides(es):\n    value = Feature(es[\"log\"].ww[\"value\"])\n    value2 = Feature(es[\"log\"].ww[\"value_2\"])\n\n    feats = [\n        AddNumeric,\n        SubtractNumeric,\n        MultiplyNumeric,\n        DivideNumeric,\n        ModuloNumeric,\n        GreaterThan,\n        LessThan,\n        Equal,\n        NotEqual,\n        GreaterThanEqualTo,\n        LessThanEqualTo,\n    ]\n    assert Feature(value, primitive=Negate).unique_name() == (-value).unique_name()\n\n    compares = [(value, value), (value, value2)]\n    overrides = [\n        value + value,\n        value - value,\n        value * value,\n        value / value,\n        value % value,\n        value > value,\n        value < value,\n        value == value,\n        value != value,\n        value >= value,\n        value <= value,\n        value + value2,\n        value - value2,\n        value * value2,\n        value / value2,\n        value % value2,\n        value > value2,\n        value < value2,\n        value == value2,\n        value != value2,\n        value >= value2,\n        value <= value2,\n    ]\n\n    for left, right in compares:\n        for feat in feats:\n            f = Feature([left, right], primitive=feat)\n            o = overrides.pop(0)\n            
assert o.unique_name() == f.unique_name()\n\n\ndef test_override_boolean(es):\n    count = Feature(\n        es[\"log\"].ww[\"id\"],\n        parent_dataframe_name=\"sessions\",\n        primitive=Count,\n    )\n    count_lo = Feature(count, primitive=GreaterThanScalar(1))\n    count_hi = Feature(count, primitive=LessThanScalar(10))\n\n    to_test = [[True, True, True], [True, True, False], [False, False, True]]\n\n    features = []\n    features.append(count_lo.OR(count_hi))\n    features.append(count_lo.AND(count_hi))\n    features.append(~(count_lo.AND(count_hi)))\n\n    df = calculate_feature_matrix(\n        entityset=es,\n        features=features,\n        instance_ids=[0, 1, 2],\n    )\n    for i, test in enumerate(to_test):\n        v = df[features[i].get_name()].tolist()\n        assert v == test\n\n\ndef test_scalar_overrides(es):\n    value = Feature(es[\"log\"].ww[\"value\"])\n\n    feats = [\n        AddNumericScalar,\n        SubtractNumericScalar,\n        MultiplyNumericScalar,\n        DivideNumericScalar,\n        ModuloNumericScalar,\n        GreaterThanScalar,\n        LessThanScalar,\n        EqualScalar,\n        NotEqualScalar,\n        GreaterThanEqualToScalar,\n        LessThanEqualToScalar,\n    ]\n\n    overrides = [\n        value + 2,\n        value - 2,\n        value * 2,\n        value / 2,\n        value % 2,\n        value > 2,\n        value < 2,\n        value == 2,\n        value != 2,\n        value >= 2,\n        value <= 2,\n    ]\n\n    for feat in feats:\n        f = Feature(value, primitive=feat(2))\n        o = overrides.pop(0)\n        assert o.unique_name() == f.unique_name()\n\n    value2 = Feature(es[\"log\"].ww[\"value_2\"])\n\n    reverse_feats = [\n        AddNumericScalar,\n        ScalarSubtractNumericFeature,\n        MultiplyNumericScalar,\n        DivideByFeature,\n        ModuloByFeature,\n        GreaterThanScalar,\n        LessThanScalar,\n        EqualScalar,\n        NotEqualScalar,\n        
GreaterThanEqualToScalar,\n        LessThanEqualToScalar,\n    ]\n    reverse_overrides = [\n        2 + value2,\n        2 - value2,\n        2 * value2,\n        2 / value2,\n        2 % value2,\n        2 < value2,\n        2 > value2,\n        2 == value2,\n        2 != value2,\n        2 <= value2,\n        2 >= value2,\n    ]\n    for feat in reverse_feats:\n        f = Feature(value2, primitive=feat(2))\n        o = reverse_overrides.pop(0)\n        assert o.unique_name() == f.unique_name()\n\n\ndef test_override_cmp_from_column(es):\n    count_lo = Feature(es[\"log\"].ww[\"value\"]) > 1\n\n    to_test = [False, True, True]\n\n    features = [count_lo]\n\n    df = calculate_feature_matrix(\n        entityset=es,\n        features=features,\n        instance_ids=[0, 1, 2],\n    )\n    v = df[count_lo.get_name()].tolist()\n    for i, test in enumerate(to_test):\n        assert v[i] == test\n\n\ndef test_override_cmp(es):\n    count = Feature(\n        es[\"log\"].ww[\"id\"],\n        parent_dataframe_name=\"sessions\",\n        primitive=Count,\n    )\n    _sum = Feature(\n        es[\"log\"].ww[\"value\"],\n        parent_dataframe_name=\"sessions\",\n        primitive=Sum,\n    )\n    gt_lo = count > 1\n    gt_other = count > _sum\n    ge_lo = count >= 1\n    ge_other = count >= _sum\n    lt_hi = count < 10\n    lt_other = count < _sum\n    le_hi = count <= 10\n    le_other = count <= _sum\n    ne_lo = count != 1\n    ne_other = count != _sum\n\n    to_test = [\n        [True, True, False],\n        [False, False, True],\n        [True, True, True],\n        [False, False, True],\n        [True, True, True],\n        [True, True, False],\n        [True, True, True],\n        [True, True, False],\n    ]\n    features = [\n        gt_lo,\n        gt_other,\n        ge_lo,\n        ge_other,\n        lt_hi,\n        lt_other,\n        le_hi,\n        le_other,\n        ne_lo,\n        ne_other,\n    ]\n\n    df = calculate_feature_matrix(\n        
entityset=es,\n        features=features,\n        instance_ids=[0, 1, 2],\n    )\n    for i, test in enumerate(to_test):\n        v = df[features[i].get_name()].tolist()\n        assert v == test\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/test_primitive_base.py",
    "content": "from datetime import datetime\n\nimport numpy as np\nimport pandas as pd\nfrom pytest import raises\n\nfrom featuretools.primitives import Haversine, IsIn, IsNull, Max, TimeSinceLast\nfrom featuretools.primitives.base import TransformPrimitive\n\n\ndef test_call_agg():\n    primitive = Max()\n\n    # the assert is run twice on purpose\n    for _ in range(2):\n        assert 5 == primitive(range(6))\n\n\ndef test_call_trans():\n    primitive = IsNull()\n    for _ in range(2):\n        assert pd.Series([False] * 6).equals(primitive(range(6)))\n\n\ndef test_uses_calc_time():\n    primitive = TimeSinceLast()\n    primitive_h = TimeSinceLast(unit=\"hours\")\n    datetimes = pd.Series([datetime(2015, 6, 6), datetime(2015, 6, 7)])\n    answer = 86400.0\n    answer_h = 24.0\n    assert answer == primitive(datetimes, time=datetime(2015, 6, 8))\n    assert answer_h == primitive_h(datetimes, time=datetime(2015, 6, 8))\n\n\ndef test_call_multiple_args():\n    primitive = Haversine()\n    data1 = [(42.4, -71.1), (40.0, -122.4)]\n    data2 = [(40.0, -122.4), (41.2, -96.75)]\n    answer = [2631.231, 1343.289]\n\n    for _ in range(2):\n        assert np.round(primitive(data1, data2), 3).tolist() == answer\n\n\ndef test_get_function_called_once():\n    class TestPrimitive(TransformPrimitive):\n        def __init__(self):\n            self.get_function_call_count = 0\n\n        def get_function(self):\n            self.get_function_call_count += 1\n\n            def test(x):\n                return x\n\n            return test\n\n    primitive = TestPrimitive()\n\n    for _ in range(2):\n        primitive(range(6))\n\n    assert primitive.get_function_call_count == 1\n\n\ndef test_multiple_arg_string():\n    class Primitive(TransformPrimitive):\n        def __init__(self, bool=True, int=0, float=None):\n            self.bool = bool\n            self.int = int\n            self.float = float\n\n    primitive = Primitive(bool=True, int=4, float=0.1)\n    string = 
primitive.get_args_string()\n    assert string == \", int=4, float=0.1\"\n\n\ndef test_single_args_string():\n    assert IsIn([1, 2, 3]).get_args_string() == \", list_of_outputs=[1, 2, 3]\"\n\n\ndef test_args_string_default():\n    assert IsIn().get_args_string() == \"\"\n\n\ndef test_args_string_mixed():\n    class Primitive(TransformPrimitive):\n        def __init__(self, bool=True, int=0, float=None):\n            self.bool = bool\n            self.int = int\n            self.float = float\n\n    primitive = Primitive(bool=False, int=0)\n    string = primitive.get_args_string()\n    assert string == \", bool=False\"\n\n\ndef test_args_string_undefined():\n    string = Max().get_args_string()\n    assert string == \"\"\n\n\ndef test_args_string_error():\n    class Primitive(TransformPrimitive):\n        def __init__(self, bool=True, int=0, float=None):\n            pass\n\n    with raises(AssertionError, match=\"must be attribute\"):\n        Primitive(bool=True, int=4, float=0.1).get_args_string()\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/test_primitive_utils.py",
    "content": "import os\n\nimport pytest\n\nfrom featuretools import list_primitives, summarize_primitives\nfrom featuretools.primitives import (\n    AddNumericScalar,\n    Age,\n    Count,\n    Day,\n    Diff,\n    GreaterThan,\n    Haversine,\n    IsFreeEmailDomain,\n    IsNull,\n    Last,\n    Max,\n    Mean,\n    Min,\n    Mode,\n    Month,\n    MultiplyBoolean,\n    NMostCommon,\n    NumCharacters,\n    NumericLag,\n    NumUnique,\n    NumWords,\n    PercentTrue,\n    Skew,\n    Std,\n    Sum,\n    Weekday,\n    Year,\n    get_aggregation_primitives,\n    get_default_aggregation_primitives,\n    get_default_transform_primitives,\n    get_transform_primitives,\n)\nfrom featuretools.primitives.base import PrimitiveBase\nfrom featuretools.primitives.base.transform_primitive_base import TransformPrimitive\nfrom featuretools.primitives.utils import (\n    _check_input_types,\n    _get_descriptions,\n    _get_summary_primitives,\n    _get_unique_input_types,\n    list_primitive_files,\n    load_primitive_from_file,\n)\n\n\ndef test_list_primitives_order():\n    df = list_primitives()\n    all_primitives = get_transform_primitives()\n    all_primitives.update(get_aggregation_primitives())\n\n    for name, primitive in all_primitives.items():\n        assert name in df[\"name\"].values\n        row = df.loc[df[\"name\"] == name].iloc[0]\n        actual_desc = _get_descriptions([primitive])[0]\n        if actual_desc:\n            assert actual_desc == row[\"description\"]\n        assert row[\"valid_inputs\"] == \", \".join(\n            _get_unique_input_types(primitive.input_types),\n        )\n        expected_return_type = (\n            str(primitive.return_type) if primitive.return_type is not None else None\n        )\n        assert row[\"return_type\"] == expected_return_type\n\n    types = df[\"type\"].values\n    assert \"aggregation\" in types\n    assert \"transform\" in types\n\n\ndef test_valid_input_types():\n    actual = 
_get_unique_input_types(Haversine.input_types)\n    assert actual == {\"<ColumnSchema (Logical Type = LatLong)>\"}\n    actual = _get_unique_input_types(MultiplyBoolean.input_types)\n    assert actual == {\n        \"<ColumnSchema (Logical Type = Boolean)>\",\n        \"<ColumnSchema (Logical Type = BooleanNullable)>\",\n    }\n    actual = _get_unique_input_types(Sum.input_types)\n    assert actual == {\"<ColumnSchema (Semantic Tags = ['numeric'])>\"}\n\n\ndef test_descriptions():\n    primitives = {\n        NumCharacters: \"Calculates the number of characters in a given string, including whitespace and punctuation.\",\n        Day: \"Determines the day of the month from a datetime.\",\n        Last: \"Determines the last value in a list.\",\n        GreaterThan: \"Determines if values in one list are greater than another list.\",\n    }\n    assert _get_descriptions(list(primitives.keys())) == list(primitives.values())\n\n\ndef test_get_descriptions_doesnt_truncate_primitive_description():\n    # single line\n    descr = _get_descriptions([IsNull])\n    assert descr[0] == \"Determines if a value is null.\"\n\n    # multiple line; one sentence\n    descr = _get_descriptions([Diff])\n    assert (\n        descr[0]\n        == \"Computes the difference between the value in a list and the previous value in that list.\"\n    )\n\n    # multiple lines; multiple sentences\n    class TestPrimitive(TransformPrimitive):\n        \"\"\"This is text that continues on after the line break\n            and ends in a period.\n            This is text on one line without a period\n\n        Examples:\n            >>> absolute = Absolute()\n            >>> absolute([3.0, -5.0, -2.4]).tolist()\n            [3.0, 5.0, 2.4]\n        \"\"\"\n\n        name = \"test_primitive\"\n\n    descr = _get_descriptions([TestPrimitive])\n    assert (\n        descr[0]\n        == \"This is text that continues on after the line break and ends in a period. 
This is text on one line without a period\"\n    )\n\n    # docstring ends after description\n    class TestPrimitive2(TransformPrimitive):\n        \"\"\"This is text that continues on after the line break\n        and ends in a period.\n        This is text on one line without a period\n        \"\"\"\n\n        name = \"test_primitive\"\n\n    descr = _get_descriptions([TestPrimitive2])\n    assert (\n        descr[0]\n        == \"This is text that continues on after the line break and ends in a period. This is text on one line without a period\"\n    )\n\n\ndef test_get_default_aggregation_primitives():\n    primitives = get_default_aggregation_primitives()\n    expected_primitives = [\n        Sum,\n        Std,\n        Max,\n        Skew,\n        Min,\n        Mean,\n        Count,\n        PercentTrue,\n        NumUnique,\n        Mode,\n    ]\n    assert set(primitives) == set(expected_primitives)\n\n\ndef test_get_default_transform_primitives():\n    primitives = get_default_transform_primitives()\n    expected_primitives = [\n        Age,\n        Day,\n        Year,\n        Month,\n        Weekday,\n        Haversine,\n        NumWords,\n        NumCharacters,\n    ]\n    assert set(primitives) == set(expected_primitives)\n\n\n@pytest.fixture\ndef this_dir():\n    return os.path.dirname(os.path.abspath(__file__))\n\n\n@pytest.fixture\ndef primitives_to_install_dir(this_dir):\n    return os.path.join(this_dir, \"primitives_to_install\")\n\n\n@pytest.fixture\ndef bad_primitives_files_dir(this_dir):\n    return os.path.join(this_dir, \"bad_primitive_files\")\n\n\ndef test_list_primitive_files(primitives_to_install_dir):\n    files = list_primitive_files(primitives_to_install_dir)\n    custom_max_file = os.path.join(primitives_to_install_dir, \"custom_max.py\")\n    custom_mean_file = os.path.join(primitives_to_install_dir, \"custom_mean.py\")\n    custom_sum_file = os.path.join(primitives_to_install_dir, \"custom_sum.py\")\n    assert {custom_max_file, 
custom_mean_file, custom_sum_file}.issubset(set(files))\n\n\ndef test_load_primitive_from_file(primitives_to_install_dir):\n    primitive_file = os.path.join(primitives_to_install_dir, \"custom_max.py\")\n    primitive_name, primitive_obj = load_primitive_from_file(primitive_file)\n    assert issubclass(primitive_obj, PrimitiveBase)\n\n\ndef test_errors_more_than_one_primitive_in_file(bad_primitives_files_dir):\n    primitive_file = os.path.join(bad_primitives_files_dir, \"multiple_primitives.py\")\n    error_text = \"More than one primitive defined in file {}\".format(primitive_file)\n    with pytest.raises(RuntimeError) as excinfo:\n        load_primitive_from_file(primitive_file)\n    assert str(excinfo.value) == error_text\n\n\ndef test_errors_no_primitive_in_file(bad_primitives_files_dir):\n    primitive_file = os.path.join(bad_primitives_files_dir, \"no_primitives.py\")\n    error_text = \"No primitive defined in file {}\".format(primitive_file)\n    with pytest.raises(RuntimeError) as excinfo:\n        load_primitive_from_file(primitive_file)\n    assert str(excinfo.value) == error_text\n\n\ndef test_check_input_types():\n    primitives = [Sum, Weekday, PercentTrue, Day, Std, NumericLag]\n    log_in_type_checks = set()\n    sem_tag_type_checks = set()\n    unique_input_types = set()\n    expected_log_in_check = {\n        \"boolean_nullable\",\n        \"boolean\",\n        \"datetime\",\n    }\n    expected_sem_tag_type_check = {\"numeric\", \"time_index\"}\n    expected_unique_input_types = {\n        \"<ColumnSchema (Logical Type = BooleanNullable)>\",\n        \"<ColumnSchema (Semantic Tags = ['numeric'])>\",\n        \"<ColumnSchema (Logical Type = Boolean)>\",\n        \"<ColumnSchema (Logical Type = Datetime)>\",\n        \"<ColumnSchema (Semantic Tags = ['time_index'])>\",\n    }\n    for prim in primitives:\n        input_types_flattened = prim.flatten_nested_input_types(prim.input_types)\n        _check_input_types(\n            
input_types_flattened,\n            log_in_type_checks,\n            sem_tag_type_checks,\n            unique_input_types,\n        )\n\n    assert log_in_type_checks == expected_log_in_check\n    assert sem_tag_type_checks == expected_sem_tag_type_check\n    assert unique_input_types == expected_unique_input_types\n\n\ndef test_get_summary_primitives():\n    primitives = [\n        Sum,\n        Weekday,\n        PercentTrue,\n        Day,\n        Std,\n        NumericLag,\n        AddNumericScalar,\n        IsFreeEmailDomain,\n        NMostCommon,\n    ]\n    primitives_summary = _get_summary_primitives(primitives)\n    expected_unique_input_types = 7\n    expected_unique_output_types = 6\n    expected_uses_multi_input = 2\n    expected_uses_multi_output = 1\n    expected_uses_external_data = 1\n    expected_controllable = 3\n    expected_datetime_inputs = 2\n    expected_bool = 1\n    expected_bool_nullable = 1\n    expected_time_index_tag = 1\n\n    assert (\n        primitives_summary[\"general_metrics\"][\"unique_input_types\"]\n        == expected_unique_input_types\n    )\n    assert (\n        primitives_summary[\"general_metrics\"][\"unique_output_types\"]\n        == expected_unique_output_types\n    )\n    assert (\n        primitives_summary[\"general_metrics\"][\"uses_multi_input\"]\n        == expected_uses_multi_input\n    )\n    assert (\n        primitives_summary[\"general_metrics\"][\"uses_multi_output\"]\n        == expected_uses_multi_output\n    )\n    assert (\n        primitives_summary[\"general_metrics\"][\"uses_external_data\"]\n        == expected_uses_external_data\n    )\n    assert (\n        primitives_summary[\"general_metrics\"][\"are_controllable\"]\n        == expected_controllable\n    )\n    assert (\n        primitives_summary[\"semantic_tag_metrics\"][\"time_index\"]\n        == expected_time_index_tag\n    )\n    assert (\n        primitives_summary[\"logical_type_input_metrics\"][\"datetime\"]\n        == 
expected_datetime_inputs\n    )\n    assert primitives_summary[\"logical_type_input_metrics\"][\"boolean\"] == expected_bool\n    assert (\n        primitives_summary[\"logical_type_input_metrics\"][\"boolean_nullable\"]\n        == expected_bool_nullable\n    )\n\n\ndef test_summarize_primitives():\n    df = summarize_primitives()\n    trans_prims = get_transform_primitives()\n    agg_prims = get_aggregation_primitives()\n    tot_trans = len(trans_prims)\n    tot_agg = len(agg_prims)\n    tot_prims = tot_trans + tot_agg\n\n    assert df[\"Count\"].iloc[0] == tot_prims\n    assert df[\"Count\"].iloc[1] == tot_agg\n    assert df[\"Count\"].iloc[2] == tot_trans\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/test_rolling_primitive_utils.py",
    "content": "from unittest.mock import patch\n\nimport numpy as np\nimport pandas as pd\nimport pytest\n\nfrom featuretools.primitives import (\n    RollingCount,\n    RollingMax,\n    RollingMean,\n    RollingMin,\n    RollingSTD,\n    RollingTrend,\n)\nfrom featuretools.primitives.standard.transform.time_series.utils import (\n    _get_rolled_series_without_gap,\n    apply_roll_with_offset_gap,\n    roll_series_with_gap,\n)\nfrom featuretools.tests.primitive_tests.utils import get_number_from_offset\n\n\ndef test_get_rolled_series_without_gap(window_series):\n    # Data is daily, so number of rows should be number of days not included in the gap\n    assert len(_get_rolled_series_without_gap(window_series, \"11D\")) == 9\n    assert len(_get_rolled_series_without_gap(window_series, \"0D\")) == 20\n    assert len(_get_rolled_series_without_gap(window_series, \"48H\")) == 18\n    assert len(_get_rolled_series_without_gap(window_series, \"4H\")) == 19\n\n\ndef test_get_rolled_series_without_gap_not_uniform(window_series):\n    non_uniform_series = window_series.iloc[[0, 2, 5, 6, 8, 9]]\n\n    assert len(_get_rolled_series_without_gap(non_uniform_series, \"10D\")) == 0\n    assert len(_get_rolled_series_without_gap(non_uniform_series, \"0D\")) == 6\n    assert len(_get_rolled_series_without_gap(non_uniform_series, \"48H\")) == 4\n    assert len(_get_rolled_series_without_gap(non_uniform_series, \"4H\")) == 5\n    assert len(_get_rolled_series_without_gap(non_uniform_series, \"4D\")) == 3\n    assert len(_get_rolled_series_without_gap(non_uniform_series, \"4D2H\")) == 2\n\n\ndef test_get_rolled_series_without_gap_empty_series(window_series):\n    empty_series = pd.Series([], dtype=\"object\")\n    assert len(_get_rolled_series_without_gap(empty_series, \"1D\")) == 0\n    assert len(_get_rolled_series_without_gap(empty_series, \"0D\")) == 0\n\n\ndef test_get_rolled_series_without_gap_large_bound(window_series):\n    assert 
len(_get_rolled_series_without_gap(window_series, \"100D\")) == 0\n    assert (\n        len(\n            _get_rolled_series_without_gap(\n                window_series.iloc[[0, 2, 5, 6, 8, 9]],\n                \"20D\",\n            ),\n        )\n        == 0\n    )\n\n\n@pytest.mark.parametrize(\n    \"window_length, gap\",\n    [\n        (3, 2),\n        (3, 4),  # gap larger than window\n        (2, 0),  # gap explicitly set to 0\n        (\"3d\", \"2d\"),  # using offset aliases\n        (\"3d\", \"4d\"),\n        (\"4d\", \"0d\"),\n    ],\n)\ndef test_roll_series_with_gap(window_length, gap, window_series):\n    rolling_max = roll_series_with_gap(\n        window_series,\n        window_length,\n        gap=gap,\n        min_periods=1,\n    ).max()\n    rolling_min = roll_series_with_gap(\n        window_series,\n        window_length,\n        gap=gap,\n        min_periods=1,\n    ).min()\n\n    assert len(rolling_max) == len(window_series)\n    assert len(rolling_min) == len(window_series)\n\n    gap_num = get_number_from_offset(gap)\n    window_length_num = get_number_from_offset(window_length)\n    for i in range(len(window_series)):\n        start_idx = i - gap_num - window_length_num + 1\n\n        if isinstance(gap, str):\n            # No gap functionality is happening, so gap isn't taken account in the end index\n            # it's like the gap is 0; it includes the row itself\n            end_idx = i\n        else:\n            end_idx = i - gap_num\n\n        # If start and end are negative, they're entirely before\n        if start_idx < 0 and end_idx < 0:\n            assert pd.isnull(rolling_max.iloc[i])\n            assert pd.isnull(rolling_min.iloc[i])\n            continue\n\n        if start_idx < 0:\n            start_idx = 0\n\n        # Because the row values are a range from 0 to 20, the rolling min will be the start index\n        # and the rolling max will be the end idx\n        assert rolling_min.iloc[i] == start_idx\n        
assert rolling_max.iloc[i] == end_idx\n\n\n@pytest.mark.parametrize(\"window_length\", [3, \"3d\"])\ndef test_roll_series_with_no_gap(window_length, window_series):\n    actual_rolling = roll_series_with_gap(\n        window_series,\n        window_length,\n        gap=0,\n        min_periods=1,\n    ).mean()\n    expected_rolling = window_series.rolling(window_length, min_periods=1).mean()\n\n    pd.testing.assert_series_equal(actual_rolling, expected_rolling)\n\n\n@pytest.mark.parametrize(\n    \"window_length, gap\",\n    [\n        (6, 2),\n        (6, 0),  # No gap - changes early values\n        (\"6d\", \"0d\"),  # Uses offset aliases\n        (\"6d\", \"2d\"),\n    ],\n)\ndef test_roll_series_with_gap_early_values(window_length, gap, window_series):\n    gap_num = get_number_from_offset(gap)\n    window_length_num = get_number_from_offset(window_length)\n\n    # Default min periods is 1 - will include all\n    default_partial_values = roll_series_with_gap(\n        window_series,\n        window_length,\n        gap=gap,\n        min_periods=1,\n    ).count()\n    num_empty_aggregates = len(default_partial_values.loc[default_partial_values == 0])\n    num_partial_aggregates = len(\n        (default_partial_values.loc[default_partial_values != 0]).loc[\n            default_partial_values < window_length_num\n        ],\n    )\n\n    assert num_partial_aggregates == window_length_num - 1\n    if isinstance(gap, str):\n        # gap isn't handled, so we'll always at least include the row itself\n        assert num_empty_aggregates == 0\n    else:\n        assert num_empty_aggregates == gap_num\n\n    # Make min periods the size of the window\n    no_partial_values = roll_series_with_gap(\n        window_series,\n        window_length,\n        gap=gap,\n        min_periods=window_length_num,\n    ).count()\n    num_null_aggregates = len(no_partial_values.loc[pd.isna(no_partial_values)])\n    num_partial_aggregates = len(\n        
no_partial_values.loc[no_partial_values < window_length_num],\n    )\n\n    # because we shift, gap is included as nan values in the series.\n    # Count treats nans in a window as values that don't get counted,\n    # so the gap rows get included in the count for whether a window has \"min periods\".\n    # This is different than max, for example, which does not count nans in a window as values towards \"min periods\"\n    assert num_null_aggregates == window_length_num - 1\n    if isinstance(gap, str):\n        # gap isn't handled, so we'll never have any partial aggregates\n        assert num_partial_aggregates == 0\n    else:\n        assert num_partial_aggregates == gap_num\n\n\ndef test_roll_series_with_gap_nullable_types(window_series):\n    window_length = 3\n    gap = 2\n    min_periods = 1\n    # Because we're inserting nans, confirm that nullability of the dtype doesn't have an impact on the results\n    nullable_series = window_series.astype(\"Int64\")\n    non_nullable_series = window_series.astype(\"int64\")\n\n    nullable_rolling_max = roll_series_with_gap(\n        nullable_series,\n        window_length,\n        gap=gap,\n        min_periods=min_periods,\n    ).max()\n    non_nullable_rolling_max = roll_series_with_gap(\n        non_nullable_series,\n        window_length,\n        gap=gap,\n        min_periods=min_periods,\n    ).max()\n\n    pd.testing.assert_series_equal(nullable_rolling_max, non_nullable_rolling_max)\n\n\ndef test_roll_series_with_gap_nullable_types_with_nans(window_series):\n    window_length = 3\n    gap = 2\n    min_periods = 1\n    nullable_floats = window_series.astype(\"float64\").replace(\n        {1: np.nan, 3: np.nan},\n    )\n    nullable_ints = nullable_floats.astype(\"Int64\")\n\n    nullable_ints_rolling_max = roll_series_with_gap(\n        nullable_ints,\n        window_length,\n        gap=gap,\n        min_periods=min_periods,\n    ).max()\n    nullable_floats_rolling_max = roll_series_with_gap(\n        
nullable_floats,\n        window_length,\n        gap=gap,\n        min_periods=min_periods,\n    ).max()\n\n    pd.testing.assert_series_equal(\n        nullable_ints_rolling_max,\n        nullable_floats_rolling_max,\n    )\n\n    expected_early_values = [np.nan, np.nan, 0, 0, 2, 2, 4] + list(\n        range(7 - gap, len(window_series) - gap),\n    )\n    for i in range(len(window_series)):\n        actual = nullable_floats_rolling_max.iloc[i]\n        expected = expected_early_values[i]\n\n        if pd.isnull(actual):\n            assert pd.isnull(expected)\n        else:\n            assert actual == expected\n\n\n@pytest.mark.parametrize(\n    \"window_length, gap\",\n    [\n        (\"3d\", \"2d\"),\n        (\"3d\", \"4d\"),\n        (\"4d\", \"0d\"),\n    ],\n)\ndef test_apply_roll_with_offset_gap(window_length, gap, window_series):\n    def max_wrapper(sub_s):\n        return apply_roll_with_offset_gap(sub_s, gap, max, min_periods=1)\n\n    rolling_max_obj = roll_series_with_gap(\n        window_series,\n        window_length,\n        gap=gap,\n        min_periods=1,\n    )\n    rolling_max_series = rolling_max_obj.apply(max_wrapper)\n\n    def min_wrapper(sub_s):\n        return apply_roll_with_offset_gap(sub_s, gap, min, min_periods=1)\n\n    rolling_min_obj = roll_series_with_gap(\n        window_series,\n        window_length,\n        gap=gap,\n        min_periods=1,\n    )\n    rolling_min_series = rolling_min_obj.apply(min_wrapper)\n\n    assert len(rolling_max_series) == len(window_series)\n    assert len(rolling_min_series) == len(window_series)\n\n    gap_num = get_number_from_offset(gap)\n    window_length_num = get_number_from_offset(window_length)\n    for i in range(len(window_series)):\n        start_idx = i - gap_num - window_length_num + 1\n        # Now that we have the _apply call, this acts as expected\n        end_idx = i - gap_num\n\n        # If start and end are negative, they're entirely before\n        if start_idx < 0 and 
end_idx < 0:\n            assert pd.isnull(rolling_max_series.iloc[i])\n            assert pd.isnull(rolling_min_series.iloc[i])\n            continue\n\n        if start_idx < 0:\n            start_idx = 0\n\n        # Because the row values are a range from 0 to 20, the rolling min will be the start index\n        # and the rolling max will be the end idx\n        assert rolling_min_series.iloc[i] == start_idx\n        assert rolling_max_series.iloc[i] == end_idx\n\n\n@pytest.mark.parametrize(\n    \"min_periods\",\n    [1, 0, None],\n)\ndef test_apply_roll_with_offset_gap_default_min_periods(min_periods, window_series):\n    window_length = \"5d\"\n    window_length_num = 5\n    gap = \"3d\"\n    gap_num = 3\n\n    def count_wrapper(sub_s):\n        return apply_roll_with_offset_gap(sub_s, gap, len, min_periods=min_periods)\n\n    rolling_count_obj = roll_series_with_gap(\n        window_series,\n        window_length,\n        gap=gap,\n        min_periods=min_periods,\n    )\n    rolling_count_series = rolling_count_obj.apply(count_wrapper)\n\n    # gap essentially creates a rolling series that has no elements; which should be nan\n    # to differentiate from when a window only has null values\n    num_empty_aggregates = rolling_count_series.isna().sum()\n    num_partial_aggregates = len(\n        (rolling_count_series.loc[rolling_count_series != 0]).loc[\n            rolling_count_series < window_length_num\n        ],\n    )\n\n    assert num_empty_aggregates == gap_num\n    assert num_partial_aggregates == window_length_num - 1\n\n\n@pytest.mark.parametrize(\n    \"min_periods\",\n    [2, 3, 4, 5],\n)\ndef test_apply_roll_with_offset_gap_min_periods(min_periods, window_series):\n    window_length = \"5d\"\n    window_length_num = 5\n    gap = \"3d\"\n    gap_num = 3\n\n    def count_wrapper(sub_s):\n        return apply_roll_with_offset_gap(sub_s, gap, len, min_periods=min_periods)\n\n    rolling_count_obj = roll_series_with_gap(\n        window_series,\n   
     window_length,\n        gap=gap,\n        min_periods=min_periods,\n    )\n    rolling_count_series = rolling_count_obj.apply(count_wrapper)\n\n    # gap essentially creates rolling series that have no elements; which should be nan\n    # to differentiate from when a window only has null values\n    num_empty_aggregates = rolling_count_series.isna().sum()\n    num_partial_aggregates = len(\n        (rolling_count_series.loc[rolling_count_series != 0]).loc[\n            rolling_count_series < window_length_num\n        ],\n    )\n\n    assert num_empty_aggregates == min_periods - 1 + gap_num\n    assert num_partial_aggregates == window_length_num - min_periods\n\n\ndef test_apply_roll_with_offset_gap_non_uniform():\n    window_length = \"3d\"\n    gap = \"3d\"\n    min_periods = 1\n    # When the data isn't uniform, this impacts the number of values in each rolling window\n    datetimes = (\n        list(pd.date_range(start=\"2017-01-01\", freq=\"1d\", periods=7))\n        + list(pd.date_range(start=\"2017-02-01\", freq=\"2d\", periods=7))\n        + list(pd.date_range(start=\"2017-03-01\", freq=\"1d\", periods=7))\n    )\n    no_freq_series = pd.Series(range(len(datetimes)), index=datetimes)\n\n    assert pd.infer_freq(no_freq_series.index) is None\n\n    expected_series = pd.Series(\n        [None, None, None, 1, 2, 3, 3]\n        + [None, None, 1, 1, 1, 1, 1]\n        + [None, None, None, 1, 2, 3, 3],\n        index=datetimes,\n    )\n\n    def count_wrapper(sub_s):\n        return apply_roll_with_offset_gap(sub_s, gap, len, min_periods=min_periods)\n\n    rolling_count_obj = roll_series_with_gap(\n        no_freq_series,\n        window_length,\n        gap=gap,\n        min_periods=min_periods,\n    )\n    rolling_count_series = rolling_count_obj.apply(count_wrapper)\n\n    pd.testing.assert_series_equal(rolling_count_series, expected_series)\n\n\ndef test_apply_roll_with_offset_data_frequency_higher_than_parameters_frequency():\n    window_length = \"5D\" 
 # 120 hours\n    window_length_num = 5\n    # In order for min periods to be the length of the window, we multiply 24hours*5\n    min_periods = window_length_num * 24\n\n    datetimes = list(pd.date_range(start=\"2017-01-01\", freq=\"1H\", periods=200))\n    high_frequency_series = pd.Series(range(200), index=datetimes)\n\n    # Check without gap\n    gap = \"0d\"\n    gap_num = 0\n\n    def max_wrapper(sub_s):\n        return apply_roll_with_offset_gap(sub_s, gap, max, min_periods=min_periods)\n\n    rolling_max_obj = roll_series_with_gap(\n        high_frequency_series,\n        window_length,\n        min_periods=min_periods,\n        gap=gap,\n    )\n    rolling_max_series = rolling_max_obj.apply(max_wrapper)\n\n    assert rolling_max_series.isna().sum() == (min_periods - 1) + gap_num\n\n    # Check with small gap\n    gap = \"3H\"\n    gap_num = 3\n\n    def max_wrapper(sub_s):\n        return apply_roll_with_offset_gap(sub_s, gap, max, min_periods=min_periods)\n\n    rolling_max_obj = roll_series_with_gap(\n        high_frequency_series,\n        window_length,\n        min_periods=min_periods,\n        gap=gap,\n    )\n    rolling_max_series = rolling_max_obj.apply(max_wrapper)\n\n    assert rolling_max_series.isna().sum() == (min_periods - 1) + gap_num\n\n    # Check with large gap - in terms of days, so we'll multiply by 24hours for number of nans\n    gap = \"2D\"\n    gap_num = 2\n\n    def max_wrapper(sub_s):\n        return apply_roll_with_offset_gap(sub_s, gap, max, min_periods=min_periods)\n\n    rolling_max_obj = roll_series_with_gap(\n        high_frequency_series,\n        window_length,\n        min_periods=min_periods,\n        gap=gap,\n    )\n    rolling_max_series = rolling_max_obj.apply(max_wrapper)\n\n    assert rolling_max_series.isna().sum() == (min_periods - 1) + (gap_num * 24)\n\n\ndef test_apply_roll_with_offset_data_min_periods_too_big(window_series):\n    window_length = \"5D\"\n    gap = \"2d\"\n\n    # Since the data has a daily 
frequency, there will only be, at most, 5 rows in the window\n    min_periods = 6\n\n    def max_wrapper(sub_s):\n        return apply_roll_with_offset_gap(sub_s, gap, max, min_periods=min_periods)\n\n    rolling_max_obj = roll_series_with_gap(\n        window_series,\n        window_length,\n        min_periods=min_periods,\n        gap=gap,\n    )\n    rolling_max_series = rolling_max_obj.apply(max_wrapper)\n\n    # The resulting series is comprised entirely of nans\n    assert rolling_max_series.isna().sum() == len(window_series)\n\n\ndef test_roll_series_with_gap_different_input_types_same_result_uniform(\n    window_series,\n):\n    # Offset inputs will only produce the same results as numeric inputs\n    # when the data has a uniform frequency\n    offset_gap = \"2d\"\n    offset_window_length = \"5d\"\n    int_gap = 2\n    int_window_length = 5\n    min_periods = 1\n\n    # Rolling series' with matching input types\n    expected_rolling_numeric = roll_series_with_gap(\n        window_series,\n        window_length=int_window_length,\n        gap=int_gap,\n        min_periods=min_periods,\n    ).max()\n\n    def count_wrapper(sub_s):\n        return apply_roll_with_offset_gap(\n            sub_s,\n            offset_gap,\n            max,\n            min_periods=min_periods,\n        )\n\n    rolling_count_obj = roll_series_with_gap(\n        window_series,\n        window_length=offset_window_length,\n        gap=offset_gap,\n        min_periods=min_periods,\n    )\n    expected_rolling_offset = rolling_count_obj.apply(count_wrapper)\n\n    # confirm that the offset and gap results are equal to one another\n    pd.testing.assert_series_equal(expected_rolling_numeric, expected_rolling_offset)\n\n    # Rolling series' with mismatched input types\n    mismatched_numeric_gap = roll_series_with_gap(\n        window_series,\n        window_length=offset_window_length,\n        gap=int_gap,\n        min_periods=min_periods,\n    ).max()\n    # Confirm the 
mismatched results also produce the same results\n    pd.testing.assert_series_equal(expected_rolling_numeric, mismatched_numeric_gap)\n\n\ndef test_roll_series_with_gap_incorrect_types(window_series):\n    error = \"Window length must be either an offset string or an integer.\"\n    with pytest.raises(TypeError, match=error):\n        (\n            roll_series_with_gap(\n                window_series,\n                window_length=4.2,\n                gap=4,\n                min_periods=1,\n            ),\n        )\n\n    error = \"Gap must be either an offset string or an integer.\"\n    with pytest.raises(TypeError, match=error):\n        roll_series_with_gap(window_series, window_length=4, gap=4.2, min_periods=1)\n\n\ndef test_roll_series_with_gap_negative_inputs(window_series):\n    error = \"Window length must be greater than zero.\"\n    with pytest.raises(ValueError, match=error):\n        roll_series_with_gap(window_series, window_length=-4, gap=4, min_periods=1)\n\n    error = \"Gap must be greater than or equal to zero.\"\n    with pytest.raises(ValueError, match=error):\n        roll_series_with_gap(window_series, window_length=4, gap=-4, min_periods=1)\n\n\ndef test_roll_series_with_non_offset_string_inputs(window_series):\n    error = \"Cannot roll series. The specified gap, test, is not a valid offset alias.\"\n    with pytest.raises(ValueError, match=error):\n        roll_series_with_gap(\n            window_series,\n            window_length=\"4D\",\n            gap=\"test\",\n            min_periods=1,\n        )\n\n    error = \"Cannot roll series. 
The specified window length, test, is not a valid offset alias.\"\n    with pytest.raises(ValueError, match=error):\n        roll_series_with_gap(\n            window_series,\n            window_length=\"test\",\n            gap=\"7D\",\n            min_periods=1,\n        )\n\n    # Test mismatched types error\n    error = (\n        \"Cannot roll series with offset gap, 2d, and numeric window length, 7. \"\n        \"If an offset alias is used for gap, the window length must also be defined as an offset alias. \"\n        \"Please either change gap to be numeric or change window length to be an offset alias.\"\n    )\n    with pytest.raises(TypeError, match=error):\n        roll_series_with_gap(\n            window_series,\n            window_length=7,\n            gap=\"2d\",\n            min_periods=1,\n        ).max()\n\n\n@pytest.mark.parametrize(\n    \"primitive\",\n    [RollingCount, RollingMax, RollingMin, RollingMean, RollingSTD, RollingTrend],\n)\n@patch(\n    \"featuretools.primitives.standard.transform.time_series.utils.apply_roll_with_offset_gap\",\n)\ndef test_no_call_to_apply_roll_with_offset_gap_with_numeric(\n    mock_apply_roll,\n    primitive,\n    window_series,\n):\n    assert not mock_apply_roll.called\n\n    fully_numeric_primitive = primitive(window_length=3, gap=1)\n    primitive_func = fully_numeric_primitive.get_function()\n    if isinstance(fully_numeric_primitive, RollingCount):\n        pd.Series(primitive_func(window_series.index))\n    else:\n        pd.Series(\n            primitive_func(\n                window_series.index,\n                pd.Series(window_series.values),\n            ),\n        )\n\n    assert not mock_apply_roll.called\n\n    offset_window_primitive = primitive(window_length=\"3d\", gap=1)\n    primitive_func = offset_window_primitive.get_function()\n    if isinstance(offset_window_primitive, RollingCount):\n        pd.Series(primitive_func(window_series.index))\n    else:\n        pd.Series(\n            
primitive_func(\n                window_series.index,\n                pd.Series(window_series.values),\n            ),\n        )\n\n    assert not mock_apply_roll.called\n\n    no_gap_specified_primitive = primitive(window_length=\"3d\")\n    primitive_func = no_gap_specified_primitive.get_function()\n    if isinstance(no_gap_specified_primitive, RollingCount):\n        pd.Series(primitive_func(window_series.index))\n    else:\n        pd.Series(\n            primitive_func(\n                window_series.index,\n                pd.Series(window_series.values),\n            ),\n        )\n\n    assert not mock_apply_roll.called\n\n    no_gap_specified_primitive = primitive(window_length=\"3d\", gap=\"1d\")\n    primitive_func = no_gap_specified_primitive.get_function()\n    if isinstance(no_gap_specified_primitive, RollingCount):\n        pd.Series(primitive_func(window_series.index))\n    else:\n        pd.Series(\n            primitive_func(\n                window_series.index,\n                pd.Series(window_series.values),\n            ),\n        )\n\n    assert mock_apply_roll.called\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/test_transform_features.py",
    "content": "from inspect import isclass\n\nimport numpy as np\nimport pandas as pd\nimport pytest\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import (\n    Boolean,\n    BooleanNullable,\n    Categorical,\n    Datetime,\n    Double,\n    Integer,\n    IntegerNullable,\n)\n\nfrom featuretools import (\n    AggregationFeature,\n    EntitySet,\n    Feature,\n    IdentityFeature,\n    TransformFeature,\n    calculate_feature_matrix,\n    dfs,\n    primitives,\n)\nfrom featuretools.computational_backends.feature_set import FeatureSet\nfrom featuretools.computational_backends.feature_set_calculator import (\n    FeatureSetCalculator,\n)\nfrom featuretools.primitives import (\n    Absolute,\n    AddNumeric,\n    AddNumericScalar,\n    Age,\n    Count,\n    Day,\n    Diff,\n    DiffDatetime,\n    DivideByFeature,\n    DivideNumeric,\n    DivideNumericScalar,\n    Equal,\n    EqualScalar,\n    FileExtension,\n    First,\n    FullNameToFirstName,\n    FullNameToLastName,\n    FullNameToTitle,\n    GreaterThan,\n    GreaterThanEqualTo,\n    GreaterThanEqualToScalar,\n    GreaterThanScalar,\n    Haversine,\n    Hour,\n    IsIn,\n    IsNull,\n    Lag,\n    Latitude,\n    LessThan,\n    LessThanEqualTo,\n    LessThanEqualToScalar,\n    LessThanScalar,\n    Longitude,\n    Mode,\n    MultiplyBoolean,\n    MultiplyNumeric,\n    MultiplyNumericBoolean,\n    MultiplyNumericScalar,\n    Not,\n    NotEqual,\n    NotEqualScalar,\n    NumCharacters,\n    NumericLag,\n    NumWords,\n    Percentile,\n    ScalarSubtractNumericFeature,\n    SubtractNumeric,\n    SubtractNumericScalar,\n    Sum,\n    TimeSince,\n    TransformPrimitive,\n    get_transform_primitives,\n)\nfrom featuretools.synthesis.deep_feature_synthesis import match\n\n\ndef test_init_and_name(es):\n    log = es[\"log\"]\n    rating = Feature(IdentityFeature(es[\"products\"].ww[\"rating\"]), \"log\")\n    log_features = [Feature(es[\"log\"].ww[col]) for col in log.columns] + [\n        
Feature(rating, primitive=GreaterThanScalar(2.5)),\n        Feature(rating, primitive=GreaterThanScalar(3.5)),\n    ]\n    # Add Timedelta feature\n    # features.append(pd.Timestamp.now() - Feature(log['datetime']))\n    customers_features = [\n        Feature(es[\"customers\"].ww[col]) for col in es[\"customers\"].columns\n    ]\n\n    # check all transform primitives have a name\n    for attribute_string in dir(primitives):\n        attr = getattr(primitives, attribute_string)\n        if isclass(attr):\n            if issubclass(attr, TransformPrimitive) and attr != TransformPrimitive:\n                assert getattr(attr, \"name\") is not None\n\n    trans_primitives = get_transform_primitives().values()\n\n    for transform_prim in trans_primitives:\n        # skip automated testing for a few special cases\n        features_to_use = log_features\n        if transform_prim in [NotEqual, Equal, FileExtension]:\n            continue\n        if transform_prim in [\n            Age,\n            FullNameToFirstName,\n            FullNameToLastName,\n            FullNameToTitle,\n        ]:\n            features_to_use = customers_features\n\n        # use the input_types matching function from DFS\n        input_types = transform_prim.input_types\n        if isinstance(input_types[0], list):\n            matching_inputs = match(input_types[0], features_to_use)\n        else:\n            matching_inputs = match(input_types, features_to_use)\n        if len(matching_inputs) == 0:\n            raise Exception(\"Transform Primitive %s not tested\" % transform_prim.name)\n        for prim in matching_inputs:\n            instance = Feature(prim, primitive=transform_prim)\n\n            # try to get name and calculate\n            instance.get_name()\n            calculate_feature_matrix([instance], entityset=es)\n\n\ndef test_relationship_path(es):\n    f = TransformFeature(Feature(es[\"log\"].ww[\"datetime\"]), Hour)\n\n    assert len(f.relationship_path) == 
0\n\n\ndef test_serialization(es):\n    value = IdentityFeature(es[\"log\"].ww[\"value\"])\n    primitive = MultiplyNumericScalar(value=2)\n    value_x2 = TransformFeature(value, primitive)\n\n    dictionary = {\n        \"name\": value_x2.get_name(),\n        \"base_features\": [value.unique_name()],\n        \"primitive\": primitive,\n    }\n\n    assert dictionary == value_x2.get_arguments()\n    assert value_x2 == TransformFeature.from_dictionary(\n        dictionary,\n        es,\n        {value.unique_name(): value},\n        primitive,\n    )\n\n\ndef test_make_trans_feat(es):\n    f = Feature(es[\"log\"].ww[\"datetime\"], primitive=Hour)\n\n    feature_set = FeatureSet([f])\n    calculator = FeatureSetCalculator(es, feature_set=feature_set)\n    df = calculator.run(np.array([0]))\n    v = df[f.get_name()][0]\n    assert v == 10\n\n\n@pytest.fixture\ndef simple_es():\n    df = pd.DataFrame(\n        {\n            \"id\": range(4),\n            \"value\": pd.Categorical([\"a\", \"c\", \"b\", \"d\"]),\n            \"value2\": pd.Categorical([\"a\", \"b\", \"a\", \"d\"]),\n            \"object\": [\"time1\", \"time2\", \"time3\", \"time4\"],\n            \"datetime\": pd.Series(\n                [\n                    pd.Timestamp(\"2001-01-01\"),\n                    pd.Timestamp(\"2001-01-02\"),\n                    pd.Timestamp(\"2001-01-03\"),\n                    pd.Timestamp(\"2001-01-04\"),\n                ],\n            ),\n        },\n    )\n\n    es = EntitySet(\"equal_test\")\n    es.add_dataframe(dataframe_name=\"values\", dataframe=df, index=\"id\")\n\n    return es\n\n\ndef test_equal_categorical(simple_es):\n    f1 = Feature(\n        [\n            IdentityFeature(simple_es[\"values\"].ww[\"value\"]),\n            IdentityFeature(simple_es[\"values\"].ww[\"value2\"]),\n        ],\n        primitive=Equal,\n    )\n\n    df = calculate_feature_matrix(entityset=simple_es, features=[f1])\n    assert 
set(simple_es[\"values\"][\"value\"].cat.categories) != set(\n        simple_es[\"values\"][\"value2\"].cat.categories,\n    )\n    assert df[\"value = value2\"].to_list() == [\n        True,\n        False,\n        False,\n        True,\n    ]\n\n\ndef test_equal_different_dtypes(simple_es):\n    f1 = Feature(\n        [\n            IdentityFeature(simple_es[\"values\"].ww[\"object\"]),\n            IdentityFeature(simple_es[\"values\"].ww[\"datetime\"]),\n        ],\n        primitive=Equal,\n    )\n    f2 = Feature(\n        [\n            IdentityFeature(simple_es[\"values\"].ww[\"datetime\"]),\n            IdentityFeature(simple_es[\"values\"].ww[\"object\"]),\n        ],\n        primitive=Equal,\n    )\n\n    # verify that equals works for different dtypes regardless of order\n    df = calculate_feature_matrix(entityset=simple_es, features=[f1, f2])\n\n    assert df[\"object = datetime\"].to_list() == [False, False, False, False]\n    assert df[\"datetime = object\"].to_list() == [False, False, False, False]\n\n\ndef test_not_equal_categorical(simple_es):\n    f1 = Feature(\n        [\n            IdentityFeature(simple_es[\"values\"].ww[\"value\"]),\n            IdentityFeature(simple_es[\"values\"].ww[\"value2\"]),\n        ],\n        primitive=NotEqual,\n    )\n\n    df = calculate_feature_matrix(entityset=simple_es, features=[f1])\n\n    assert set(simple_es[\"values\"][\"value\"].cat.categories) != set(\n        simple_es[\"values\"][\"value2\"].cat.categories,\n    )\n    assert df[\"value != value2\"].to_list() == [\n        False,\n        True,\n        True,\n        False,\n    ]\n\n\ndef test_not_equal_different_dtypes(simple_es):\n    f1 = Feature(\n        [\n            IdentityFeature(simple_es[\"values\"].ww[\"object\"]),\n            IdentityFeature(simple_es[\"values\"].ww[\"datetime\"]),\n        ],\n        primitive=NotEqual,\n    )\n    f2 = Feature(\n        [\n            IdentityFeature(simple_es[\"values\"].ww[\"datetime\"]),\n  
          IdentityFeature(simple_es[\"values\"].ww[\"object\"]),\n        ],\n        primitive=NotEqual,\n    )\n\n    # verify that equals works for different dtypes regardless of order\n    df = calculate_feature_matrix(entityset=simple_es, features=[f1, f2])\n\n    assert df[\"object != datetime\"].to_list() == [True, True, True, True]\n    assert df[\"datetime != object\"].to_list() == [True, True, True, True]\n\n\ndef test_diff(es):\n    value = Feature(es[\"log\"].ww[\"value\"])\n    customer_id_feat = Feature(es[\"sessions\"].ww[\"customer_id\"], \"log\")\n    diff1 = Feature(\n        value,\n        groupby=Feature(es[\"log\"].ww[\"session_id\"]),\n        primitive=Diff,\n    )\n    diff2 = Feature(value, groupby=customer_id_feat, primitive=Diff)\n\n    feature_set = FeatureSet([diff1, diff2])\n    calculator = FeatureSetCalculator(es, feature_set=feature_set)\n    df = calculator.run(np.array(range(15)))\n\n    val1 = df[diff1.get_name()].tolist()\n    val2 = df[diff2.get_name()].tolist()\n\n    correct_vals1 = [\n        np.nan,\n        5,\n        5,\n        5,\n        5,\n        np.nan,\n        1,\n        1,\n        1,\n        np.nan,\n        np.nan,\n        5,\n        np.nan,\n        7,\n        7,\n    ]\n    correct_vals2 = [np.nan, 5, 5, 5, 5, -20, 1, 1, 1, -3, np.nan, 5, -5, 7, 7]\n    np.testing.assert_equal(val1, correct_vals1)\n    np.testing.assert_equal(val2, correct_vals2)\n\n\ndef test_diff_shift(es):\n    value = Feature(es[\"log\"].ww[\"value\"])\n    customer_id_feat = Feature(es[\"sessions\"].ww[\"customer_id\"], \"log\")\n    diff_periods = Feature(value, groupby=customer_id_feat, primitive=Diff(periods=1))\n\n    feature_set = FeatureSet([diff_periods])\n    calculator = FeatureSetCalculator(es, feature_set=feature_set)\n    df = calculator.run(np.array(range(15)))\n    val3 = df[diff_periods.get_name()].tolist()\n\n    correct_vals3 = [np.nan, np.nan, 5, 5, 5, 5, -20, 1, 1, 1, np.nan, np.nan, 5, -5, 7]\n    
np.testing.assert_equal(val3, correct_vals3)\n\n\ndef test_diff_single_value(es):\n    diff = Feature(\n        es[\"stores\"].ww[\"num_square_feet\"],\n        groupby=Feature(es[\"stores\"].ww[\"région_id\"]),\n        primitive=Diff,\n    )\n    feature_set = FeatureSet([diff])\n    calculator = FeatureSetCalculator(es, feature_set=feature_set)\n    df = calculator.run(np.array([4]))\n    assert df[diff.get_name()][4] == 6000.0\n\n\ndef test_diff_reordered(es):\n    sum_feat = Feature(\n        es[\"log\"].ww[\"value\"],\n        parent_dataframe_name=\"sessions\",\n        primitive=Sum,\n    )\n    diff = Feature(sum_feat, primitive=Diff)\n    feature_set = FeatureSet([diff])\n    calculator = FeatureSetCalculator(es, feature_set=feature_set)\n    df = calculator.run(np.array([4, 2]))\n    assert df[diff.get_name()][4] == 16\n    assert df[diff.get_name()][2] == -6\n\n\ndef test_diff_single_value_is_nan(es):\n    diff = Feature(\n        es[\"stores\"].ww[\"num_square_feet\"],\n        groupby=Feature(es[\"stores\"].ww[\"région_id\"]),\n        primitive=Diff,\n    )\n    feature_set = FeatureSet([diff])\n    calculator = FeatureSetCalculator(es, feature_set=feature_set)\n    df = calculator.run(np.array([5]))\n    assert df.shape[0] == 1\n    assert df[diff.get_name()].dropna().shape[0] == 0\n\n\ndef test_diff_datetime(es):\n    diff = Feature(\n        es[\"log\"].ww[\"datetime\"],\n        primitive=DiffDatetime,\n    )\n    feature_set = FeatureSet([diff])\n    calculator = FeatureSetCalculator(es, feature_set=feature_set)\n    df = calculator.run(np.array(range(15)))\n    vals = pd.Series(df[diff.get_name()].tolist())\n    expected_vals = pd.Series(\n        [\n            pd.NaT,\n            pd.Timedelta(seconds=6),\n            pd.Timedelta(seconds=6),\n            pd.Timedelta(seconds=6),\n            pd.Timedelta(seconds=6),\n            pd.Timedelta(seconds=36),\n            pd.Timedelta(seconds=9),\n            pd.Timedelta(seconds=9),\n            
pd.Timedelta(seconds=9),\n            pd.Timedelta(minutes=8, seconds=33),\n            pd.Timedelta(days=1),\n            pd.Timedelta(seconds=1),\n            pd.Timedelta(seconds=59),\n            pd.Timedelta(seconds=3),\n            pd.Timedelta(seconds=3),\n        ],\n    )\n    pd.testing.assert_series_equal(vals, expected_vals)\n\n\ndef test_diff_datetime_shift(es):\n    diff = Feature(\n        es[\"log\"].ww[\"datetime\"],\n        primitive=DiffDatetime(periods=1),\n    )\n    feature_set = FeatureSet([diff])\n    calculator = FeatureSetCalculator(es, feature_set=feature_set)\n    df = calculator.run(np.array(range(6)))\n    vals = pd.Series(df[diff.get_name()].tolist())\n    expected_vals = pd.Series(\n        [\n            pd.NaT,\n            pd.NaT,\n            pd.Timedelta(seconds=6),\n            pd.Timedelta(seconds=6),\n            pd.Timedelta(seconds=6),\n            pd.Timedelta(seconds=6),\n        ],\n    )\n    pd.testing.assert_series_equal(vals, expected_vals)\n\n\ndef test_compare_of_identity(es):\n    to_test = [\n        (EqualScalar, [False, False, True, False]),\n        (NotEqualScalar, [True, True, False, True]),\n        (LessThanScalar, [True, True, False, False]),\n        (LessThanEqualToScalar, [True, True, True, False]),\n        (GreaterThanScalar, [False, False, False, True]),\n        (GreaterThanEqualToScalar, [False, False, True, True]),\n    ]\n\n    features = []\n    for test in to_test:\n        features.append(Feature(es[\"log\"].ww[\"value\"], primitive=test[0](10)))\n\n    df = calculate_feature_matrix(\n        entityset=es,\n        features=features,\n        instance_ids=[0, 1, 2, 3],\n    )\n\n    for i, test in enumerate(to_test):\n        v = df[features[i].get_name()].tolist()\n        assert v == test[1]\n\n\ndef test_compare_of_direct(es):\n    log_rating = Feature(es[\"products\"].ww[\"rating\"], \"log\")\n    to_test = [\n        (EqualScalar, [False, False, False, False]),\n        (NotEqualScalar, 
[True, True, True, True]),\n        (LessThanScalar, [False, False, False, True]),\n        (LessThanEqualToScalar, [False, False, False, True]),\n        (GreaterThanScalar, [True, True, True, False]),\n        (GreaterThanEqualToScalar, [True, True, True, False]),\n    ]\n\n    features = []\n    for test in to_test:\n        features.append(Feature(log_rating, primitive=test[0](4.5)))\n\n    df = calculate_feature_matrix(\n        entityset=es,\n        features=features,\n        instance_ids=[0, 1, 2, 3],\n    )\n\n    for i, test in enumerate(to_test):\n        v = df[features[i].get_name()].tolist()\n        assert v == test[1]\n\n\ndef test_compare_of_transform(es):\n    day = Feature(es[\"log\"].ww[\"datetime\"], primitive=Day)\n    to_test = [\n        (EqualScalar, [False, True]),\n        (NotEqualScalar, [True, False]),\n    ]\n\n    features = []\n    for test in to_test:\n        features.append(Feature(day, primitive=test[0](10)))\n\n    df = calculate_feature_matrix(entityset=es, features=features, instance_ids=[0, 14])\n\n    for i, test in enumerate(to_test):\n        v = df[features[i].get_name()].tolist()\n        assert v == test[1]\n\n\ndef test_compare_of_agg(es):\n    count_logs = Feature(\n        es[\"log\"].ww[\"id\"],\n        parent_dataframe_name=\"sessions\",\n        primitive=Count,\n    )\n\n    to_test = [\n        (EqualScalar, [False, False, False, True]),\n        (NotEqualScalar, [True, True, True, False]),\n        (LessThanScalar, [False, False, True, False]),\n        (LessThanEqualToScalar, [False, False, True, True]),\n        (GreaterThanScalar, [True, True, False, False]),\n        (GreaterThanEqualToScalar, [True, True, False, True]),\n    ]\n\n    features = []\n    for test in to_test:\n        features.append(Feature(count_logs, primitive=test[0](2)))\n\n    df = calculate_feature_matrix(\n        entityset=es,\n        features=features,\n        instance_ids=[0, 1, 2, 3],\n    )\n\n    for i, test in 
enumerate(to_test):\n        v = df[features[i].get_name()].tolist()\n        assert v == test[1]\n\n\ndef test_compare_all_nans(es):\n    nan_feat = Feature(\n        es[\"log\"].ww[\"product_id\"],\n        parent_dataframe_name=\"sessions\",\n        primitive=Mode,\n    )\n    compare = nan_feat == \"brown bag\"\n\n    # before all data\n    time_last = pd.Timestamp(\"1/1/1993\")\n\n    df = calculate_feature_matrix(\n        entityset=es,\n        features=[nan_feat, compare],\n        instance_ids=[0, 1, 2],\n        cutoff_time=time_last,\n    )\n\n    assert df[nan_feat.get_name()].dropna().shape[0] == 0\n    assert not df[compare.get_name()].any()\n\n\ndef test_arithmetic_of_val(es):\n    to_test = [\n        (AddNumericScalar, [2.0, 7.0, 12.0, 17.0]),\n        (SubtractNumericScalar, [-2.0, 3.0, 8.0, 13.0]),\n        (ScalarSubtractNumericFeature, [2.0, -3.0, -8.0, -13.0]),\n        (MultiplyNumericScalar, [0, 10, 20, 30]),\n        (DivideNumericScalar, [0, 2.5, 5, 7.5]),\n        (DivideByFeature, [np.inf, 0.4, 0.2, 2 / 15.0]),\n    ]\n\n    features = []\n    for test in to_test:\n        features.append(Feature(es[\"log\"].ww[\"value\"], primitive=test[0](2)))\n\n    features.append(Feature(es[\"log\"].ww[\"value\"]) / 0)\n\n    df = calculate_feature_matrix(\n        entityset=es,\n        features=features,\n        instance_ids=[0, 1, 2, 3],\n    )\n\n    for f, test in zip(features, to_test):\n        v = df[f.get_name()].tolist()\n        assert v == test[1]\n\n    test = [np.nan, np.inf, np.inf, np.inf]\n    v = df[features[-1].get_name()].tolist()\n    assert np.isnan(v[0])\n    assert v[1:] == test[1:]\n\n\ndef test_arithmetic_two_vals_fails(es):\n    error_text = \"Not a feature\"\n    with pytest.raises(Exception, match=error_text):\n        Feature([2, 2], primitive=AddNumeric)\n\n\ndef test_arithmetic_of_identity(es):\n    to_test = [\n        (AddNumeric, [0.0, 7.0, 14.0, 21.0]),\n        (SubtractNumeric, [0, 3, 6, 9]),\n        
(MultiplyNumeric, [0, 10, 40, 90]),\n        (DivideNumeric, [np.nan, 2.5, 2.5, 2.5]),\n    ]\n\n    features = []\n    for test in to_test:\n        features.append(\n            Feature(\n                [\n                    Feature(es[\"log\"].ww[\"value\"]),\n                    Feature(es[\"log\"].ww[\"value_2\"]),\n                ],\n                primitive=test[0],\n            ),\n        )\n\n    df = calculate_feature_matrix(\n        entityset=es,\n        features=features,\n        instance_ids=[0, 1, 2, 3],\n    )\n\n    for i, test in enumerate(to_test[:-1]):\n        v = df[features[i].get_name()].tolist()\n        assert v == test[1]\n    i, test = -1, to_test[-1]\n    v = df[features[i].get_name()].tolist()\n    assert np.isnan(v[0])\n    assert v[1:] == test[1][1:]\n\n\ndef test_arithmetic_of_direct(es):\n    rating = Feature(es[\"products\"].ww[\"rating\"])\n    log_rating = Feature(rating, \"log\")\n    customer_age = Feature(es[\"customers\"].ww[\"age\"])\n    session_age = Feature(customer_age, \"sessions\")\n    log_age = Feature(session_age, \"log\")\n\n    to_test = [\n        (AddNumeric, [38, 37, 37.5, 37.5]),\n        (SubtractNumeric, [28, 29, 28.5, 28.5]),\n        (MultiplyNumeric, [165, 132, 148.5, 148.5]),\n        (DivideNumeric, [6.6, 8.25, 22.0 / 3, 22.0 / 3]),\n    ]\n\n    features = []\n    for test in to_test:\n        features.append(Feature([log_age, log_rating], primitive=test[0]))\n\n    df = calculate_feature_matrix(\n        entityset=es,\n        features=features,\n        instance_ids=[0, 3, 5, 7],\n    )\n\n    for i, test in enumerate(to_test):\n        v = df[features[i].get_name()].tolist()\n        assert v == test[1]\n\n\n@pytest.fixture\ndef boolean_mult_es():\n    es = EntitySet()\n    df = pd.DataFrame(\n        {\n            \"index\": [0, 1, 2],\n            \"bool\": pd.Series([True, False, True]),\n            \"numeric\": [2, 3, np.nan],\n        },\n    )\n\n    es.add_dataframe(\n        
dataframe_name=\"test\",\n        dataframe=df,\n        index=\"index\",\n        logical_types={\"numeric\": Double},\n    )\n\n    return es\n\n\ndef test_boolean_multiply(boolean_mult_es):\n    es = boolean_mult_es\n    to_test = [\n        (\"numeric\", \"numeric\"),\n        (\"numeric\", \"bool\"),\n        (\"bool\", \"numeric\"),\n        (\"bool\", \"bool\"),\n    ]\n    features = []\n    for row in to_test:\n        features.append(Feature(es[\"test\"].ww[row[0]]) * Feature(es[\"test\"].ww[row[1]]))\n\n    fm = calculate_feature_matrix(entityset=es, features=features)\n\n    df = es[\"test\"]\n\n    for row in to_test:\n        col_name = \"{} * {}\".format(row[0], row[1])\n        if row[0] == \"bool\" and row[1] == \"bool\":\n            assert fm[col_name].equals((df[row[0]] & df[row[1]]).astype(\"boolean\"))\n        else:\n            assert fm[col_name].equals(df[row[0]] * df[row[1]])\n\n\ndef test_arithmetic_of_transform(es):\n    diff1 = Feature([Feature(es[\"log\"].ww[\"value\"])], primitive=Diff)\n    diff2 = Feature([Feature(es[\"log\"].ww[\"value_2\"])], primitive=Diff)\n\n    to_test = [\n        (AddNumeric, [np.nan, 7.0, -7.0, 10.0]),\n        (SubtractNumeric, [np.nan, 3.0, -3.0, 4.0]),\n        (MultiplyNumeric, [np.nan, 10.0, 10.0, 21.0]),\n        (DivideNumeric, [np.nan, 2.5, 2.5, 2.3333333333333335]),\n    ]\n\n    features = []\n    for test in to_test:\n        features.append(Feature([diff1, diff2], primitive=test[0]()))\n\n    feature_set = FeatureSet(features)\n    calculator = FeatureSetCalculator(es, feature_set=feature_set)\n    df = calculator.run(np.array([0, 2, 12, 13]))\n    for i, test in enumerate(to_test):\n        v = df[features[i].get_name()].tolist()\n        assert np.isnan(v.pop(0))\n        assert np.isnan(test[1].pop(0))\n        assert v == test[1]\n\n\ndef test_not_feature(es):\n    not_feat = Feature(es[\"customers\"].ww[\"loves_ice_cream\"], primitive=Not)\n    features = [not_feat]\n    df = 
calculate_feature_matrix(entityset=es, features=features, instance_ids=[0, 1])\n    v = df[not_feat.get_name()].values\n    assert not v[0]\n    assert v[1]\n\n\ndef test_arithmetic_of_agg(es):\n    customer_id_feat = Feature(es[\"customers\"].ww[\"id\"])\n    store_id_feat = Feature(es[\"stores\"].ww[\"id\"])\n    count_customer = Feature(\n        customer_id_feat,\n        parent_dataframe_name=\"régions\",\n        primitive=Count,\n    )\n    count_stores = Feature(\n        store_id_feat,\n        parent_dataframe_name=\"régions\",\n        primitive=Count,\n    )\n    to_test = [\n        (AddNumeric, [6, 2]),\n        (SubtractNumeric, [0, -2]),\n        (MultiplyNumeric, [9, 0]),\n        (DivideNumeric, [1, 0]),\n    ]\n\n    features = []\n    for test in to_test:\n        features.append(Feature([count_customer, count_stores], primitive=test[0]()))\n\n    ids = [\"United States\", \"Mexico\"]\n    df = calculate_feature_matrix(entityset=es, features=features, instance_ids=ids)\n    df = df.loc[ids]\n\n    for i, test in enumerate(to_test):\n        v = df[features[i].get_name()].tolist()\n        assert v == test[1]\n\n\ndef test_latlong(es):\n    log_latlong_feat = Feature(es[\"log\"].ww[\"latlong\"])\n    latitude = Feature(log_latlong_feat, primitive=Latitude)\n    longitude = Feature(log_latlong_feat, primitive=Longitude)\n    features = [latitude, longitude]\n    df = calculate_feature_matrix(\n        entityset=es,\n        features=features,\n        instance_ids=range(15),\n    )\n    latvalues = df[latitude.get_name()].values\n    lonvalues = df[longitude.get_name()].values\n    assert len(latvalues) == 15\n    assert len(lonvalues) == 15\n    real_lats = [0, 5, 10, 15, 20, 0, 1, 2, 3, 0, 0, 5, 0, 7, 14]\n    real_lons = [0, 2, 4, 6, 8, 0, 1, 2, 3, 0, 0, 2, 0, 3, 6]\n    for (\n        i,\n        v,\n    ) in enumerate(real_lats):\n        assert v == latvalues[i]\n    for (\n        i,\n        v,\n    ) in enumerate(real_lons):\n        
assert v == lonvalues[i]\n\n\ndef test_latlong_with_nan(es):\n    df = es[\"log\"]\n    df[\"latlong\"][0] = np.nan\n    df[\"latlong\"][1] = (10, np.nan)\n    df[\"latlong\"][2] = (np.nan, 4)\n    df[\"latlong\"][3] = (np.nan, np.nan)\n    es.replace_dataframe(dataframe_name=\"log\", df=df)\n    log_latlong_feat = Feature(es[\"log\"].ww[\"latlong\"])\n    latitude = Feature(log_latlong_feat, primitive=Latitude)\n    longitude = Feature(log_latlong_feat, primitive=Longitude)\n    features = [latitude, longitude]\n    fm = calculate_feature_matrix(entityset=es, features=features)\n    latvalues = fm[latitude.get_name()].values\n    lonvalues = fm[longitude.get_name()].values\n    assert len(latvalues) == 17\n    assert len(lonvalues) == 17\n    real_lats = [\n        np.nan,\n        10,\n        np.nan,\n        np.nan,\n        20,\n        0,\n        1,\n        2,\n        3,\n        0,\n        0,\n        5,\n        0,\n        7,\n        14,\n        np.nan,\n        np.nan,\n    ]\n    real_lons = [\n        np.nan,\n        np.nan,\n        4,\n        np.nan,\n        8,\n        0,\n        1,\n        2,\n        3,\n        0,\n        0,\n        2,\n        0,\n        3,\n        6,\n        np.nan,\n        np.nan,\n    ]\n    assert np.allclose(latvalues, real_lats, atol=0.0001, equal_nan=True)\n    assert np.allclose(lonvalues, real_lons, atol=0.0001, equal_nan=True)\n\n\ndef test_haversine(es):\n    log_latlong_feat = Feature(es[\"log\"].ww[\"latlong\"])\n    log_latlong_feat2 = Feature(es[\"log\"].ww[\"latlong2\"])\n    haversine = Feature([log_latlong_feat, log_latlong_feat2], primitive=Haversine)\n    features = [haversine]\n\n    df = calculate_feature_matrix(\n        entityset=es,\n        features=features,\n        instance_ids=range(15),\n    )\n    values = df[haversine.get_name()].values\n    real = [\n        0,\n        525.318462,\n        1045.32190304,\n        1554.56176802,\n        2047.3294327,\n        0,\n        
138.16578931,\n        276.20524822,\n        413.99185444,\n        0,\n        0,\n        525.318462,\n        0,\n        741.57941183,\n        1467.52760175,\n    ]\n    assert len(values) == 15\n    assert np.allclose(values, real, atol=0.0001)\n\n    haversine = Feature(\n        [log_latlong_feat, log_latlong_feat2],\n        primitive=Haversine(unit=\"kilometers\"),\n    )\n    features = [haversine]\n    df = calculate_feature_matrix(\n        entityset=es,\n        features=features,\n        instance_ids=range(15),\n    )\n    values = df[haversine.get_name()].values\n    real_km = [\n        0,\n        845.41812212,\n        1682.2825471,\n        2501.82467535,\n        3294.85736668,\n        0,\n        222.35628593,\n        444.50926278,\n        666.25531268,\n        0,\n        0,\n        845.41812212,\n        0,\n        1193.45638714,\n        2361.75676089,\n    ]\n    assert len(values) == 15\n    assert np.allclose(values, real_km, atol=0.0001)\n    error_text = \"Invalid unit inches provided. 
Must be one of\"\n    with pytest.raises(ValueError, match=error_text):\n        Haversine(unit=\"inches\")\n\n\ndef test_haversine_with_nan(es):\n    # Check some `nan` values\n    df = es[\"log\"]\n    df[\"latlong\"][0] = np.nan\n    df[\"latlong\"][1] = (10, np.nan)\n    es.replace_dataframe(dataframe_name=\"log\", df=df)\n    log_latlong_feat = Feature(es[\"log\"].ww[\"latlong\"])\n    log_latlong_feat2 = Feature(es[\"log\"].ww[\"latlong2\"])\n    haversine = Feature([log_latlong_feat, log_latlong_feat2], primitive=Haversine)\n    features = [haversine]\n\n    df = calculate_feature_matrix(entityset=es, features=features)\n    values = df[haversine.get_name()].values\n    real = [\n        np.nan,\n        np.nan,\n        1045.32190304,\n        1554.56176802,\n        2047.3294327,\n        0,\n        138.16578931,\n        276.20524822,\n        413.99185444,\n        0,\n        0,\n        525.318462,\n        0,\n        741.57941183,\n        1467.52760175,\n        np.nan,\n        np.nan,\n    ]\n\n    assert np.allclose(values, real, atol=0.0001, equal_nan=True)\n\n    # Check all `nan` values\n    df = es[\"log\"]\n    df[\"latlong2\"] = np.nan\n    es.replace_dataframe(dataframe_name=\"log\", df=df)\n    log_latlong_feat = Feature(es[\"log\"].ww[\"latlong\"])\n    log_latlong_feat2 = Feature(es[\"log\"].ww[\"latlong2\"])\n    haversine = Feature([log_latlong_feat, log_latlong_feat2], primitive=Haversine)\n    features = [haversine]\n\n    df = calculate_feature_matrix(entityset=es, features=features)\n    values = df[haversine.get_name()].values\n    real = [np.nan] * es[\"log\"].shape[0]\n\n    assert np.allclose(values, real, atol=0.0001, equal_nan=True)\n\n\ndef test_text_primitives(es):\n    words = Feature(es[\"log\"].ww[\"comments\"], primitive=NumWords)\n    chars = Feature(es[\"log\"].ww[\"comments\"], primitive=NumCharacters)\n\n    features = [words, chars]\n\n    df = calculate_feature_matrix(\n        entityset=es,\n        
features=features,\n        instance_ids=range(15),\n    )\n\n    word_counts = [532, 3, 3, 653, 1306, 1305, 174, 173, 79, 246, 1253, 3, 3, 3, 3]\n    char_counts = [\n        3392,\n        10,\n        10,\n        4116,\n        7961,\n        7580,\n        992,\n        957,\n        437,\n        1325,\n        6322,\n        10,\n        10,\n        10,\n        10,\n    ]\n    word_values = df[words.get_name()].values\n    char_values = df[chars.get_name()].values\n    assert len(word_values) == 15\n    for i, v in enumerate(word_values):\n        assert v == word_counts[i]\n    for i, v in enumerate(char_values):\n        assert v == char_counts[i]\n\n\ndef test_isin_feat(es):\n    isin = Feature(\n        es[\"log\"].ww[\"product_id\"],\n        primitive=IsIn(list_of_outputs=[\"toothpaste\", \"coke zero\"]),\n    )\n    features = [isin]\n    df = calculate_feature_matrix(\n        entityset=es,\n        features=features,\n        instance_ids=range(8),\n    )\n    true = [True, True, True, False, False, True, True, True]\n    v = df[isin.get_name()].tolist()\n    assert true == v\n\n\ndef test_isin_feat_other_syntax(es):\n    isin = Feature(es[\"log\"].ww[\"product_id\"]).isin([\"toothpaste\", \"coke zero\"])\n    features = [isin]\n    df = calculate_feature_matrix(\n        entityset=es,\n        features=features,\n        instance_ids=range(8),\n    )\n    true = [True, True, True, False, False, True, True, True]\n    v = df[isin.get_name()].tolist()\n    assert true == v\n\n\ndef test_isin_feat_other_syntax_int(es):\n    isin = Feature(es[\"log\"].ww[\"value\"]).isin([5, 10])\n    features = [isin]\n    df = calculate_feature_matrix(\n        entityset=es,\n        features=features,\n        instance_ids=range(8),\n    )\n    true = [False, True, True, False, False, False, False, False]\n    v = df[isin.get_name()].tolist()\n    assert true == v\n\n\ndef test_isin_feat_custom(es):\n    class CustomIsIn(TransformPrimitive):\n        name = 
\"is_in\"\n        input_types = [ColumnSchema()]\n        return_type = ColumnSchema(logical_type=Boolean)\n\n        def __init__(self, list_of_outputs=None):\n            self.list_of_outputs = list_of_outputs\n\n        def get_function(self):\n            def pd_is_in(array):\n                return array.isin(self.list_of_outputs)\n\n            return pd_is_in\n\n    isin = Feature(\n        es[\"log\"].ww[\"product_id\"],\n        primitive=CustomIsIn(list_of_outputs=[\"toothpaste\", \"coke zero\"]),\n    )\n    features = [isin]\n    df = calculate_feature_matrix(\n        entityset=es,\n        features=features,\n        instance_ids=range(8),\n    )\n    true = [True, True, True, False, False, True, True, True]\n    v = df[isin.get_name()].tolist()\n    assert true == v\n\n    isin = Feature(es[\"log\"].ww[\"product_id\"]).isin([\"toothpaste\", \"coke zero\"])\n    features = [isin]\n    df = calculate_feature_matrix(\n        entityset=es,\n        features=features,\n        instance_ids=range(8),\n    )\n    true = [True, True, True, False, False, True, True, True]\n    v = df[isin.get_name()].tolist()\n    assert true == v\n\n    isin = Feature(es[\"log\"].ww[\"value\"]).isin([5, 10])\n    features = [isin]\n    df = calculate_feature_matrix(\n        entityset=es,\n        features=features,\n        instance_ids=range(8),\n    )\n    true = [False, True, True, False, False, False, False, False]\n    v = df[isin.get_name()].tolist()\n    assert true == v\n\n\ndef test_isnull_feat(es):\n    value = Feature(es[\"log\"].ww[\"value\"])\n    diff = Feature(\n        value,\n        groupby=Feature(es[\"log\"].ww[\"session_id\"]),\n        primitive=Diff,\n    )\n    isnull = Feature(diff, primitive=IsNull)\n    features = [isnull]\n    df = calculate_feature_matrix(\n        entityset=es,\n        features=features,\n        instance_ids=range(15),\n    )\n\n    correct_vals = [\n        True,\n        False,\n        False,\n        False,\n        
False,\n        True,\n        False,\n        False,\n        False,\n        True,\n        True,\n        False,\n        True,\n        False,\n        False,\n    ]\n    values = df[isnull.get_name()].tolist()\n    assert correct_vals == values\n\n\ndef test_percentile(es):\n    v = Feature(es[\"log\"].ww[\"value\"])\n    p = Feature(v, primitive=Percentile)\n    feature_set = FeatureSet([p])\n    calculator = FeatureSetCalculator(es, feature_set)\n    df = calculator.run(np.array(range(10, 17)))\n    true = es[\"log\"][v.get_name()].rank(pct=True)\n    true = true.loc[range(10, 17)]\n    for t, a in zip(true.values, df[p.get_name()].values):\n        assert (pd.isnull(t) and pd.isnull(a)) or t == a\n\n\ndef test_dependent_percentile(es):\n    v = Feature(es[\"log\"].ww[\"value\"])\n    p = Feature(v, primitive=Percentile)\n    p2 = Feature(p - 1, primitive=Percentile)\n    feature_set = FeatureSet([p, p2])\n    calculator = FeatureSetCalculator(es, feature_set)\n    df = calculator.run(np.array(range(10, 17)))\n    true = es[\"log\"][v.get_name()].rank(pct=True)\n    true = true.loc[range(10, 17)]\n    for t, a in zip(true.values, df[p.get_name()].values):\n        assert (pd.isnull(t) and pd.isnull(a)) or t == a\n\n\ndef test_agg_percentile(es):\n    v = Feature(es[\"log\"].ww[\"value\"])\n    p = Feature(v, primitive=Percentile)\n    agg = Feature(p, parent_dataframe_name=\"sessions\", primitive=Sum)\n    feature_set = FeatureSet([agg])\n    calculator = FeatureSetCalculator(es, feature_set)\n    df = calculator.run(np.array([0, 1]))\n    log_vals = es[\"log\"][[v.get_name(), \"session_id\"]]\n    log_vals[\"percentile\"] = log_vals[v.get_name()].rank(pct=True)\n    true_p = log_vals.groupby(\"session_id\")[\"percentile\"].sum()[[0, 1]]\n    for t, a in zip(true_p.values, df[agg.get_name()].values):\n        assert (pd.isnull(t) and pd.isnull(a)) or t == a\n\n\ndef test_percentile_agg_percentile(es):\n    v = Feature(es[\"log\"].ww[\"value\"])\n    p = 
Feature(v, primitive=Percentile)\n    agg = Feature(p, parent_dataframe_name=\"sessions\", primitive=Sum)\n    pagg = Feature(agg, primitive=Percentile)\n    feature_set = FeatureSet([pagg])\n    calculator = FeatureSetCalculator(es, feature_set)\n    df = calculator.run(np.array([0, 1]))\n\n    log_vals = es[\"log\"][[v.get_name(), \"session_id\"]]\n    log_vals[\"percentile\"] = log_vals[v.get_name()].rank(pct=True)\n    true_p = log_vals.groupby(\"session_id\")[\"percentile\"].sum().fillna(0)\n    true_p = true_p.rank(pct=True)[[0, 1]]\n\n    for t, a in zip(true_p.values, df[pagg.get_name()].values):\n        assert (pd.isnull(t) and pd.isnull(a)) or t == a\n\n\ndef test_percentile_agg(es):\n    v = Feature(es[\"log\"].ww[\"value\"])\n    agg = Feature(v, parent_dataframe_name=\"sessions\", primitive=Sum)\n    pagg = Feature(agg, primitive=Percentile)\n    feature_set = FeatureSet([pagg])\n    calculator = FeatureSetCalculator(es, feature_set)\n    df = calculator.run(np.array([0, 1]))\n\n    log_vals = es[\"log\"][[v.get_name(), \"session_id\"]]\n    true_p = log_vals.groupby(\"session_id\")[v.get_name()].sum().fillna(0)\n    true_p = true_p.rank(pct=True)[[0, 1]]\n\n    for t, a in zip(true_p.values, df[pagg.get_name()].values):\n        assert (pd.isnull(t) and pd.isnull(a)) or t == a\n\n\ndef test_direct_percentile(es):\n    v = Feature(es[\"customers\"].ww[\"age\"])\n    p = Feature(v, primitive=Percentile)\n    d = Feature(p, \"sessions\")\n    feature_set = FeatureSet([d])\n    calculator = FeatureSetCalculator(es, feature_set)\n    df = calculator.run(np.array([0, 1]))\n\n    cust_vals = es[\"customers\"][[v.get_name()]]\n    cust_vals[\"percentile\"] = cust_vals[v.get_name()].rank(pct=True)\n    true_p = cust_vals[\"percentile\"].loc[[0, 0]]\n    for t, a in zip(true_p.values, df[d.get_name()].values):\n        assert (pd.isnull(t) and pd.isnull(a)) or t == a\n\n\ndef test_direct_agg_percentile(es):\n    v = Feature(es[\"log\"].ww[\"value\"])\n    p = 
Feature(v, primitive=Percentile)\n    agg = Feature(p, parent_dataframe_name=\"customers\", primitive=Sum)\n    d = Feature(agg, \"sessions\")\n    feature_set = FeatureSet([d])\n    calculator = FeatureSetCalculator(es, feature_set)\n    df = calculator.run(np.array([0, 1]))\n\n    log_vals = es[\"log\"][[v.get_name(), \"session_id\"]]\n    log_vals[\"percentile\"] = log_vals[v.get_name()].rank(pct=True)\n    log_vals[\"customer_id\"] = [0] * 10 + [1] * 5 + [2] * 2\n    true_p = log_vals.groupby(\"customer_id\")[\"percentile\"].sum().fillna(0)\n    true_p = true_p[[0, 0]]\n    for t, a in zip(true_p.values, df[d.get_name()].values):\n        assert (pd.isnull(t) and pd.isnull(a)) or round(t, 3) == round(a, 3)\n\n\ndef test_percentile_with_cutoff(es):\n    v = Feature(es[\"log\"].ww[\"value\"])\n    p = Feature(v, primitive=Percentile)\n    feature_set = FeatureSet([p])\n    calculator = FeatureSetCalculator(\n        es,\n        feature_set,\n        pd.Timestamp(\"2011/04/09 10:30:13\"),\n    )\n    df = calculator.run(np.array([2]))\n    assert df[p.get_name()].tolist()[0] == 1.0\n\n\ndef test_two_kinds_of_dependents(es):\n    v = Feature(es[\"log\"].ww[\"value\"])\n    product = Feature(es[\"log\"].ww[\"product_id\"])\n    agg = Feature(\n        v,\n        parent_dataframe_name=\"customers\",\n        where=product == \"coke zero\",\n        primitive=Sum,\n    )\n    p = Feature(agg, primitive=Percentile)\n    g = Feature(agg, primitive=Absolute)\n    agg2 = Feature(\n        v,\n        parent_dataframe_name=\"sessions\",\n        where=product == \"coke zero\",\n        primitive=Sum,\n    )\n    agg3 = Feature(agg2, parent_dataframe_name=\"customers\", primitive=Sum)\n    feature_set = FeatureSet([p, g, agg3])\n    calculator = FeatureSetCalculator(es, feature_set)\n    df = calculator.run(np.array([0, 1]))\n    assert df[p.get_name()].tolist() == [2.0 / 3, 1.0]\n    assert df[g.get_name()].tolist() == [15, 26]\n\n\ndef test_get_filepath(es):\n    class 
Mod4(TransformPrimitive):\n        \"\"\"Return base feature modulo 4\"\"\"\n\n        name = \"mod4\"\n        input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n        return_type = ColumnSchema(semantic_tags={\"numeric\"})\n\n        def get_function(self):\n            filepath = self.get_filepath(\"featuretools_unit_test_example.csv\")\n            reference = pd.read_csv(filepath, header=None).squeeze(\"columns\")\n\n            def map_to_word(x):\n                def _map(x):\n                    if pd.isnull(x):\n                        return x\n                    return reference[int(x) % 4]\n\n                return x.apply(_map)\n\n            return map_to_word\n\n    feat = Feature(es[\"log\"].ww[\"value\"], primitive=Mod4)\n    df = calculate_feature_matrix(features=[feat], entityset=es, instance_ids=range(17))\n    assert pd.isnull(df[\"MOD4(value)\"][15])\n    assert df[\"MOD4(value)\"][0] == 0\n    assert df[\"MOD4(value)\"][14] == 2\n\n    fm, fl = dfs(\n        entityset=es,\n        target_dataframe_name=\"log\",\n        agg_primitives=[],\n        trans_primitives=[Mod4],\n    )\n    assert fm[\"MOD4(value)\"][0] == 0\n    assert fm[\"MOD4(value)\"][14] == 2\n    assert pd.isnull(fm[\"MOD4(value)\"][15])\n\n\ndef test_override_multi_feature_names(es):\n    def gen_custom_names(primitive, base_feature_names):\n        return [\n            \"Above18(%s)\" % base_feature_names,\n            \"Above21(%s)\" % base_feature_names,\n            \"Above65(%s)\" % base_feature_names,\n        ]\n\n    class IsGreater(TransformPrimitive):\n        name = \"is_greater\"\n        input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n        return_type = ColumnSchema(semantic_tags={\"numeric\"})\n        number_output_features = 3\n\n        def get_function(self):\n            def is_greater(x):\n                return x > 18, x > 21, x > 65\n\n            return is_greater\n\n        def generate_names(primitive, base_feature_names):\n 
           return gen_custom_names(primitive, base_feature_names)\n\n    fm, features = dfs(\n        entityset=es,\n        target_dataframe_name=\"customers\",\n        instance_ids=[0, 1, 2],\n        agg_primitives=[],\n        trans_primitives=[IsGreater],\n    )\n\n    expected_names = gen_custom_names(IsGreater, [\"age\"])\n\n    for name in expected_names:\n        assert name in fm.columns\n\n\ndef test_time_since_primitive_matches_all_datetime_types(es):\n    fm, fl = dfs(\n        target_dataframe_name=\"customers\",\n        entityset=es,\n        trans_primitives=[TimeSince],\n        agg_primitives=[],\n        max_depth=1,\n    )\n\n    customers_datetime_cols = [\n        id\n        for id, t in es[\"customers\"].ww.logical_types.items()\n        if isinstance(t, Datetime)\n    ]\n    expected_names = [f\"TIME_SINCE({v})\" for v in customers_datetime_cols]\n\n    for name in expected_names:\n        assert name in fm.columns\n\n\ndef test_cfm_with_numeric_lag_and_non_nullable_column(es):\n    # fill nans so we can use non nullable numeric logical type in the EntitySet\n    new_log = es[\"log\"].copy()\n    new_log[\"value\"] = new_log[\"value\"].fillna(0)\n    new_log.ww.init(\n        logical_types={\"value\": \"Integer\", \"product_id\": \"Categorical\"},\n        index=\"id\",\n        time_index=\"datetime\",\n        name=\"new_log\",\n    )\n    es.add_dataframe(new_log)\n    rels = [\n        (\"sessions\", \"id\", \"new_log\", \"session_id\"),\n        (\"products\", \"id\", \"new_log\", \"product_id\"),\n    ]\n    es = es.add_relationships(rels)\n\n    assert isinstance(es[\"new_log\"].ww.logical_types[\"value\"], Integer)\n\n    periods = 5\n    lag_primitive = NumericLag(periods=periods)\n    cutoff_times = es[\"new_log\"][[\"id\", \"datetime\"]]\n    fm, _ = dfs(\n        target_dataframe_name=\"new_log\",\n        entityset=es,\n        agg_primitives=[],\n        trans_primitives=[lag_primitive],\n        cutoff_time=cutoff_times,\n  
  )\n    assert fm[\"NUMERIC_LAG(datetime, value, periods=5)\"].head(periods).isnull().all()\n    assert fm[\"NUMERIC_LAG(datetime, value, periods=5)\"].isnull().sum() == periods\n\n    assert \"NUMERIC_LAG(datetime, value_2, periods=5)\" in fm.columns\n\n    assert \"NUMERIC_LAG(datetime, products.rating, periods=5)\" in fm.columns\n    assert (\n        fm[\"NUMERIC_LAG(datetime, products.rating, periods=5)\"]\n        .head(periods)\n        .isnull()\n        .all()\n    )\n\n\ndef test_cfm_with_lag_and_non_nullable_columns(es):\n    # fill nans so we can use non nullable numeric logical type in the EntitySet\n    new_log = es[\"log\"].copy()\n    new_log[\"value\"] = new_log[\"value\"].fillna(0)\n    new_log[\"value_double\"] = new_log[\"value\"]\n    new_log[\"purchased_with_nulls\"] = new_log[\"purchased\"]\n    new_log[\"purchased_with_nulls\"][0:4] = None\n    new_log.ww.init(\n        logical_types={\n            \"value\": \"Integer\",\n            \"value_2\": \"IntegerNullable\",\n            \"product_id\": \"Categorical\",\n            \"value_double\": \"Double\",\n            \"purchased_with_nulls\": \"BooleanNullable\",\n        },\n        index=\"id\",\n        time_index=\"datetime\",\n        name=\"new_log\",\n    )\n    es.add_dataframe(new_log)\n    rels = [\n        (\"sessions\", \"id\", \"new_log\", \"session_id\"),\n        (\"products\", \"id\", \"new_log\", \"product_id\"),\n    ]\n    es = es.add_relationships(rels)\n\n    assert isinstance(es[\"new_log\"].ww.logical_types[\"value\"], Integer)\n\n    periods = 5\n    lag_primitive = Lag(periods=periods)\n    cutoff_times = es[\"new_log\"][[\"id\", \"datetime\"]]\n    fm, _ = dfs(\n        target_dataframe_name=\"new_log\",\n        entityset=es,\n        agg_primitives=[],\n        trans_primitives=[lag_primitive],\n        cutoff_time=cutoff_times,\n    )\n    # Integer\n    assert fm[\"LAG(value, datetime, periods=5)\"].head(periods).isnull().all()\n    assert fm[\"LAG(value, 
datetime, periods=5)\"].isnull().sum() == periods\n    assert isinstance(\n        fm.ww.schema.logical_types[\"LAG(value, datetime, periods=5)\"],\n        IntegerNullable,\n    )\n\n    # IntegerNullable\n    assert \"LAG(value_2, datetime, periods=5)\" in fm.columns\n    assert fm[\"LAG(value_2, datetime, periods=5)\"].head(periods).isnull().all()\n    assert isinstance(\n        fm.ww.schema.logical_types[\"LAG(value_2, datetime, periods=5)\"],\n        IntegerNullable,\n    )\n\n    # Categorical\n    assert \"LAG(product_id, datetime, periods=5)\" in fm.columns\n    assert fm[\"LAG(product_id, datetime, periods=5)\"].head(periods).isnull().all()\n    assert isinstance(\n        fm.ww.schema.logical_types[\"LAG(product_id, datetime, periods=5)\"],\n        Categorical,\n    )\n\n    # Double\n    assert \"LAG(value_double, datetime, periods=5)\" in fm.columns\n    assert fm[\"LAG(value_double, datetime, periods=5)\"].head(periods).isnull().all()\n    assert isinstance(\n        fm.ww.schema.logical_types[\"LAG(value_double, datetime, periods=5)\"],\n        Double,\n    )\n\n    # Boolean\n    assert \"LAG(purchased, datetime, periods=5)\" in fm.columns\n    assert fm[\"LAG(purchased, datetime, periods=5)\"].head(periods).isnull().all()\n    assert isinstance(\n        fm.ww.schema.logical_types[\"LAG(purchased, datetime, periods=5)\"],\n        BooleanNullable,\n    )\n\n    # BooleanNullable\n    assert \"LAG(purchased_with_nulls, datetime, periods=5)\" in fm.columns\n    assert (\n        fm[\"LAG(purchased_with_nulls, datetime, periods=5)\"]\n        .head(periods)\n        .isnull()\n        .all()\n    )\n    assert isinstance(\n        fm.ww.schema.logical_types[\"LAG(purchased_with_nulls, datetime, periods=5)\"],\n        BooleanNullable,\n    )\n\n\ndef test_comparisons_with_ordinal_valid_inputs_that_dont_work_but_should(es):\n    # TODO: Remove this test once the correct behavior is implemented in CFM\n    # The following test covers a scenario where 
an intermediate feature doesn't have the correct type\n    # because Woodwork has not yet been initialized. This calculation should work and return valid True/False\n    # values. This should be fixed in a future PR, but until a fix is implemented null values are returned to\n    # prevent calculate_feature_matrix from raising an Error when calculating features generated by DFS.\n\n    priority_level = Feature(es[\"log\"].ww[\"priority_level\"])\n    first_priority = AggregationFeature(\n        priority_level,\n        parent_dataframe_name=\"customers\",\n        primitive=First,\n    )\n    engagement = Feature(es[\"customers\"].ww[\"engagement_level\"])\n    invalid_but_should_be_valid = [\n        TransformFeature([engagement, first_priority], primitive=LessThan),\n        TransformFeature([engagement, first_priority], primitive=LessThanEqualTo),\n        TransformFeature([engagement, first_priority], primitive=GreaterThan),\n        TransformFeature([engagement, first_priority], primitive=GreaterThanEqualTo),\n    ]\n    fm = calculate_feature_matrix(\n        entityset=es,\n        features=invalid_but_should_be_valid,\n    )\n\n    feature_cols = [f.get_name() for f in invalid_but_should_be_valid]\n    for col in feature_cols:\n        assert fm[col].isnull().all()\n\n\ndef test_multiply_numeric_boolean():\n    test_cases = [\n        {\"val\": 100, \"mask\": True, \"expected\": 100},\n        {\"val\": 100, \"mask\": False, \"expected\": 0},\n        {\"val\": 0, \"mask\": False, \"expected\": 0},\n        {\"val\": 100, \"mask\": pd.NA, \"expected\": pd.NA},\n        {\"val\": pd.NA, \"mask\": pd.NA, \"expected\": pd.NA},\n        {\"val\": pd.NA, \"mask\": True, \"expected\": pd.NA},\n        {\"val\": pd.NA, \"mask\": False, \"expected\": pd.NA},\n    ]\n\n    multiply_numeric_boolean = MultiplyNumericBoolean()\n    for input in test_cases:\n        vals = pd.Series(input[\"val\"]).astype(\"Int64\")\n        mask = pd.Series(input[\"mask\"])\n        
actual = multiply_numeric_boolean(vals, mask).tolist()[0]\n        expected = input[\"expected\"]\n        if pd.isnull(expected):\n            assert pd.isnull(actual)\n        else:\n            assert actual == input[\"expected\"]\n\n\ndef test_multiply_numeric_boolean_multiple_dtypes_no_nulls():\n    # Test without null values\n    vals = pd.Series([1, 2, 3])\n    bools = pd.Series([True, False, True])\n    multiply_numeric_boolean = MultiplyNumericBoolean()\n    numeric_dtypes = [\"float64\", \"int64\", \"Int64\"]\n    boolean_dtypes = [\"bool\", \"boolean\"]\n\n    for numeric_dtype in numeric_dtypes:\n        for boolean_dtype in boolean_dtypes:\n            actual = multiply_numeric_boolean(\n                vals.astype(numeric_dtype),\n                bools.astype(boolean_dtype),\n            )\n            expected = pd.Series([1, 0, 3])\n            pd.testing.assert_series_equal(actual, expected, check_dtype=False)\n\n\ndef test_multiply_numeric_boolean_multiple_dtypes_with_nulls():\n    # Test with null values\n    vals = pd.Series([np.nan, 2, 3])\n    bools = pd.Series([True, False, pd.NA], dtype=\"boolean\")\n    multiply_numeric_boolean = MultiplyNumericBoolean()\n    numeric_dtypes = [\"float64\", \"Int64\"]\n\n    for numeric_dtype in numeric_dtypes:\n        actual = multiply_numeric_boolean(vals.astype(numeric_dtype), bools)\n        expected = pd.Series([np.nan, 0, np.nan])\n        pd.testing.assert_series_equal(actual, expected, check_dtype=False)\n\n\ndef test_feature_multiplication(es):\n    numeric_ft = Feature(es[\"customers\"].ww[\"age\"])\n    boolean_ft = Feature(es[\"customers\"].ww[\"loves_ice_cream\"])\n\n    mult_numeric = numeric_ft * numeric_ft\n    mult_boolean = boolean_ft * boolean_ft\n    mult_numeric_boolean = numeric_ft * boolean_ft\n    mult_numeric_boolean2 = boolean_ft * numeric_ft\n\n    assert issubclass(type(mult_numeric.primitive), MultiplyNumeric)\n    assert issubclass(type(mult_boolean.primitive), 
MultiplyBoolean)\n    assert issubclass(type(mult_numeric_boolean.primitive), MultiplyNumericBoolean)\n    assert issubclass(type(mult_numeric_boolean2.primitive), MultiplyNumericBoolean)\n\n    # Test with nullable types\n    es[\"customers\"].ww.set_types(\n        logical_types={\"age\": \"IntegerNullable\", \"loves_ice_cream\": \"BooleanNullable\"},\n    )\n    numeric_ft = Feature(es[\"customers\"].ww[\"age\"])\n    boolean_ft = Feature(es[\"customers\"].ww[\"loves_ice_cream\"])\n    mult_numeric = numeric_ft * numeric_ft\n    mult_boolean = boolean_ft * boolean_ft\n    mult_numeric_boolean = numeric_ft * boolean_ft\n    mult_numeric_boolean2 = boolean_ft * numeric_ft\n\n    assert issubclass(type(mult_numeric.primitive), MultiplyNumeric)\n    assert issubclass(type(mult_boolean.primitive), MultiplyBoolean)\n    assert issubclass(type(mult_numeric_boolean.primitive), MultiplyNumericBoolean)\n    assert issubclass(type(mult_numeric_boolean2.primitive), MultiplyNumericBoolean)\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/transform_primitive_tests/__init__.py",
    "content": ""
  },
  {
    "path": "featuretools/tests/primitive_tests/transform_primitive_tests/test_cumulative_time_since.py",
    "content": "from datetime import datetime\n\nimport numpy as np\nimport pandas as pd\n\nfrom featuretools.primitives import (\n    CumulativeTimeSinceLastFalse,\n    CumulativeTimeSinceLastTrue,\n)\nfrom featuretools.tests.primitive_tests.utils import (\n    PrimitiveTestBase,\n    find_applicable_primitives,\n    valid_dfs,\n)\n\n\nclass TestCumulativeTimeSinceLastTrue(PrimitiveTestBase):\n    primitive = CumulativeTimeSinceLastTrue\n    booleans = pd.Series([False, True, False, True, False, False])\n    datetimes = pd.Series(\n        [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(len(booleans))],\n    )\n    answer = pd.Series([np.nan, 0, 6, 0, 6, 12])\n\n    def test_regular(self):\n        primitive_func = self.primitive().get_function()\n        given_answer = primitive_func(self.datetimes, self.booleans)\n        assert given_answer.equals(self.answer)\n\n    def test_all_false(self):\n        primitive_func = self.primitive().get_function()\n        booleans = pd.Series([False, False, False])\n        datetimes = pd.Series(\n            [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(len(booleans))],\n        )\n        given_answer = primitive_func(datetimes, booleans)\n        answer = pd.Series([np.nan] * 3)\n        assert given_answer.equals(answer)\n\n    def test_all_nan(self):\n        primitive_func = self.primitive().get_function()\n        datetimes = pd.Series([np.nan] * 4)\n        booleans = pd.Series([np.nan] * 4)\n        given_answer = primitive_func(datetimes, booleans)\n        answer = pd.Series([np.nan] * 4)\n        assert given_answer.equals(answer)\n\n    def test_some_nans(self):\n        primitive_func = self.primitive().get_function()\n        booleans = pd.Series(\n            [\n                False,\n                True,\n                False,\n                True,\n                False,\n                False,\n                True,\n                True,\n                False,\n                False,\n      
      ],\n        )\n        datetimes = pd.Series([np.nan] * 2)\n        datetimes = pd.concat([datetimes, self.datetimes])\n        datetimes = pd.concat([datetimes, pd.Series([np.nan] * 2)])\n        datetimes = datetimes.reset_index(drop=True)\n        answer = pd.Series(\n            [\n                np.nan,\n                np.nan,\n                np.nan,\n                0,\n                6,\n                12,\n                0,\n                0,\n                np.nan,\n                np.nan,\n            ],\n        )\n        given_answer = primitive_func(datetimes, booleans)\n        assert given_answer.equals(answer)\n\n    def test_with_featuretools(self, es):\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        primitive_instance = self.primitive()\n        transform.append(primitive_instance)\n        valid_dfs(es, aggregation, transform, self.primitive)\n\n\nclass TestCumulativeTimeSinceLastFalse(PrimitiveTestBase):\n    primitive = CumulativeTimeSinceLastFalse\n    booleans = pd.Series([True, False, True, False, True, True])\n    datetimes = pd.Series(\n        [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(len(booleans))],\n    )\n    answer = pd.Series([np.nan, 0, 6, 0, 6, 12])\n\n    def test_regular(self):\n        primitive_func = self.primitive().get_function()\n        given_answer = primitive_func(self.datetimes, self.booleans)\n        assert given_answer.equals(self.answer)\n\n    def test_all_true(self):\n        primitive_func = self.primitive().get_function()\n        booleans = pd.Series([True, True, True])\n        datetimes = pd.Series(\n            [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(len(booleans))],\n        )\n        given_answer = primitive_func(datetimes, booleans)\n        answer = pd.Series([np.nan] * 3)\n        assert given_answer.equals(answer)\n\n    def test_all_nan(self):\n        primitive_func = self.primitive().get_function()\n        datetimes = 
pd.Series([np.nan] * 4)\n        booleans = pd.Series([np.nan] * 4)\n        given_answer = primitive_func(datetimes, booleans)\n        answer = pd.Series([np.nan] * 4)\n        assert given_answer.equals(answer)\n\n    def test_some_nans(self):\n        primitive_func = self.primitive().get_function()\n        booleans = pd.Series(\n            [\n                True,\n                False,\n                True,\n                False,\n                True,\n                True,\n                False,\n                False,\n                True,\n                True,\n            ],\n        )\n        datetimes = pd.Series([np.nan] * 2)\n        datetimes = pd.concat([datetimes, self.datetimes])\n        datetimes = pd.concat([datetimes, pd.Series([np.nan] * 2)])\n        datetimes = datetimes.reset_index(drop=True)\n        answer = pd.Series(\n            [\n                np.nan,\n                np.nan,\n                np.nan,\n                0,\n                6,\n                12,\n                0,\n                0,\n                np.nan,\n                np.nan,\n            ],\n        )\n        given_answer = primitive_func(datetimes, booleans)\n        assert given_answer.equals(answer)\n\n    def test_with_featuretools(self, es):\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        primitive_instance = self.primitive()\n        transform.append(primitive_instance)\n        valid_dfs(es, aggregation, transform, self.primitive)\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/transform_primitive_tests/test_datetoholiday_primitive.py",
    "content": "from datetime import datetime\n\nimport numpy as np\nimport pandas as pd\nimport pytest\n\nfrom featuretools.primitives import DateToHoliday\n\n\ndef test_datetoholiday():\n    date_to_holiday = DateToHoliday()\n\n    dates = pd.Series(\n        [\n            datetime(2016, 1, 1),\n            datetime(2016, 2, 27),\n            datetime(2017, 5, 29, 10, 30, 5),\n            datetime(2018, 7, 4),\n        ],\n    )\n\n    holiday_series = date_to_holiday(dates).tolist()\n\n    assert holiday_series[0] == \"New Year's Day\"\n    assert np.isnan(holiday_series[1])\n    assert holiday_series[2] == \"Memorial Day\"\n    assert holiday_series[3] == \"Independence Day\"\n\n\ndef test_datetoholiday_error():\n    error_text = r\"must be one of the available countries.*\"\n    with pytest.raises(ValueError, match=error_text):\n        DateToHoliday(country=\"UNK\")\n\n\ndef test_nat():\n    date_to_holiday = DateToHoliday()\n    case = pd.Series(\n        [\n            \"2019-10-14\",\n            \"NaT\",\n            \"2016-02-15\",\n            \"NaT\",\n        ],\n    ).astype(\"datetime64[ns]\")\n    answer = [\"Columbus Day\", np.nan, \"Washington's Birthday\", np.nan]\n    given_answer = date_to_holiday(case).astype(\"str\")\n    np.testing.assert_array_equal(given_answer, answer)\n\n\ndef test_valid_country():\n    date_to_holiday = DateToHoliday(country=\"Canada\")\n    case = pd.Series(\n        [\n            \"2016-07-01\",\n            \"2016-11-11\",\n            \"2018-12-25\",\n        ],\n    ).astype(\"datetime64[ns]\")\n    answer = [\"Canada Day\", np.nan, \"Christmas Day\"]\n    given_answer = date_to_holiday(case).astype(\"str\")\n    np.testing.assert_array_equal(given_answer, answer)\n\n\ndef test_multiple_countries():\n    dth_mexico = DateToHoliday(country=\"Mexico\")\n\n    case = pd.Series([datetime(2000, 9, 16), datetime(2005, 1, 1)])\n    assert len(dth_mexico(case)) > 1\n\n    dth_india = DateToHoliday(country=\"IND\")\n    
case = pd.Series([datetime(2048, 1, 1), datetime(2048, 10, 2)])\n    assert len(dth_india(case)) > 1\n\n    dth_uk = DateToHoliday(country=\"UK\")\n    case = pd.Series([datetime(2048, 3, 17), datetime(2048, 4, 6)])\n    assert len(dth_uk(case)) > 1\n\n    countries = [\n        \"Argentina\",\n        \"AU\",\n        \"Austria\",\n        \"BY\",\n        \"Belgium\",\n        \"Brazil\",\n        \"Canada\",\n        \"Colombia\",\n        \"Croatia\",\n        \"England\",\n        \"Finland\",\n        \"FRA\",\n        \"Germany\",\n        \"Germany\",\n        \"Italy\",\n        \"NewZealand\",\n        \"PortugalExt\",\n        \"PTE\",\n        \"Spain\",\n        \"ES\",\n        \"Switzerland\",\n        \"UnitedStates\",\n        \"US\",\n        \"UK\",\n        \"UA\",\n        \"CH\",\n        \"SE\",\n        \"ZA\",\n    ]\n    for x in countries:\n        DateToHoliday(country=x)\n\n\ndef test_with_timezone_aware_datetimes():\n    df = pd.DataFrame(\n        {\n            \"non_timezone_aware_with_time\": pd.date_range(\n                \"2018-07-03 09:00\",\n                periods=3,\n            ),\n            \"non_timezone_aware_no_time\": pd.date_range(\"2018-07-03\", periods=3),\n            \"timezone_aware_with_time\": pd.date_range(\n                \"2018-07-03 09:00\",\n                periods=3,\n            ).tz_localize(tz=\"US/Eastern\"),\n            \"timezone_aware_no_time\": pd.date_range(\n                \"2018-07-03\",\n                periods=3,\n            ).tz_localize(tz=\"US/Eastern\"),\n        },\n    )\n\n    date_to_holiday = DateToHoliday(country=\"US\")\n    expected = [np.nan, \"Independence Day\", np.nan]\n    for col in df.columns:\n        actual = date_to_holiday(df[col]).astype(\"str\")\n        np.testing.assert_array_equal(actual, expected)\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/transform_primitive_tests/test_distancetoholiday_primitive.py",
    "content": "from datetime import datetime\n\nimport numpy as np\nimport pandas as pd\nimport pytest\n\nfrom featuretools.primitives import DistanceToHoliday\n\n\ndef test_distanceholiday():\n    distance_to_holiday = DistanceToHoliday(\"New Year's Day\")\n    dates = pd.Series(\n        [\n            datetime(2010, 1, 1),\n            datetime(2012, 5, 31),\n            datetime(2017, 7, 31),\n            datetime(2020, 12, 31),\n        ],\n    )\n\n    expected = [0, -151, 154, 1]\n    output = distance_to_holiday(dates).tolist()\n    np.testing.assert_array_equal(output, expected)\n\n\ndef test_unknown_country_error():\n    error_text = r\"must be one of the available countries.*\"\n    with pytest.raises(ValueError, match=error_text):\n        DistanceToHoliday(\"Victoria Day\", country=\"UNK\")\n\n\ndef test_unknown_holiday_error():\n    error_text = r\"must be one of the available holidays.*\"\n    with pytest.raises(ValueError, match=error_text):\n        DistanceToHoliday(\"Alteryx Day\")\n\n\ndef test_nat():\n    date_to_holiday = DistanceToHoliday(\"New Year's Day\")\n    case = pd.Series(\n        [\n            \"2010-01-01\",\n            \"NaT\",\n            \"2012-05-31\",\n            \"NaT\",\n        ],\n    ).astype(\"datetime64[ns]\")\n    answer = [0, np.nan, -151, np.nan]\n    given_answer = date_to_holiday(case).astype(\"float\")\n    np.testing.assert_array_equal(given_answer, answer)\n\n\ndef test_valid_country():\n    distance_to_holiday = DistanceToHoliday(\"Canada Day\", country=\"Canada\")\n    case = pd.Series(\n        [\n            \"2010-01-01\",\n            \"2012-05-31\",\n            \"2017-07-31\",\n            \"2020-12-31\",\n        ],\n    ).astype(\"datetime64[ns]\")\n    answer = [181, 31, -30, 182]\n    given_answer = distance_to_holiday(case).astype(\"float\")\n    np.testing.assert_array_equal(given_answer, answer)\n\n\ndef test_with_timezone_aware_datetimes():\n    df = pd.DataFrame(\n        {\n            
\"non_timezone_aware_with_time\": pd.date_range(\n                \"2018-07-03 09:00\",\n                periods=3,\n            ),\n            \"non_timezone_aware_no_time\": pd.date_range(\"2018-07-03\", periods=3),\n            \"timezone_aware_with_time\": pd.date_range(\n                \"2018-07-03 09:00\",\n                periods=3,\n            ).tz_localize(tz=\"US/Eastern\"),\n            \"timezone_aware_no_time\": pd.date_range(\n                \"2018-07-03\",\n                periods=3,\n            ).tz_localize(tz=\"US/Eastern\"),\n        },\n    )\n\n    distance_to_holiday = DistanceToHoliday(\"Independence Day\", country=\"US\")\n    expected = [1, 0, -1]\n    for col in df.columns:\n        actual = distance_to_holiday(df[col])\n        np.testing.assert_array_equal(actual, expected)\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/transform_primitive_tests/test_expanding_primitives.py",
    "content": "import numpy as np\nimport pandas as pd\nimport pytest\n\nfrom featuretools.primitives.standard.transform.time_series.expanding import (\n    ExpandingCount,\n    ExpandingMax,\n    ExpandingMean,\n    ExpandingMin,\n    ExpandingSTD,\n    ExpandingTrend,\n)\nfrom featuretools.primitives.standard.transform.time_series.utils import (\n    _apply_gap_for_expanding_primitives,\n)\nfrom featuretools.utils import calculate_trend\n\n\n@pytest.mark.parametrize(\n    \"min_periods, gap\",\n    [\n        (5, 2),\n        (5, 0),\n        (0, 0),\n    ],\n)\ndef test_expanding_count_series(window_series, min_periods, gap):\n    test = window_series.shift(gap)\n    expected = test.expanding(min_periods=min_periods).count()\n    num_nans = gap + min_periods - 1\n    expected[range(num_nans)] = np.nan\n    primitive_instance = ExpandingCount(min_periods=min_periods, gap=gap).get_function()\n    actual = primitive_instance(window_series.index)\n    pd.testing.assert_series_equal(pd.Series(actual), expected)\n\n\n@pytest.mark.parametrize(\n    \"min_periods, gap\",\n    [\n        (5, 2),\n        (5, 0),\n        (0, 0),\n        (0, 1),\n    ],\n)\ndef test_expanding_count_date_range(window_date_range, min_periods, gap):\n    test = _apply_gap_for_expanding_primitives(gap=gap, x=window_date_range)\n    expected = test.expanding(min_periods=min_periods).count()\n    num_nans = gap + min_periods - 1\n    expected[range(num_nans)] = np.nan\n    primitive_instance = ExpandingCount(min_periods=min_periods, gap=gap).get_function()\n    actual = primitive_instance(window_date_range)\n    pd.testing.assert_series_equal(pd.Series(actual), expected)\n\n\n@pytest.mark.parametrize(\n    \"min_periods, gap\",\n    [\n        (5, 2),\n        (5, 0),\n        (0, 0),\n        (0, 1),\n    ],\n)\ndef test_expanding_min(window_series, min_periods, gap):\n    test = window_series.shift(gap)\n    expected = test.expanding(min_periods=min_periods).min().values\n    
primitive_instance = ExpandingMin(min_periods=min_periods, gap=gap).get_function()\n    actual = primitive_instance(\n        numeric=window_series,\n        datetime=window_series.index,\n    )\n    pd.testing.assert_series_equal(pd.Series(actual), pd.Series(expected))\n\n\n@pytest.mark.parametrize(\n    \"min_periods, gap\",\n    [\n        (5, 2),\n        (5, 0),\n        (0, 0),\n        (0, 1),\n    ],\n)\ndef test_expanding_max(window_series, min_periods, gap):\n    test = window_series.shift(gap)\n    expected = test.expanding(min_periods=min_periods).max().values\n    primitive_instance = ExpandingMax(min_periods=min_periods, gap=gap).get_function()\n    actual = primitive_instance(\n        numeric=window_series,\n        datetime=window_series.index,\n    )\n    pd.testing.assert_series_equal(pd.Series(actual), pd.Series(expected))\n\n\n@pytest.mark.parametrize(\n    \"min_periods, gap\",\n    [\n        (5, 2),\n        (5, 0),\n        (0, 0),\n        (0, 1),\n    ],\n)\ndef test_expanding_std(window_series, min_periods, gap):\n    test = window_series.shift(gap)\n    expected = test.expanding(min_periods=min_periods).std().values\n    primitive_instance = ExpandingSTD(min_periods=min_periods, gap=gap).get_function()\n    actual = primitive_instance(\n        numeric=window_series,\n        datetime=window_series.index,\n    )\n    pd.testing.assert_series_equal(pd.Series(actual), pd.Series(expected))\n\n\n@pytest.mark.parametrize(\n    \"min_periods, gap\",\n    [\n        (5, 2),\n        (5, 0),\n        (0, 0),\n        (0, 1),\n    ],\n)\ndef test_expanding_mean(window_series, min_periods, gap):\n    test = window_series.shift(gap)\n    expected = test.expanding(min_periods=min_periods).mean().values\n    primitive_instance = ExpandingMean(min_periods=min_periods, gap=gap).get_function()\n    actual = primitive_instance(\n        numeric=window_series,\n        datetime=window_series.index,\n    )\n    
pd.testing.assert_series_equal(pd.Series(actual), pd.Series(expected))\n\n\n@pytest.mark.parametrize(\n    \"min_periods, gap\",\n    [\n        (5, 2),\n        (5, 0),\n        (0, 0),\n        (0, 1),\n    ],\n)\ndef test_expanding_trend(window_series, min_periods, gap):\n    test = window_series.shift(gap)\n    expected = test.expanding(min_periods=min_periods).aggregate(calculate_trend).values\n    primitive_instance = ExpandingTrend(min_periods=min_periods, gap=gap).get_function()\n    actual = primitive_instance(\n        numeric=window_series,\n        datetime=window_series.index,\n    )\n    pd.testing.assert_series_equal(pd.Series(actual), pd.Series(expected))\n\n\n@pytest.mark.parametrize(\n    \"primitive\",\n    [\n        ExpandingMax,\n        ExpandingMean,\n        ExpandingMin,\n        ExpandingSTD,\n        ExpandingTrend,\n    ],\n)\ndef test_expanding_primitives_throw_error_when_given_string_offset(\n    window_series,\n    primitive,\n):\n    error_msg = (\n        \"String offsets are not supported for the gap parameter in Expanding primitives\"\n    )\n    with pytest.raises(TypeError, match=error_msg):\n        primitive(gap=\"2H\").get_function()(\n            numeric=window_series,\n            datetime=window_series.index,\n        )\n\n\ndef test_apply_gap_for_expanding_primitives_throws_error_when_given_string_offset(\n    window_series,\n):\n    error_msg = (\n        \"String offsets are not supported for the gap parameter in Expanding primitives\"\n    )\n    with pytest.raises(TypeError, match=error_msg):\n        _apply_gap_for_expanding_primitives(window_series, gap=\"2H\")\n\n\n@pytest.mark.parametrize(\n    \"gap\",\n    [\n        2,\n        5,\n        3,\n        0,\n    ],\n)\ndef test_apply_gap_for_expanding_primitives(window_series, gap):\n    actual = _apply_gap_for_expanding_primitives(window_series, gap).values\n    expected = window_series.shift(gap).values\n    pd.testing.assert_series_equal(pd.Series(actual), 
pd.Series(expected))\n\n\n@pytest.mark.parametrize(\n    \"gap\",\n    [\n        2,\n        5,\n        3,\n        0,\n    ],\n)\ndef test_apply_gap_for_expanding_primitives_handles_date_range(\n    window_date_range,\n    gap,\n):\n    actual = pd.Series(\n        _apply_gap_for_expanding_primitives(window_date_range, gap).values,\n    )\n    expected = pd.Series(window_date_range.to_series().shift(gap).values)\n    pd.testing.assert_series_equal(actual, expected)\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/transform_primitive_tests/test_exponential_primitives.py",
    "content": "import numpy as np\nimport pandas as pd\n\nfrom featuretools.primitives import (\n    ExponentialWeightedAverage,\n    ExponentialWeightedSTD,\n    ExponentialWeightedVariance,\n)\n\n\ndef test_regular_com_avg():\n    primitive_instance = ExponentialWeightedAverage(com=0.5)\n    primitive_func = primitive_instance.get_function()\n    array = pd.Series([1, 2, 7, 5])\n    answer = pd.Series(primitive_func(array))\n    correct_answer = pd.Series([1.0, 1.75, 5.384615384615384, 5.125])\n    pd.testing.assert_series_equal(answer, correct_answer)\n\n\ndef test_regular_span_avg():\n    primitive_instance = ExponentialWeightedAverage(span=1.5)\n    primitive_func = primitive_instance.get_function()\n    array = pd.Series([1, 2, 7, 5])\n    answer = pd.Series(primitive_func(array))\n    correct_answer = pd.Series([1.0, 1.8333333333333335, 6.0, 5.198717948717948])\n    pd.testing.assert_series_equal(answer, correct_answer)\n\n\ndef test_regular_halflife_avg():\n    primitive_instance = ExponentialWeightedAverage(halflife=2.7)\n    primitive_func = primitive_instance.get_function()\n    array = pd.Series([1, 2, 7, 5])\n    answer = pd.Series(primitive_func(array))\n    correct_answer = pd.Series(\n        [1.0, 1.563830114594977, 3.8556233149044865, 4.2592901785684205],\n    )\n    pd.testing.assert_series_equal(answer, correct_answer)\n\n\ndef test_regular_alpha_avg():\n    primitive_instance = ExponentialWeightedAverage(alpha=0.8)\n    primitive_func = primitive_instance.get_function()\n    array = pd.Series([1, 2, 7, 5])\n    answer = pd.Series(primitive_func(array))\n    correct_answer = pd.Series([1.0, 1.8333333333333335, 6.0, 5.198717948717948])\n    pd.testing.assert_series_equal(answer, correct_answer)\n\n\ndef test_na_avg():\n    primitive_instance = ExponentialWeightedAverage(com=0.5)\n    primitive_func = primitive_instance.get_function()\n    array = pd.Series([1, 2, 7, np.nan, 5])\n    answer = pd.Series(primitive_func(array))\n    correct_answer = 
pd.Series(\n        [1.0, 1.75, 5.384615384615384, 5.384615384615384, 5.053191489361702],\n    )\n    pd.testing.assert_series_equal(answer, correct_answer)\n\n\ndef test_ignorena_true_avg():\n    primitive_instance = ExponentialWeightedAverage(com=0.5, ignore_na=True)\n    primitive_func = primitive_instance.get_function()\n    array = pd.Series([1, 2, 7, np.nan, 5])\n    answer = pd.Series(primitive_func(array))\n    correct_answer = pd.Series(\n        [1.0, 1.75, 5.384615384615384, 5.384615384615384, 5.125],\n    )\n    pd.testing.assert_series_equal(answer, correct_answer)\n\n\ndef test_regular_com_std():\n    primitive_instance = ExponentialWeightedSTD(com=0.5)\n    primitive_func = primitive_instance.get_function()\n    array = pd.Series([1, 2, 7, 5])\n    answer = pd.Series(primitive_func(array))\n    correct_answer = pd.Series(\n        [np.nan, 0.7071067811865475, 3.584153156068229, 2.0048019276803304],\n    )\n    pd.testing.assert_series_equal(answer, correct_answer)\n\n\ndef test_regular_span_std():\n    primitive_instance = ExponentialWeightedSTD(span=1.5)\n    primitive_func = primitive_instance.get_function()\n    array = pd.Series([1, 2, 7, 5])\n    answer = pd.Series(primitive_func(array))\n    correct_answer = pd.Series(\n        [np.nan, 0.7071067811865476, 3.6055512754639887, 1.7311551816712718],\n    )\n    pd.testing.assert_series_equal(answer, correct_answer)\n\n\ndef test_regular_halflife_std():\n    primitive_instance = ExponentialWeightedSTD(halflife=2.7)\n    primitive_func = primitive_instance.get_function()\n    array = pd.Series([1, 2, 7, 5])\n    answer = pd.Series(primitive_func(array))\n    correct_answer = pd.Series(\n        [np.nan, 0.7071067811865475, 3.3565236098585416, 2.631776826295855],\n    )\n    pd.testing.assert_series_equal(answer, correct_answer)\n\n\ndef test_regular_alpha_std():\n    primitive_instance = ExponentialWeightedSTD(alpha=0.8)\n    primitive_func = primitive_instance.get_function()\n    array = 
pd.Series([1, 2, 7, 5])\n    answer = pd.Series(primitive_func(array))\n    correct_answer = pd.Series(\n        [np.nan, 0.7071067811865476, 3.6055512754639887, 1.7311551816712718],\n    )\n    pd.testing.assert_series_equal(answer, correct_answer)\n\n\ndef test_na_std():\n    primitive_instance = ExponentialWeightedSTD(com=0.5)\n    primitive_func = primitive_instance.get_function()\n    array = pd.Series([1, 2, 7, np.nan, 5])\n    answer = pd.Series(primitive_func(array))\n    correct_answer = pd.Series(\n        [\n            np.nan,\n            0.7071067811865475,\n            3.584153156068229,\n            3.5841531560682287,\n            1.8408520483016189,\n        ],\n    )\n    pd.testing.assert_series_equal(answer, correct_answer)\n\n\ndef test_ignorena_true_std():\n    primitive_instance = ExponentialWeightedSTD(com=0.5, ignore_na=True)\n    primitive_func = primitive_instance.get_function()\n    array = pd.Series([1, 2, 7, np.nan, 5])\n    answer = pd.Series(primitive_func(array))\n    correct_answer = pd.Series(\n        [\n            np.nan,\n            0.7071067811865475,\n            3.584153156068229,\n            3.584153156068229,\n            2.0048019276803304,\n        ],\n    )\n    pd.testing.assert_series_equal(answer, correct_answer)\n\n\ndef test_regular_com_var():\n    primitive_instance = ExponentialWeightedVariance(com=0.5)\n    primitive_func = primitive_instance.get_function()\n    array = pd.Series([1, 2, 7, 5])\n    answer = pd.Series(primitive_func(array))\n    correct_answer = pd.Series(\n        [np.nan, 0.49999999999999983, 12.846153846153847, 4.019230769230769],\n    )\n    pd.testing.assert_series_equal(answer, correct_answer)\n\n\ndef test_regular_span_var():\n    primitive_instance = ExponentialWeightedVariance(span=1.5)\n    primitive_func = primitive_instance.get_function()\n    array = pd.Series([1, 2, 7, 5])\n    answer = pd.Series(primitive_func(array))\n    correct_answer = pd.Series([np.nan, 0.5, 
12.999999999999996, 2.996898263027294])\n    pd.testing.assert_series_equal(answer, correct_answer)\n\n\ndef test_regular_halflife_var():\n    primitive_instance = ExponentialWeightedVariance(halflife=2.7)\n    primitive_func = primitive_instance.get_function()\n    array = pd.Series([1, 2, 7, 5])\n    answer = pd.Series(primitive_func(array))\n    correct_answer = pd.Series(\n        [np.nan, 0.49999999999999994, 11.266250743537816, 6.926249263427883],\n    )\n    pd.testing.assert_series_equal(answer, correct_answer)\n\n\ndef test_regular_alpha_var():\n    primitive_instance = ExponentialWeightedVariance(alpha=0.8)\n    primitive_func = primitive_instance.get_function()\n    array = pd.Series([1, 2, 7, 5])\n    answer = pd.Series(primitive_func(array))\n    correct_answer = pd.Series([np.nan, 0.5, 12.999999999999996, 2.996898263027294])\n    pd.testing.assert_series_equal(answer, correct_answer)\n\n\ndef test_na_var():\n    primitive_instance = ExponentialWeightedVariance(com=0.5)\n    primitive_func = primitive_instance.get_function()\n    array = pd.Series([1, 2, 7, np.nan, 5])\n    answer = pd.Series(primitive_func(array))\n    correct_answer = pd.Series(\n        [\n            np.nan,\n            0.49999999999999983,\n            12.846153846153847,\n            12.846153846153843,\n            3.3887362637362655,\n        ],\n    )\n    pd.testing.assert_series_equal(answer, correct_answer)\n\n\ndef test_ignorena_true_var():\n    primitive_instance = ExponentialWeightedVariance(com=0.5, ignore_na=True)\n    primitive_func = primitive_instance.get_function()\n    array = pd.Series([1, 2, 7, np.nan, 5])\n    answer = pd.Series(primitive_func(array))\n    correct_answer = pd.Series(\n        [\n            np.nan,\n            0.49999999999999983,\n            12.846153846153847,\n            12.846153846153847,\n            4.019230769230769,\n        ],\n    )\n    pd.testing.assert_series_equal(answer, correct_answer)\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/transform_primitive_tests/test_full_name_primitives.py",
    "content": "import numpy as np\nimport pandas as pd\n\nfrom featuretools.primitives import (\n    FullNameToFirstName,\n    FullNameToLastName,\n    FullNameToTitle,\n)\nfrom featuretools.tests.primitive_tests.utils import (\n    PrimitiveTestBase,\n    find_applicable_primitives,\n    valid_dfs,\n)\n\n\nclass TestFullNameToFirstName(PrimitiveTestBase):\n    primitive = FullNameToFirstName\n\n    def test_urls(self):\n        # note this implementation incorrectly identifies the first\n        # name for 'Oliva y Ocana, Dona. Fermina'\n        primitive_func = self.primitive().get_function()\n        names = pd.Series(\n            [\n                \"Spector, Mr. Woolf\",\n                \"Oliva y Ocana, Dona. Fermina\",\n                \"Saether, Mr. Simon Sivertsen\",\n                \"Ware, Mr. Frederick\",\n                \"Peter, Master. Michael J\",\n            ],\n        )\n        answer = pd.Series([\"Woolf\", \"Oliva\", \"Simon\", \"Frederick\", \"Michael\"])\n        pd.testing.assert_series_equal(primitive_func(names), answer, check_names=False)\n\n    def test_no_title(self):\n        primitive_func = self.primitive().get_function()\n        names = pd.Series(\n            [\n                \"Peter, Michael J\",\n                \"James Masters\",\n                \"Kate Elizabeth Brown-Jones\",\n            ],\n        )\n        answer = pd.Series([\"Michael\", \"James\", \"Kate\"], dtype=object)\n        pd.testing.assert_series_equal(primitive_func(names), answer, check_names=False)\n\n    def test_empty_string(self):\n        primitive_func = self.primitive().get_function()\n        names = pd.Series(\n            [\n                \"Peter, Michael J\",\n                \"\",\n                \"Kate Elizabeth Brown-Jones\",\n            ],\n        )\n        answer = pd.Series([\"Michael\", np.nan, \"Kate\"], dtype=object)\n        pd.testing.assert_series_equal(primitive_func(names), answer, check_names=False)\n\n    def 
test_single_name(self):\n        primitive_func = self.primitive().get_function()\n        names = pd.Series(\n            [\n                \"Peter, Michael J\",\n                \"James\",\n                \"Kate Elizabeth Brown-Jones\",\n            ],\n        )\n        answer = pd.Series([\"Michael\", \"James\", \"Kate\"], dtype=object)\n        pd.testing.assert_series_equal(primitive_func(names), answer, check_names=False)\n\n    def test_nan(self):\n        primitive_func = self.primitive().get_function()\n        names = pd.Series([\"Mr. James Brown\", np.nan, None])\n        answer = pd.Series([\"James\", np.nan, np.nan])\n        pd.testing.assert_series_equal(primitive_func(names), answer, check_names=False)\n\n    def test_with_featuretools(self, es):\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        primitive_instance = self.primitive()\n        transform.append(primitive_instance)\n        valid_dfs(es, aggregation, transform, self.primitive)\n\n\nclass TestFullNameToLastName(PrimitiveTestBase):\n    primitive = FullNameToLastName\n\n    def test_urls(self):\n        primitive_func = self.primitive().get_function()\n        names = pd.Series(\n            [\n                \"Spector, Mr. Woolf\",\n                \"Oliva y Ocana, Dona. Fermina\",\n                \"Saether, Mr. Simon Sivertsen\",\n                \"Ware, Mr. Frederick\",\n                \"Peter, Master. 
Michael J\",\n            ],\n        )\n        answer = pd.Series([\"Spector\", \"Oliva y Ocana\", \"Saether\", \"Ware\", \"Peter\"])\n        pd.testing.assert_series_equal(primitive_func(names), answer, check_names=False)\n\n    def test_no_title(self):\n        primitive_func = self.primitive().get_function()\n        names = pd.Series(\n            [\n                \"Peter, Michael J\",\n                \"James Masters\",\n                \"Kate Elizabeth Brown-Jones\",\n            ],\n        )\n        answer = pd.Series([\"Peter\", \"Masters\", \"Brown-Jones\"], dtype=object)\n        pd.testing.assert_series_equal(primitive_func(names), answer, check_names=False)\n\n    def test_empty_string(self):\n        primitive_func = self.primitive().get_function()\n        names = pd.Series(\n            [\n                \"Peter, Michael J\",\n                \"\",\n                \"Kate Elizabeth Brown-Jones\",\n            ],\n        )\n        answer = pd.Series([\"Peter\", np.nan, \"Brown-Jones\"], dtype=object)\n        pd.testing.assert_series_equal(primitive_func(names), answer, check_names=False)\n\n    def test_single_name(self):\n        primitive_func = self.primitive().get_function()\n        names = pd.Series(\n            [\n                \"Peter, Michael J\",\n                \"James\",\n                \"Kate Elizabeth Brown-Jones\",\n            ],\n        )\n        answer = pd.Series([\"Peter\", np.nan, \"Brown-Jones\"], dtype=object)\n        pd.testing.assert_series_equal(primitive_func(names), answer, check_names=False)\n\n    def test_nan(self):\n        primitive_func = self.primitive().get_function()\n        names = pd.Series([\"Mr. 
James Brown\", np.nan, None])\n        answer = pd.Series([\"Brown\", np.nan, np.nan])\n        pd.testing.assert_series_equal(primitive_func(names), answer, check_names=False)\n\n    def test_with_featuretools(self, es):\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        primitive_instance = self.primitive()\n        transform.append(primitive_instance)\n        valid_dfs(es, aggregation, transform, self.primitive)\n\n\nclass TestFullNameToTitle(PrimitiveTestBase):\n    primitive = FullNameToTitle\n\n    def test_urls(self):\n        primitive_func = self.primitive().get_function()\n        names = pd.Series(\n            [\n                \"Spector, Mr. Woolf\",\n                \"Oliva y Ocana, Dona. Fermina\",\n                \"Saether, Mr. Simon Sivertsen\",\n                \"Ware, Mr. Frederick\",\n                \"Peter, Master. Michael J\",\n                \"Mr. Brown\",\n            ],\n        )\n        answer = pd.Series([\"Mr\", \"Dona\", \"Mr\", \"Mr\", \"Master\", \"Mr\"])\n        pd.testing.assert_series_equal(primitive_func(names), answer, check_names=False)\n\n    def test_no_title(self):\n        primitive_func = self.primitive().get_function()\n        names = pd.Series(\n            [\n                \"Peter, Michael J\",\n                \"James Master.\",\n                \"Mrs Brown\",\n                \"\",\n            ],\n        )\n        answer = pd.Series([np.nan, np.nan, np.nan, np.nan], dtype=object)\n        pd.testing.assert_series_equal(primitive_func(names), answer, check_names=False)\n\n    def test_nan(self):\n        primitive_func = self.primitive().get_function()\n        names = pd.Series([\"Mr. 
Brown\", np.nan, None])\n        answer = pd.Series([\"Mr\", np.nan, np.nan])\n        pd.testing.assert_series_equal(primitive_func(names), answer, check_names=False)\n\n    def test_with_featuretools(self, es):\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        primitive_instance = self.primitive()\n        transform.append(primitive_instance)\n        valid_dfs(es, aggregation, transform, self.primitive)\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/transform_primitive_tests/test_is_federal_holiday.py",
    "content": "from datetime import datetime\n\nimport numpy as np\nimport pandas as pd\nfrom pytest import raises\n\nfrom featuretools.primitives import IsFederalHoliday\n\n\ndef test_regular():\n    primitive_instance = IsFederalHoliday()\n    primitive_func = primitive_instance.get_function()\n    case = pd.Series(\n        [\n            \"2016-01-01\",\n            \"2016-02-29\",\n            \"2017-05-29\",\n            datetime(2019, 7, 4, 10, 0, 30),\n        ],\n    ).astype(\"datetime64[ns]\")\n    answer = pd.Series([True, False, True, True])\n    given_answer = pd.Series(primitive_func(case))\n    assert given_answer.equals(answer)\n\n\ndef test_nat():\n    primitive_instance = IsFederalHoliday()\n    primitive_func = primitive_instance.get_function()\n    case = pd.Series(\n        [\n            \"2019-10-14\",\n            \"NaT\",\n            \"2016-02-29\",\n            \"NaT\",\n        ],\n    ).astype(\"datetime64[ns]\")\n    answer = pd.Series([True, np.nan, False, np.nan])\n    given_answer = pd.Series(primitive_func(case))\n    assert given_answer.equals(answer)\n\n\ndef test_valid_country():\n    primitive_instance = IsFederalHoliday(country=\"Canada\")\n    primitive_func = primitive_instance.get_function()\n    case = pd.Series(\n        [\n            \"2016-07-01\",\n            \"2016-11-11\",\n            \"2018-09-03\",\n        ],\n    ).astype(\"datetime64[ns]\")\n    answer = pd.Series([True, False, True])\n    given_answer = pd.Series(primitive_func(case))\n    assert given_answer.equals(answer)\n\n\ndef test_invalid_country():\n    error_text = \"must be one of the available countries\"\n    with raises(ValueError, match=error_text):\n        IsFederalHoliday(country=\"\")\n\n\ndef test_multiple_countries():\n    primitive_mexico = IsFederalHoliday(country=\"Mexico\")\n    primitive_func = primitive_mexico.get_function()\n    case = pd.Series([datetime(2000, 9, 16), datetime(2005, 1, 1)])\n    assert len(primitive_func(case)) 
> 1\n    primitive_india = IsFederalHoliday(country=\"IND\")\n    primitive_func = primitive_mexico.get_function()\n    case = pd.Series([datetime(2048, 1, 1), datetime(2048, 10, 2)])\n    primitive_func = primitive_india.get_function()\n    assert len(primitive_func(case)) > 1\n    primitive_uk = IsFederalHoliday(country=\"UK\")\n    primitive_func = primitive_uk.get_function()\n    case = pd.Series([datetime(2048, 3, 17), datetime(2048, 4, 6)])\n    assert len(primitive_func(case)) > 1\n    countries = [\n        \"Argentina\",\n        \"AU\",\n        \"Austria\",\n        \"BY\",\n        \"Belgium\",\n        \"Brazil\",\n        \"Canada\",\n        \"Colombia\",\n        \"Croatia\",\n        \"England\",\n        \"Finland\",\n        \"FRA\",\n        \"Germany\",\n        \"Germany\",\n        \"Italy\",\n        \"NewZealand\",\n        \"PortugalExt\",\n        \"PTE\",\n        \"Spain\",\n        \"ES\",\n        \"Switzerland\",\n        \"UnitedStates\",\n        \"US\",\n        \"UK\",\n        \"UA\",\n        \"CH\",\n        \"SE\",\n        \"ZA\",\n    ]\n    for x in countries:\n        IsFederalHoliday(country=x)\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/transform_primitive_tests/test_latlong_primitives.py",
    "content": "import numpy as np\nimport pandas as pd\nimport pytest\n\nfrom featuretools.primitives import CityblockDistance, GeoMidpoint, IsInGeoBox\n\n\ndef test_cityblock():\n    primitive_instance = CityblockDistance()\n    latlong_1 = pd.Series([(i, i) for i in range(3)])\n    latlong_2 = pd.Series([(i, i) for i in range(3, 6)])\n    answer = pd.Series([414.56051391, 414.52893691, 414.43421555])\n    given_answer = primitive_instance(latlong_1, latlong_2)\n    np.testing.assert_allclose(given_answer, answer, rtol=1e-09)\n\n    primitive_instance = CityblockDistance(unit=\"kilometers\")\n    answer = primitive_instance(latlong_1, latlong_2)\n    given_answer = pd.Series([667.1704814, 667.11966315, 666.96722389])\n    np.testing.assert_allclose(given_answer, answer, rtol=1e-09)\n\n\ndef test_cityblock_nans():\n    primitive_instance = CityblockDistance()\n    lats_longs_1 = [(i, i) for i in range(2)]\n    lats_longs_2 = [(i, i) for i in range(2, 4)]\n    lats_longs_1 += [(1, 1), (np.nan, 3), (4, np.nan), (np.nan, np.nan)]\n    lats_longs_2 += [(np.nan, np.nan), (np.nan, 5), (6, np.nan), (np.nan, np.nan)]\n    given_answer = pd.Series(list([276.37367594, 276.35262728] + [np.nan] * 4))\n    answer = primitive_instance(lats_longs_1, lats_longs_2)\n    np.testing.assert_allclose(given_answer, answer, rtol=1e-09)\n\n\ndef test_cityblock_error():\n    error_text = \"Invalid unit given\"\n    with pytest.raises(ValueError, match=error_text):\n        CityblockDistance(unit=\"invalid\")\n\n\ndef test_midpoint():\n    latlong1 = pd.Series([(-90, -180), (90, 180)])\n    latlong2 = pd.Series([(+90, +180), (-90, -180)])\n    function = GeoMidpoint().get_function()\n    answer = function(latlong1, latlong2)\n    for lat, longi in answer:\n        assert lat == 0.0\n        assert longi == 0.0\n\n\ndef test_midpoint_floating():\n    latlong1 = pd.Series([(-45.5, -100.5), (45.5, 100.5)])\n    latlong2 = pd.Series([(+45.5, +100.5), (-45.5, -100.5)])\n    function = 
GeoMidpoint().get_function()\n    answer = function(latlong1, latlong2)\n    for lat, longi in answer:\n        assert lat == 0.0\n        assert longi == 0.0\n\n\ndef test_midpoint_zeros():\n    latlong1 = pd.Series([(0, 0), (0, 0)])\n    latlong2 = pd.Series([(0, 0), (0, 0)])\n    function = GeoMidpoint().get_function()\n    answer = function(latlong1, latlong2)\n    for lat, longi in answer:\n        assert lat == 0.0\n        assert longi == 0.0\n\n\ndef test_midpoint_nan():\n    all_nan = pd.Series([(np.nan, np.nan), (np.nan, np.nan)])\n    latlong1 = pd.Series([(0, 0), (0, 0)])\n    function = GeoMidpoint().get_function()\n    answer = function(all_nan, latlong1)\n    for lat, longi in answer:\n        assert np.isnan(lat)\n        assert np.isnan(longi)\n\n\ndef test_isingeobox():\n    latlong = pd.Series(\n        [\n            (1, 2),\n            (5, 7),\n            (-5, 4),\n            (2, 3),\n            (0, 0),\n            (np.nan, np.nan),\n            (-2, np.nan),\n            (np.nan, 1),\n        ],\n    )\n    bottomleft = (-5, -5)\n    topright = (5, 5)\n    primitive = IsInGeoBox(bottomleft, topright)\n    function = primitive.get_function()\n    primitive_answer = function(latlong)\n    answer = pd.Series([True, False, True, True, True, False, False, False])\n    assert np.array_equal(primitive_answer, answer)\n\n\ndef test_boston():\n    NYC = (40.7128, -74.0060)\n    SF = (37.7749, -122.4194)\n    Somerville = (42.3876, -71.0995)\n    Beijing = (39.9042, 116.4074)\n    CapeTown = (-33.9249, 18.4241)\n    latlong = pd.Series([NYC, SF, Somerville, Beijing, CapeTown])\n    LynnMA = (42.4668, -70.9495)\n    DedhamMA = (42.2436, -71.1677)\n    primitive = IsInGeoBox(LynnMA, DedhamMA)\n    function = primitive.get_function()\n    primitive_answer = function(latlong)\n    answer = pd.Series([False, False, True, False, False])\n    assert np.array_equal(primitive_answer, answer)\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/transform_primitive_tests/test_percent_change.py",
    "content": "import numpy as np\nimport pandas as pd\nfrom pytest import raises\n\nfrom featuretools.primitives import PercentChange\nfrom featuretools.tests.primitive_tests.utils import (\n    PrimitiveTestBase,\n    find_applicable_primitives,\n    valid_dfs,\n)\n\n\nclass TestPercentChange(PrimitiveTestBase):\n    primitive = PercentChange\n\n    def test_regular(self):\n        data = pd.Series([2, 5, 15, 3, 3, 9, 4.5])\n        answer = pd.Series([np.nan, 1.5, 2.0, -0.8, 0, 2.0, -0.5])\n        primtive_func = self.primitive().get_function()\n        given_answer = primtive_func(data)\n        np.testing.assert_array_equal(given_answer, answer)\n\n    def test_raises(self):\n        with raises(ValueError):\n            self.primitive(fill_method=\"invalid\")\n\n    def test_period(self):\n        data = pd.Series([2, 4, 8])\n        answer = pd.Series([np.nan, np.nan, 3])\n        primtive_func = self.primitive(periods=2).get_function()\n        given_answer = primtive_func(data)\n        np.testing.assert_array_equal(given_answer, answer)\n        primtive_func = self.primitive(periods=2).get_function()\n        data = pd.Series([2, 4, 8] + [np.nan] * 4)\n        primtive_func = self.primitive(limit=2).get_function()\n        answer = pd.Series([np.nan, 1, 1, 0, 0, np.nan, np.nan])\n        given_answer = primtive_func(data)\n        np.testing.assert_array_equal(given_answer, answer)\n\n    def test_nan(self):\n        data = pd.Series([np.nan, 5, 10, 20, np.nan, 10, np.nan])\n        answer = pd.Series([np.nan, np.nan, 1, 1, 0, -0.5, 0])\n        primtive_func = self.primitive().get_function()\n        given_answer = primtive_func(data)\n        np.testing.assert_array_equal(given_answer, answer)\n\n    def test_zero(self):\n        data = pd.Series([2, 0, 0, 5, 0, -4])\n        answer = pd.Series([np.nan, -1, np.nan, np.inf, -1, np.NINF])\n        primtive_func = self.primitive().get_function()\n        given_answer = primtive_func(data)\n        
np.testing.assert_array_equal(given_answer, answer)\n\n    def test_inf(self):\n        data = pd.Series([0, np.inf, 0, 5, np.NINF, np.inf, np.NINF])\n        answer = pd.Series([np.nan, np.inf, -1, np.inf, np.NINF, np.nan, np.nan])\n        primitive_func = self.primitive().get_function()\n        given_answer = primitive_func(data)\n        np.testing.assert_array_equal(given_answer, answer)\n\n    def test_freq(self):\n        dates = pd.DatetimeIndex(\n            [\"2018-01-01\", \"2018-01-02\", \"2018-01-03\", \"2018-01-05\"],\n        )\n        data = pd.Series([1, 2, 3, 4], index=dates)\n        answer = pd.Series([np.nan, 1.0, 0.5, np.nan])\n        date_offset = pd.tseries.offsets.DateOffset(days=1)\n        primitive_func = self.primitive(freq=date_offset).get_function()\n        given_answer = primitive_func(data)\n        np.testing.assert_array_equal(given_answer, answer)\n\n    def test_with_featuretools(self, es):\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        primitive_instantiate = self.primitive\n        transform.append(primitive_instantiate)\n        valid_dfs(es, aggregation, transform, self.primitive)\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/transform_primitive_tests/test_percent_unique.py",
    "content": "import numpy as np\nimport pandas as pd\n\nfrom featuretools.primitives import PercentUnique\nfrom featuretools.tests.primitive_tests.utils import (\n    PrimitiveTestBase,\n)\n\n\nclass TestPercentUnique(PrimitiveTestBase):\n    array = pd.Series([1, 1, 2, 2, 3, 4, 5, 6, 7, 8])\n    primitive = PercentUnique\n\n    def test_percent_unique(self):\n        primitive_func = self.primitive().get_function()\n        assert primitive_func(self.array) == (8 / 10.0)\n\n    def test_nans(self):\n        primitive_func = self.primitive().get_function()\n        array_nans = pd.concat([self.array.copy(), pd.Series([np.nan])])\n        assert primitive_func(array_nans) == (8 / 11.0)\n        primitive_func = self.primitive(skipna=False).get_function()\n        assert primitive_func(array_nans) == (9 / 11.0)\n\n    def test_multiple_nans(self):\n        primitive_func = self.primitive().get_function()\n        array_nans = pd.concat([self.array.copy(), pd.Series([np.nan] * 3)])\n        assert primitive_func(array_nans) == (8 / 13.0)\n        primitive_func = self.primitive(skipna=False).get_function()\n        assert primitive_func(array_nans) == (9 / 13.0)\n\n    def test_empty_string(self):\n        primitive_func = self.primitive().get_function()\n        array_empty_string = pd.concat([self.array.copy(), pd.Series([np.nan, \"\", \"\"])])\n        assert primitive_func(array_empty_string) == (9 / 13.0)\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/transform_primitive_tests/test_postal_primitives.py",
    "content": "import pandas as pd\n\nfrom featuretools.primitives.standard.transform.postal import (\n    OneDigitPostalCode,\n    TwoDigitPostalCode,\n)\n\n\ndef test_one_digit_postal_code(postal_code_dataframe):\n    primitive = OneDigitPostalCode().get_function()\n    for x in postal_code_dataframe:\n        series = postal_code_dataframe[x]\n        actual = primitive(series)\n        expected = series.apply(lambda t: str(t)[0] if pd.notna(t) else pd.NA)\n        pd.testing.assert_series_equal(actual, expected)\n\n\ndef test_two_digit_postal_code(postal_code_dataframe):\n    primitive = TwoDigitPostalCode().get_function()\n    for x in postal_code_dataframe:\n        series = postal_code_dataframe[x]\n        actual = primitive(series)\n        expected = series.apply(lambda t: str(t)[:2] if pd.notna(t) else pd.NA)\n        pd.testing.assert_series_equal(actual, expected)\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/transform_primitive_tests/test_same_as_previous.py",
    "content": "import numpy as np\nimport pandas as pd\nimport pytest\n\nfrom featuretools.primitives import SameAsPrevious\n\n\nclass TestSameAsPrevious:\n    def test_ints(self):\n        primitive_func = SameAsPrevious().get_function()\n        array = pd.Series([1, 2, 2, 3, 2], dtype=\"int64\")\n        answer = primitive_func(array)\n        correct_answer = pd.Series([False, False, True, False, False])\n        pd.testing.assert_series_equal(answer, correct_answer)\n\n    def test_int64(self):\n        primitive_func = SameAsPrevious().get_function()\n        array = pd.Series([1, 2, 2, 3, 2], dtype=\"Int64\")\n        answer = primitive_func(array)\n        correct_answer = pd.Series([False, False, True, False, False], dtype=\"boolean\")\n        pd.testing.assert_series_equal(answer, correct_answer)\n\n    def test_floats(self):\n        primitive_func = SameAsPrevious().get_function()\n        array = pd.Series([1.0, 2.5, 2.5, 3.0, 2.0], dtype=\"float64\")\n        answer = primitive_func(array)\n        correct_answer = pd.Series([False, False, True, False, False])\n        pd.testing.assert_series_equal(answer, correct_answer)\n\n    def test_mixed(self):\n        primitive_func = SameAsPrevious().get_function()\n        array = pd.Series([1, 2, 2.0, 3, 2.0], dtype=\"float64\")\n        answer = primitive_func(array)\n        correct_answer = pd.Series([False, False, True, False, False])\n        np.testing.assert_array_equal(answer, correct_answer)\n\n    def test_nan(self):\n        primitive_instance = SameAsPrevious()\n        primitive_func = primitive_instance.get_function()\n        array = pd.Series([1, np.nan, 3, np.nan, 2], dtype=\"float64\")\n        answer = primitive_func(array)\n        correct_answer = pd.Series([False, True, False, True, False])\n        np.testing.assert_array_equal(answer, correct_answer)\n\n    def test_all_nan(self):\n        primitive_instance = SameAsPrevious()\n        primitive_func = 
primitive_instance.get_function()\n        array = pd.Series([np.nan, np.nan, np.nan, np.nan], dtype=\"float64\")\n        answer = primitive_func(array)\n        correct_answer = pd.Series([False, False, False, False])\n        np.testing.assert_array_equal(answer, correct_answer)\n\n    def test_inf(self):\n        primitive_instance = SameAsPrevious()\n        primitive_func = primitive_instance.get_function()\n        array = pd.Series([1, np.inf, 3, np.inf, 2], dtype=\"float64\")\n        answer = primitive_func(array)\n        correct_answer = pd.Series([False, False, False, False, False])\n        np.testing.assert_array_equal(answer, correct_answer)\n\n    def test_all_inf(self):\n        primitive_instance = SameAsPrevious()\n        primitive_func = primitive_instance.get_function()\n        array = pd.Series([np.inf, np.inf, np.inf, np.inf], dtype=\"float64\")\n        answer = primitive_func(array)\n        correct_answer = pd.Series([False, True, True, True])\n        np.testing.assert_array_equal(answer, correct_answer)\n\n    def test_fill_method_bfill(self):\n        primitive_instance = SameAsPrevious(fill_method=\"bfill\")\n        primitive_func = primitive_instance.get_function()\n        array = pd.Series([1, np.nan, 3, 2, 2], dtype=\"float64\")\n        answer = primitive_func(array)\n        correct_answer = pd.Series([False, False, True, False, True])\n        np.testing.assert_array_equal(answer, correct_answer)\n\n    def test_fill_method_bfill_with_limit(self):\n        primitive_instance = SameAsPrevious(fill_method=\"bfill\", limit=2)\n        primitive_func = primitive_instance.get_function()\n        array = pd.Series([1, np.nan, np.nan, np.nan, 2, 3], dtype=\"float64\")\n        answer = primitive_func(array)\n        correct_answer = pd.Series([False, False, False, True, True, False])\n        np.testing.assert_array_equal(answer, correct_answer)\n\n    def test_raises(self):\n        with pytest.raises(ValueError):\n            
SameAsPrevious(fill_method=\"invalid\")\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/transform_primitive_tests/test_savgol_filter.py",
    "content": "from math import floor\n\nimport numpy as np\nimport pandas as pd\nfrom pytest import raises\n\nfrom featuretools.primitives import SavgolFilter\nfrom featuretools.tests.primitive_tests.utils import (\n    PrimitiveTestBase,\n    find_applicable_primitives,\n    valid_dfs,\n)\n\n\nclass TestSavgolFilter(PrimitiveTestBase):\n    primitive = SavgolFilter\n    data = pd.Series(\n        [\n            0,\n            1,\n            1,\n            2,\n            3,\n            4,\n            5,\n            7,\n            8,\n            7,\n            9,\n            9,\n            12,\n            11,\n            12,\n            14,\n            15,\n            17,\n            17,\n            17,\n            20,\n            21,\n            20,\n            20,\n            22,\n            21,\n            25,\n            25,\n            26,\n            29,\n            30,\n            30,\n            28,\n            26,\n            34,\n            35,\n            33,\n            31,\n            38,\n            34,\n            39,\n            37,\n            42,\n            35,\n            36,\n            44,\n            46,\n            43,\n            39,\n            39,\n            44,\n            49,\n            45,\n            44,\n            44,\n            52,\n            50,\n            47,\n            58,\n            59,\n            60,\n            55,\n            57,\n            63,\n            61,\n            65,\n            66,\n            57,\n            65,\n            61,\n            60,\n            71,\n            64,\n            62,\n            70,\n            65,\n            67,\n            77,\n            68,\n            75,\n            72,\n            69,\n            82,\n            66,\n            84,\n            80,\n            76,\n            87,\n            77,\n            73,\n            90,\n            91,\n            92,\n            93,\n        
    78,\n            76,\n            82,\n            96,\n            91,\n            94,\n        ],\n    )\n    expected_output = pd.Series(\n        [\n            -0.24600037643516087,\n            0.6354225484660259,\n            1.518717742974036,\n            2.405318302343475,\n            3.296657321828948,\n            4.1941678966850615,\n            5.099283122166421,\n            6.0134360935276305,\n            6.938059906023296,\n            7.874587654908025,\n            8.824452435436303,\n            9.786858450473883,\n            10.923177508989724,\n            12.025171624713803,\n            13.009153318077633,\n            14.08041843739766,\n            14.900621118012227,\n            15.796338672768673,\n            16.77084014383764,\n            17.662961752206375,\n            18.472703497874882,\n            19.451454723765682,\n            20.530565544295253,\n            21.849950964367157,\n            22.478260869564927,\n            23.15233736515171,\n            24.12356979405003,\n            25.23962079110788,\n            26.000980712650854,\n            27.082379862699877,\n            27.787839163124843,\n            28.879045439685797,\n            29.762994442627924,\n            31.067342268714864,\n            32.11147433801854,\n            32.666557698593884,\n            33.06864988558309,\n            34.00098071265075,\n            35.134030728995945,\n            36.15135665250035,\n            36.945733899966825,\n            37.56227525335028,\n            38.55769859431137,\n            39.3975155279498,\n            39.87054593004198,\n            40.304347826086435,\n            41.11670480549146,\n            42.00948022229432,\n            41.982674076495044,\n            42.62798300098016,\n            43.15887544949274,\n            44.53481529911678,\n            45.680614579927486,\n            46.93886891140834,\n            47.98300098071202,\n            48.80549199084604,\n            
50.28244524354299,\n            52.66851912389601,\n            54.28604118993064,\n            55.81529911735788,\n            57.10297482837455,\n            57.82641386073805,\n            59.45276234063342,\n            60.77280156913945,\n            61.23667865315383,\n            61.81660673422607,\n            62.60281137626594,\n            62.54004576658957,\n            62.78653154625613,\n            63.23046747302958,\n            64.09087937234307,\n            65.25661981039471,\n            65.19385420071833,\n            66.34161490683144,\n            66.65021248774022,\n            67.38280483818154,\n            68.8126838836212,\n            69.79470415168265,\n            70.943772474664,\n            72.74076495586698,\n            73.04020921869797,\n            73.3586139261187,\n            74.67734553775647,\n            75.71559333115299,\n            77.51814318404607,\n            79.62471395880902,\n            80.60150375939745,\n            80.61163779012645,\n            81.89342922523593,\n            82.41124550506593,\n            83.19293292519846,\n            83.97174920172642,\n            84.7620599588564,\n            85.57823082079385,\n            86.4346274117442,\n            87.34561535591293,\n            88.32556027750543,\n            89.38882780072717,\n            90.54978354978357,\n            91.82279314888011,\n        ],\n    )\n\n    def test_error(self):\n        window_length = 1\n        polyorder = 3\n        mode = \"incorrect\"\n        error_text = \"polyorder must be less than window_length.\"\n        with raises(ValueError, match=error_text):\n            self.primitive(window_length, polyorder)\n\n        error_text = (\n            \"Both window_length and polyorder must be defined if you define one.\"\n        )\n\n        with raises(ValueError, match=error_text):\n            self.primitive(window_length=window_length)\n        with raises(ValueError, match=error_text):\n            
self.primitive(polyorder=polyorder)\n        error_text = \"mode must be 'mirror', 'constant', 'nearest', 'wrap' or 'interp'.\"\n        with raises(ValueError, match=error_text):\n            self.primitive(\n                window_length=window_length,\n                polyorder=polyorder,\n                mode=mode,\n            )\n\n    def test_less_window_size(self):\n        primitive_func = self.primitive().get_function()\n        for i in range(20):\n            data = pd.Series(list(range(i)), dtype=\"float64\")\n            assert data.equals(primitive_func(data))\n\n    def test_regular(self):\n        window_length = floor(len(self.data) / 10) * 2 + 1\n        polyorder = 3\n        primitive_func = self.primitive(window_length, polyorder).get_function()\n        output = list(primitive_func(self.data))\n        for a, b in zip(self.expected_output, output):\n            assert np.isclose(a, b)\n\n    def test_nans(self):\n        primitive_func = self.primitive().get_function()\n        data_nans = self.data.copy()\n        data_nans = pd.concat([data_nans, pd.Series([np.nan] * 5, dtype=\"float64\")])\n        # more than 5 nans due to window\n        assert sum(np.isnan(primitive_func(data_nans))) == 15\n\n    def test_with_featuretools(self, es):\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        primitive_instantiate = self.primitive()\n        transform.append(primitive_instantiate)\n        valid_dfs(es, aggregation, transform, self.primitive)\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/transform_primitive_tests/test_season.py",
    "content": "from datetime import datetime\n\nimport pandas as pd\n\nfrom featuretools.primitives import Season\n\n\nclass TestSeason:\n    def test_regular(self):\n        primitive_instance = Season()\n        primitive_func = primitive_instance.get_function()\n        case = pd.date_range(start=\"2019-01\", periods=12, freq=\"m\").to_series()\n        answer = pd.Series(\n            [\n                \"winter\",\n                \"winter\",\n                \"spring\",\n                \"spring\",\n                \"spring\",\n                \"summer\",\n                \"summer\",\n                \"summer\",\n                \"fall\",\n                \"fall\",\n                \"fall\",\n                \"winter\",\n            ],\n            dtype=\"string\",\n        )\n        given_answer = primitive_func(case)\n        pd.testing.assert_series_equal(\n            given_answer.reset_index(drop=True),\n            answer.reset_index(drop=True),\n        )\n\n    def test_nat(self):\n        primitive_instance = Season()\n        primitive_func = primitive_instance.get_function()\n        case = pd.Series(\n            [\n                \"NaT\",\n                \"2019-02\",\n                \"2019-03\",\n                \"NaT\",\n            ],\n        ).astype(\"datetime64[ns]\")\n        answer = pd.Series([pd.NA, \"winter\", \"winter\", pd.NA], dtype=\"string\")\n        given_answer = pd.Series(primitive_func(case))\n        pd.testing.assert_series_equal(given_answer, answer)\n\n    def test_datetime(self):\n        primitive_instance = Season()\n        primitive_func = primitive_instance.get_function()\n        case = pd.Series(\n            [\n                datetime(2011, 3, 1),\n                datetime(2011, 6, 1),\n                datetime(2011, 9, 1),\n                datetime(2011, 12, 1),\n                # leap year\n                datetime(2020, 2, 29),\n            ],\n        )\n        answer = pd.Series(\n            
[\"winter\", \"spring\", \"summer\", \"fall\", \"winter\"],\n            dtype=\"string\",\n        )\n        given_answer = primitive_func(case)\n        pd.testing.assert_series_equal(given_answer, answer)\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/transform_primitive_tests/test_transform_primitive.py",
    "content": "import warnings\nfrom datetime import datetime\n\nimport numpy as np\nimport pandas as pd\nimport pytest\nfrom pytz import timezone\n\nfrom featuretools.primitives import (\n    Age,\n    DateToTimeZone,\n    DayOfYear,\n    DaysInMonth,\n    EmailAddressToDomain,\n    FileExtension,\n    IsFirstWeekOfMonth,\n    IsFreeEmailDomain,\n    IsLeapYear,\n    IsLunchTime,\n    IsMonthEnd,\n    IsMonthStart,\n    IsQuarterEnd,\n    IsQuarterStart,\n    IsWorkingHours,\n    IsYearEnd,\n    IsYearStart,\n    Lag,\n    NthWeekOfMonth,\n    NumericLag,\n    PartOfDay,\n    Quarter,\n    RateOfChange,\n    TimeSince,\n    URLToDomain,\n    URLToProtocol,\n    URLToTLD,\n    Week,\n    get_transform_primitives,\n)\nfrom featuretools.tests.primitive_tests.utils import (\n    PrimitiveTestBase,\n    find_applicable_primitives,\n    valid_dfs,\n)\n\n\ndef test_time_since():\n    time_since = TimeSince()\n    # class datetime.datetime(year, month, day[, hour[, minute[, second[, microsecond[,\n    times = pd.Series(\n        [\n            datetime(2019, 3, 1, 0, 0, 0, 1),\n            datetime(2019, 3, 1, 0, 0, 1, 0),\n            datetime(2019, 3, 1, 0, 2, 0, 0),\n        ],\n    )\n    cutoff_time = datetime(2019, 3, 1, 0, 0, 0, 0)\n    values = time_since(array=times, time=cutoff_time)\n\n    assert list(map(int, values)) == [0, -1, -120]\n\n    time_since = TimeSince(unit=\"nanoseconds\")\n    values = time_since(array=times, time=cutoff_time)\n    assert list(map(round, values)) == [-1000, -1000000000, -120000000000]\n\n    time_since = TimeSince(unit=\"milliseconds\")\n    values = time_since(array=times, time=cutoff_time)\n    assert list(map(int, values)) == [0, -1000, -120000]\n\n    time_since = TimeSince(unit=\"Milliseconds\")\n    values = time_since(array=times, time=cutoff_time)\n    assert list(map(int, values)) == [0, -1000, -120000]\n\n    time_since = TimeSince(unit=\"Years\")\n    values = time_since(array=times, time=cutoff_time)\n    assert 
list(map(int, values)) == [0, 0, 0]\n\n    times_y = pd.Series(\n        [\n            datetime(2019, 3, 1, 0, 0, 0, 1),\n            datetime(2020, 3, 1, 0, 0, 1, 0),\n            datetime(2017, 3, 1, 0, 0, 0, 0),\n        ],\n    )\n\n    time_since = TimeSince(unit=\"Years\")\n    values = time_since(array=times_y, time=cutoff_time)\n    assert list(map(int, values)) == [0, -1, 1]\n\n    error_text = \"Invalid unit given, make sure it is plural\"\n    with pytest.raises(ValueError, match=error_text):\n        time_since = TimeSince(unit=\"na\")\n        time_since(array=times, time=cutoff_time)\n\n\ndef test_age():\n    age = Age()\n    dates = pd.Series(datetime(2010, 2, 26))\n    ages = age(dates, time=datetime(2020, 2, 26))\n    correct_ages = [10.005]  # .005 added due to leap years\n    np.testing.assert_array_almost_equal(ages, correct_ages, decimal=3)\n\n\ndef test_age_two_years_quarterly():\n    age = Age()\n    dates = pd.Series(pd.date_range(\"2010-01-01\", \"2011-12-31\", freq=\"Q\"))\n    ages = age(dates, time=datetime(2020, 2, 26))\n    correct_ages = [9.915, 9.666, 9.414, 9.162, 8.915, 8.666, 8.414, 8.162]\n    np.testing.assert_array_almost_equal(ages, correct_ages, decimal=3)\n\n\ndef test_age_leap_year():\n    age = Age()\n    dates = pd.Series([datetime(2016, 1, 1)])\n    ages = age(dates, time=datetime(2016, 3, 1))\n    correct_ages = [(31 + 29) / 365.0]\n    np.testing.assert_array_almost_equal(ages, correct_ages, decimal=3)\n    # born leap year date\n    dates = pd.Series([datetime(2016, 2, 29)])\n    ages = age(dates, time=datetime(2020, 2, 29))\n    correct_ages = [4.0027]  # .0027 added due to leap year\n    np.testing.assert_array_almost_equal(ages, correct_ages, decimal=3)\n\n\ndef test_age_nan():\n    age = Age()\n    dates = pd.Series([datetime(2010, 1, 1), np.nan, datetime(2012, 1, 1)])\n    ages = age(dates, time=datetime(2020, 2, 26))\n    correct_ages = [10.159, np.nan, 8.159]\n    np.testing.assert_array_almost_equal(ages, 
correct_ages, decimal=3)\n\n\ndef test_day_of_year():\n    doy = DayOfYear()\n    dates = pd.Series([datetime(2019, 12, 31), np.nan, datetime(2020, 12, 31)])\n    days_of_year = doy(dates)\n    correct_days = [365, np.nan, 366]\n    np.testing.assert_array_equal(days_of_year, correct_days)\n\n\ndef test_days_in_month():\n    dim = DaysInMonth()\n    dates = pd.Series(\n        [datetime(2010, 1, 1), datetime(2019, 2, 1), np.nan, datetime(2020, 2, 1)],\n    )\n    days_in_month = dim(dates)\n    correct_days = [31, 28, np.nan, 29]\n    np.testing.assert_array_equal(days_in_month, correct_days)\n\n\ndef test_is_leap_year():\n    ily = IsLeapYear()\n    dates = pd.Series([datetime(2020, 1, 1), datetime(2021, 1, 1)])\n    leap_year_bools = ily(dates)\n    correct_bools = [True, False]\n    np.testing.assert_array_equal(leap_year_bools, correct_bools)\n\n\ndef test_is_month_end():\n    ime = IsMonthEnd()\n    dates = pd.Series(\n        [datetime(2019, 3, 1), datetime(2021, 2, 28), datetime(2020, 2, 29)],\n    )\n    ime_bools = ime(dates)\n    correct_bools = [False, True, True]\n    np.testing.assert_array_equal(ime_bools, correct_bools)\n\n\ndef test_is_month_start():\n    ims = IsMonthStart()\n    dates = pd.Series(\n        [datetime(2019, 3, 1), datetime(2020, 2, 28), datetime(2020, 2, 29)],\n    )\n    ims_bools = ims(dates)\n    correct_bools = [True, False, False]\n    np.testing.assert_array_equal(ims_bools, correct_bools)\n\n\ndef test_is_quarter_end():\n    iqe = IsQuarterEnd()\n    dates = pd.Series([datetime(2020, 1, 1), datetime(2021, 3, 31)])\n    iqe_bools = iqe(dates)\n    correct_bools = [False, True]\n    np.testing.assert_array_equal(iqe_bools, correct_bools)\n\n\ndef test_is_quarter_start():\n    iqs = IsQuarterStart()\n    dates = pd.Series([datetime(2020, 1, 1), datetime(2021, 3, 31)])\n    iqs_bools = iqs(dates)\n    correct_bools = [True, False]\n    np.testing.assert_array_equal(iqs_bools, correct_bools)\n\n\ndef 
test_is_lunch_time_default():\n    is_lunch_time = IsLunchTime()\n    dates = pd.Series(\n        [\n            datetime(2022, 6, 26, 12, 12, 12),\n            datetime(2022, 6, 28, 12, 3, 4),\n            datetime(2022, 6, 28, 11, 3, 4),\n            np.nan,\n        ],\n    )\n    actual = is_lunch_time(dates)\n    expected = [True, True, False, False]\n    np.testing.assert_array_equal(actual, expected)\n\n\ndef test_is_lunch_time_configurable():\n    is_lunch_time = IsLunchTime(14)\n    dates = pd.Series(\n        [\n            datetime(2022, 6, 26, 12, 12, 12),\n            datetime(2022, 6, 28, 14, 3, 4),\n            datetime(2022, 6, 28, 11, 3, 4),\n            np.nan,\n        ],\n    )\n    actual = is_lunch_time(dates)\n    expected = [False, True, False, False]\n    np.testing.assert_array_equal(actual, expected)\n\n\ndef test_is_working_hours_standard_hours():\n    is_working_hours = IsWorkingHours()\n    dates = pd.Series(\n        [\n            datetime(2022, 6, 21, 16, 3, 3),\n            datetime(2019, 1, 3, 4, 4, 4),\n            datetime(2022, 1, 1, 12, 1, 2),\n        ],\n    )\n    actual = is_working_hours(dates).tolist()\n    expected = [True, False, True]\n    np.testing.assert_array_equal(actual, expected)\n\n\ndef test_is_working_hours_configured_hours():\n    is_working_hours = IsWorkingHours(15, 18)\n    dates = pd.Series(\n        [\n            datetime(2022, 6, 21, 16, 3, 3),\n            datetime(2022, 6, 26, 14, 4, 4),\n            datetime(2022, 1, 1, 12, 1, 2),\n        ],\n    )\n    answer = is_working_hours(dates).tolist()\n    expected = [True, False, False]\n    np.testing.assert_array_equal(answer, expected)\n\n\ndef test_part_of_day():\n    pod = PartOfDay()\n    dates = pd.Series(\n        [\n            datetime(2020, 1, 11, 0, 2, 1),\n            datetime(2020, 1, 11, 1, 2, 1),\n            datetime(2021, 3, 31, 4, 2, 1),\n            datetime(2020, 3, 4, 6, 2, 1),\n            datetime(2020, 3, 4, 8, 2, 1),\n         
   datetime(2020, 3, 4, 11, 2, 1),\n            datetime(2020, 3, 4, 14, 2, 3),\n            datetime(2020, 3, 4, 17, 2, 3),\n            datetime(2020, 2, 2, 20, 2, 2),\n            np.nan,\n        ],\n    )\n    actual = pod(dates)\n    expected = pd.Series(\n        [\n            \"midnight\",\n            \"midnight\",\n            \"dawn\",\n            \"early morning\",\n            \"late morning\",\n            \"noon\",\n            \"afternoon\",\n            \"evening\",\n            \"night\",\n            np.nan,\n        ],\n    )\n    pd.testing.assert_series_equal(expected, actual)\n\n\ndef test_is_year_end():\n    is_year_end = IsYearEnd()\n    dates = pd.Series([datetime(2020, 12, 31), np.nan, datetime(2020, 1, 1)])\n    answer = is_year_end(dates)\n    correct_answer = [True, False, False]\n    np.testing.assert_array_equal(answer, correct_answer)\n\n\ndef test_is_year_start():\n    is_year_start = IsYearStart()\n    dates = pd.Series([datetime(2020, 12, 31), np.nan, datetime(2020, 1, 1)])\n    answer = is_year_start(dates)\n    correct_answer = [False, False, True]\n    np.testing.assert_array_equal(answer, correct_answer)\n\n\ndef test_quarter_regular():\n    q = Quarter()\n    array = pd.Series(\n        [\n            pd.to_datetime(\"2018-01-01\"),\n            pd.to_datetime(\"2018-04-01\"),\n            pd.to_datetime(\"2018-07-01\"),\n            pd.to_datetime(\"2018-10-01\"),\n        ],\n    )\n    answer = q(array)\n    correct_answer = pd.Series([1, 2, 3, 4])\n    np.testing.assert_array_equal(answer, correct_answer)\n\n\ndef test_quarter_leap_year():\n    q = Quarter()\n    array = pd.Series(\n        [\n            pd.to_datetime(\"2016-02-29\"),\n            pd.to_datetime(\"2018-04-01\"),\n            pd.to_datetime(\"2018-07-01\"),\n            pd.to_datetime(\"2018-10-01\"),\n        ],\n    )\n    answer = q(array)\n    correct_answer = pd.Series([1, 2, 3, 4])\n    np.testing.assert_array_equal(answer, 
correct_answer)\n\n\ndef test_quarter_nan_and_nat_input():\n    q = Quarter()\n    array = pd.Series(\n        [\n            pd.to_datetime(\"2016-02-29\"),\n            np.nan,\n            np.datetime64(\"NaT\"),\n            pd.to_datetime(\"2018-10-01\"),\n        ],\n    )\n    answer = q(array)\n    correct_answer = pd.Series([1, np.nan, np.nan, 4])\n    np.testing.assert_array_equal(answer, correct_answer)\n\n\ndef test_quarter_year_before_1970():\n    q = Quarter()\n    array = pd.Series(\n        [\n            pd.to_datetime(\"2018-01-01\"),\n            pd.to_datetime(\"1950-04-01\"),\n            pd.to_datetime(\"1874-07-01\"),\n            pd.to_datetime(\"2018-10-01\"),\n        ],\n    )\n    answer = q(array)\n    correct_answer = pd.Series([1, 2, 3, 4])\n    np.testing.assert_array_equal(answer, correct_answer)\n\n\ndef test_quarter_year_after_2038():\n    q = Quarter()\n    array = pd.Series(\n        [\n            pd.to_datetime(\"2018-01-01\"),\n            pd.to_datetime(\"2050-04-01\"),\n            pd.to_datetime(\"2174-07-01\"),\n            pd.to_datetime(\"2018-10-01\"),\n        ],\n    )\n    answer = q(array)\n    correct_answer = pd.Series([1, 2, 3, 4])\n    np.testing.assert_array_equal(answer, correct_answer)\n\n\ndef test_quarter():\n    q = Quarter()\n    dates = [datetime(2019, 12, 1), datetime(2019, 1, 3), datetime(2020, 2, 1)]\n    quarter = q(dates)\n    correct_quarters = [4, 1, 1]\n    np.testing.assert_array_equal(quarter, correct_quarters)\n\n\ndef test_week_no_deprecation_message():\n    dates = [\n        datetime(2019, 1, 3),\n        datetime(2019, 6, 17, 11, 10, 50),\n        datetime(2019, 11, 30, 19, 45, 15),\n    ]\n    with warnings.catch_warnings():\n        warnings.simplefilter(\"error\")\n        week = Week()\n        week(dates).tolist()\n\n\ndef test_url_to_domain_urls():\n    url_to_domain = URLToDomain()\n    urls = pd.Series(\n        [\n            
\"https://play.google.com/store/apps/details?id=com.skgames.trafficracer%22\",\n            \"http://mplay.google.co.in/sadfask/asdkfals?dk=10\",\n            \"http://lplay.google.co.in/sadfask/asdkfals?dk=10\",\n            \"http://play.google.co.in/sadfask/asdkfals?dk=10\",\n            \"http://tplay.google.co.in/sadfask/asdkfals?dk=10\",\n            \"http://www.google.co.in/sadfask/asdkfals?dk=10\",\n            \"www.google.co.in/sadfask/asdkfals?dk=10\",\n            \"http://user:pass@google.com/?a=b#asdd\",\n            \"https://www.compzets.com?asd=10\",\n            \"www.compzets.com?asd=10\",\n            \"facebook.com\",\n            \"https://www.compzets.net?asd=10\",\n            \"http://www.featuretools.org\",\n        ],\n    )\n    correct_urls = [\n        \"play.google.com\",\n        \"mplay.google.co.in\",\n        \"lplay.google.co.in\",\n        \"play.google.co.in\",\n        \"tplay.google.co.in\",\n        \"google.co.in\",\n        \"google.co.in\",\n        \"google.com\",\n        \"compzets.com\",\n        \"compzets.com\",\n        \"facebook.com\",\n        \"compzets.net\",\n        \"featuretools.org\",\n    ]\n    np.testing.assert_array_equal(url_to_domain(urls), correct_urls)\n\n\ndef test_url_to_domain_long_url():\n    url_to_domain = URLToDomain()\n    urls = pd.Series(\n        [\n            \"http://chart.apis.google.com/chart?chs=500x500&chma=0,0,100, \\\n                        100&cht=p&chco=FF0000%2CFFFF00%7CFF8000%2C00FF00%7C00FF00%2C0 \\\n                        000FF&chd=t%3A122%2C42%2C17%2C10%2C8%2C7%2C7%2C7%2C7%2C6%2C6% \\\n                        2C6%2C6%2C5%2C5&chl=122%7C42%7C17%7C10%7C8%7C7%7C7%7C7%7C7%7C \\\n                        6%7C6%7C6%7C6%7C5%7C5&chdl=android%7Cjava%7Cstack-trace%7Cbro \\\n                        adcastreceiver%7Candroid-ndk%7Cuser-agent%7Candroid-webview%7 \\\n                        Cwebview%7Cbackground%7Cmultithreading%7Candroid-source%7Csms \\\n                        
%7Cadb%7Csollections%7Cactivity|Chart\",\n        ],\n    )\n    correct_urls = [\"chart.apis.google.com\"]\n    results = url_to_domain(urls)\n    np.testing.assert_array_equal(results, correct_urls)\n\n\ndef test_url_to_domain_nan():\n    url_to_domain = URLToDomain()\n    urls = pd.Series([\"www.featuretools.com\", np.nan], dtype=\"object\")\n    correct_urls = pd.Series([\"featuretools.com\", np.nan], dtype=\"object\")\n    results = url_to_domain(urls)\n    pd.testing.assert_series_equal(results, correct_urls)\n\n\ndef test_url_to_protocol_urls():\n    url_to_protocol = URLToProtocol()\n    urls = pd.Series(\n        [\n            \"https://play.google.com/store/apps/details?id=com.skgames.trafficracer%22\",\n            \"http://mplay.google.co.in/sadfask/asdkfals?dk=10\",\n            \"http://lplay.google.co.in/sadfask/asdkfals?dk=10\",\n            \"www.google.co.in/sadfask/asdkfals?dk=10\",\n            \"http://user:pass@google.com/?a=b#asdd\",\n            \"https://www.compzets.com?asd=10\",\n            \"www.compzets.com?asd=10\",\n            \"facebook.com\",\n            \"https://www.compzets.net?asd=10\",\n            \"http://www.featuretools.org\",\n            \"https://featuretools.com\",\n        ],\n    )\n    correct_urls = pd.Series(\n        [\n            \"https\",\n            \"http\",\n            \"http\",\n            np.nan,\n            \"http\",\n            \"https\",\n            np.nan,\n            np.nan,\n            \"https\",\n            \"http\",\n            \"https\",\n        ],\n    )\n    results = url_to_protocol(urls)\n    pd.testing.assert_series_equal(results, correct_urls)\n\n\ndef test_url_to_protocol_long_url():\n    url_to_protocol = URLToProtocol()\n    urls = pd.Series(\n        [\n            \"http://chart.apis.google.com/chart?chs=500x500&chma=0,0,100, \\\n                        100&cht=p&chco=FF0000%2CFFFF00%7CFF8000%2C00FF00%7C00FF00%2C0 \\\n                        
000FF&chd=t%3A122%2C42%2C17%2C10%2C8%2C7%2C7%2C7%2C7%2C6%2C6% \\\n                        2C6%2C6%2C5%2C5&chl=122%7C42%7C17%7C10%7C8%7C7%7C7%7C7%7C7%7C \\\n                        6%7C6%7C6%7C6%7C5%7C5&chdl=android%7Cjava%7Cstack-trace%7Cbro \\\n                        adcastreceiver%7Candroid-ndk%7Cuser-agent%7Candroid-webview%7 \\\n                        Cwebview%7Cbackground%7Cmultithreading%7Candroid-source%7Csms \\\n                        %7Cadb%7Csollections%7Cactivity|Chart\",\n        ],\n    )\n    correct_urls = [\"http\"]\n    results = url_to_protocol(urls)\n    np.testing.assert_array_equal(results, correct_urls)\n\n\ndef test_url_to_protocol_nan():\n    url_to_protocol = URLToProtocol()\n    urls = pd.Series([\"www.featuretools.com\", np.nan, \"\"], dtype=\"object\")\n    correct_urls = pd.Series([np.nan, np.nan, np.nan], dtype=\"object\")\n    results = url_to_protocol(urls)\n    pd.testing.assert_series_equal(results, correct_urls)\n\n\ndef test_url_to_tld_urls():\n    url_to_tld = URLToTLD()\n    urls = pd.Series(\n        [\n            \"https://play.google.com/store/apps/details?id=com.skgames.trafficracer%22\",\n            \"http://mplay.google.co.in/sadfask/asdkfals?dk=10\",\n            \"http://lplay.google.co.in/sadfask/asdkfals?dk=10\",\n            \"http://play.google.co.in/sadfask/asdkfals?dk=10\",\n            \"http://tplay.google.co.in/sadfask/asdkfals?dk=10\",\n            \"http://www.google.co.in/sadfask/asdkfals?dk=10\",\n            \"www.google.co.in/sadfask/asdkfals?dk=10\",\n            \"http://user:pass@google.com/?a=b#asdd\",\n            \"https://www.compzets.dev?asd=10\",\n            \"www.compzets.com?asd=10\",\n            \"https://www.compzets.net?asd=10\",\n            \"http://www.featuretools.org\",\n            \"featuretools.org\",\n        ],\n    )\n    correct_urls = [\n        \"com\",\n        \"in\",\n        \"in\",\n        \"in\",\n        \"in\",\n        \"in\",\n        \"in\",\n        
\"com\",\n        \"dev\",\n        \"com\",\n        \"net\",\n        \"org\",\n        \"org\",\n    ]\n    np.testing.assert_array_equal(url_to_tld(urls), correct_urls)\n\n\ndef test_url_to_tld_long_url():\n    url_to_tld = URLToTLD()\n    urls = pd.Series(\n        [\n            \"http://chart.apis.google.com/chart?chs=500x500&chma=0,0,100, \\\n                        100&cht=p&chco=FF0000%2CFFFF00%7CFF8000%2C00FF00%7C00FF00%2C0 \\\n                        000FF&chd=t%3A122%2C42%2C17%2C10%2C8%2C7%2C7%2C7%2C7%2C6%2C6% \\\n                        2C6%2C6%2C5%2C5&chl=122%7C42%7C17%7C10%7C8%7C7%7C7%7C7%7C7%7C \\\n                        6%7C6%7C6%7C6%7C5%7C5&chdl=android%7Cjava%7Cstack-trace%7Cbro \\\n                        adcastreceiver%7Candroid-ndk%7Cuser-agent%7Candroid-webview%7 \\\n                        Cwebview%7Cbackground%7Cmultithreading%7Candroid-source%7Csms \\\n                        %7Cadb%7Csollections%7Cactivity|Chart\",\n        ],\n    )\n    correct_urls = [\"com\"]\n    np.testing.assert_array_equal(url_to_tld(urls), correct_urls)\n\n\ndef test_url_to_tld_nan():\n    url_to_tld = URLToTLD()\n    urls = pd.Series(\n        [\"www.featuretools.com\", np.nan, \"featuretools\", \"\"],\n        dtype=\"object\",\n    )\n    correct_urls = pd.Series([\"com\", np.nan, np.nan, np.nan], dtype=\"object\")\n    results = url_to_tld(urls)\n    pd.testing.assert_series_equal(results, correct_urls, check_names=False)\n\n\ndef test_is_free_email_domain_valid_addresses():\n    is_free_email_domain = IsFreeEmailDomain()\n    array = pd.Series(\n        [\n            \"test@hotmail.com\",\n            \"name@featuretools.com\",\n            \"nobody@yahoo.com\",\n            \"free@gmail.com\",\n        ],\n    )\n    answers = pd.Series(is_free_email_domain(array))\n    correct_answers = pd.Series([True, False, True, True])\n    pd.testing.assert_series_equal(answers, correct_answers)\n\n\ndef test_is_free_email_domain_valid_addresses_whitespace():\n    
is_free_email_domain = IsFreeEmailDomain()\n    array = pd.Series(\n        [\n            \" test@hotmail.com\",\n            \" name@featuretools.com\",\n            \"nobody@yahoo.com \",\n            \" free@gmail.com \",\n        ],\n    )\n    answers = pd.Series(is_free_email_domain(array))\n    correct_answers = pd.Series([True, False, True, True])\n    pd.testing.assert_series_equal(answers, correct_answers)\n\n\ndef test_is_free_email_domain_nan():\n    is_free_email_domain = IsFreeEmailDomain()\n    array = pd.Series([np.nan, \"name@featuretools.com\", \"nobody@yahoo.com\"])\n    answers = pd.Series(is_free_email_domain(array))\n    correct_answers = pd.Series([np.nan, False, True])\n    pd.testing.assert_series_equal(answers, correct_answers)\n\n\ndef test_is_free_email_domain_empty_string():\n    is_free_email_domain = IsFreeEmailDomain()\n    array = pd.Series([\"\", \"name@featuretools.com\", \"nobody@yahoo.com\"])\n    answers = pd.Series(is_free_email_domain(array))\n    correct_answers = pd.Series([np.nan, False, True])\n    pd.testing.assert_series_equal(answers, correct_answers)\n\n\ndef test_is_free_email_domain_empty_series():\n    is_free_email_domain = IsFreeEmailDomain()\n    array = pd.Series([], dtype=\"category\")\n    answers = pd.Series(is_free_email_domain(array))\n    correct_answers = pd.Series([], dtype=\"category\")\n    pd.testing.assert_series_equal(answers, correct_answers)\n\n\ndef test_is_free_email_domain_invalid_email():\n    is_free_email_domain = IsFreeEmailDomain()\n    array = pd.Series(\n        [\n            np.nan,\n            \"this is not an email address\",\n            \"name@featuretools.com\",\n            \"nobody@yahoo.com\",\n            1234,\n            1.23,\n            True,\n        ],\n    )\n    answers = pd.Series(is_free_email_domain(array))\n    correct_answers = pd.Series([np.nan, np.nan, False, True, np.nan, np.nan, np.nan])\n    pd.testing.assert_series_equal(answers, 
correct_answers)\n\n\ndef test_is_free_email_domain_all_nan():\n    is_free_email_domain = IsFreeEmailDomain()\n    array = pd.Series([np.nan, np.nan])\n    answers = pd.Series(is_free_email_domain(array))\n    correct_answers = pd.Series([np.nan, np.nan], dtype=object)\n    pd.testing.assert_series_equal(answers, correct_answers)\n\n\ndef test_email_address_to_domain_valid_addresses():\n    email_address_to_domain = EmailAddressToDomain()\n    array = pd.Series(\n        [\n            \"test@hotmail.com\",\n            \"name@featuretools.com\",\n            \"nobody@yahoo.com\",\n            \"free@gmail.com\",\n        ],\n    )\n    answers = pd.Series(email_address_to_domain(array))\n    correct_answers = pd.Series(\n        [\"hotmail.com\", \"featuretools.com\", \"yahoo.com\", \"gmail.com\"],\n    )\n    pd.testing.assert_series_equal(answers, correct_answers)\n\n\ndef test_email_address_to_domain_valid_addresses_whitespace():\n    email_address_to_domain = EmailAddressToDomain()\n    array = pd.Series(\n        [\n            \" test@hotmail.com\",\n            \" name@featuretools.com\",\n            \"nobody@yahoo.com \",\n            \" free@gmail.com \",\n        ],\n    )\n    answers = pd.Series(email_address_to_domain(array))\n    correct_answers = pd.Series(\n        [\"hotmail.com\", \"featuretools.com\", \"yahoo.com\", \"gmail.com\"],\n    )\n    pd.testing.assert_series_equal(answers, correct_answers)\n\n\ndef test_email_address_to_domain_nan():\n    email_address_to_domain = EmailAddressToDomain()\n    array = pd.Series([np.nan, \"name@featuretools.com\", \"nobody@yahoo.com\"])\n    answers = pd.Series(email_address_to_domain(array))\n    correct_answers = pd.Series([np.nan, \"featuretools.com\", \"yahoo.com\"])\n    pd.testing.assert_series_equal(answers, correct_answers)\n\n\ndef test_email_address_to_domain_empty_string():\n    email_address_to_domain = EmailAddressToDomain()\n    array = pd.Series([\"\", \"name@featuretools.com\", 
\"nobody@yahoo.com\"])\n    answers = pd.Series(email_address_to_domain(array))\n    correct_answers = pd.Series([np.nan, \"featuretools.com\", \"yahoo.com\"])\n    pd.testing.assert_series_equal(answers, correct_answers)\n\n\ndef test_email_address_to_domain_empty_series():\n    email_address_to_domain = EmailAddressToDomain()\n    array = pd.Series([], dtype=\"category\")\n    answers = pd.Series(email_address_to_domain(array))\n    correct_answers = pd.Series([], dtype=\"category\")\n    pd.testing.assert_series_equal(answers, correct_answers)\n\n\ndef test_email_address_to_domain_invalid_email():\n    email_address_to_domain = EmailAddressToDomain()\n    array = pd.Series(\n        [\n            np.nan,\n            \"this is not an email address\",\n            \"name@featuretools.com\",\n            \"nobody@yahoo.com\",\n            1234,\n            1.23,\n            True,\n        ],\n    )\n    answers = pd.Series(email_address_to_domain(array))\n    correct_answers = pd.Series(\n        [np.nan, np.nan, \"featuretools.com\", \"yahoo.com\", np.nan, np.nan, np.nan],\n    )\n    pd.testing.assert_series_equal(answers, correct_answers)\n\n\ndef test_email_address_to_domain_all_nan():\n    email_address_to_domain = EmailAddressToDomain()\n    array = pd.Series([np.nan, np.nan])\n    answers = pd.Series(email_address_to_domain(array))\n    correct_answers = pd.Series([np.nan, np.nan], dtype=object)\n    pd.testing.assert_series_equal(answers, correct_answers)\n\n\ndef test_trans_primitives_can_init_without_params():\n    trans_primitives = get_transform_primitives().values()\n    for trans_primitive in trans_primitives:\n        trans_primitive()\n\n\ndef test_numeric_lag_future_warning():\n    warning_text = \"NumericLag is deprecated and will be removed in a future version. 
Please use the 'Lag' primitive instead.\"\n    with pytest.warns(FutureWarning, match=warning_text):\n        NumericLag()\n\n\ndef test_lag_regular():\n    primitive_instance = Lag()\n    primitive_func = primitive_instance.get_function()\n\n    array = pd.Series([1, 2, 3, 4])\n    time_array = pd.Series(pd.date_range(start=\"2020-01-01\", periods=4, freq=\"D\"))\n\n    answer = pd.Series(primitive_func(array, time_array))\n\n    correct_answer = pd.Series([np.nan, 1, 2, 3])\n    pd.testing.assert_series_equal(answer, correct_answer)\n\n\ndef test_lag_period():\n    primitive_instance = Lag(periods=3)\n    primitive_func = primitive_instance.get_function()\n\n    array = pd.Series([1, 2, 3, 4])\n    time_array = pd.Series(pd.date_range(start=\"2020-01-01\", periods=4, freq=\"D\"))\n\n    answer = pd.Series(primitive_func(array, time_array))\n\n    correct_answer = pd.Series([np.nan, np.nan, np.nan, 1])\n    pd.testing.assert_series_equal(answer, correct_answer)\n\n\ndef test_lag_negative_period():\n    primitive_instance = Lag(periods=-2)\n    primitive_func = primitive_instance.get_function()\n\n    array = pd.Series([1, 2, 3, 4])\n    time_array = pd.Series(pd.date_range(start=\"2020-01-01\", periods=4, freq=\"D\"))\n\n    answer = pd.Series(primitive_func(array, time_array))\n\n    correct_answer = pd.Series([3, 4, np.nan, np.nan])\n    pd.testing.assert_series_equal(answer, correct_answer)\n\n\ndef test_lag_starts_with_nan():\n    primitive_instance = Lag()\n    primitive_func = primitive_instance.get_function()\n\n    array = pd.Series([np.nan, 2, 3, 4])\n    time_array = pd.Series(pd.date_range(start=\"2020-01-01\", periods=4, freq=\"D\"))\n\n    answer = pd.Series(primitive_func(array, time_array))\n\n    correct_answer = pd.Series([np.nan, np.nan, 2, 3])\n    pd.testing.assert_series_equal(answer, correct_answer)\n\n\ndef test_lag_ends_with_nan():\n    primitive_instance = Lag()\n    primitive_func = primitive_instance.get_function()\n\n    array = 
pd.Series([1, 2, 3, np.nan])\n    time_array = pd.Series(pd.date_range(start=\"2020-01-01\", periods=4, freq=\"D\"))\n\n    answer = pd.Series(primitive_func(array, time_array))\n\n    correct_answer = pd.Series([np.nan, 1, 2, 3])\n    pd.testing.assert_series_equal(answer, correct_answer)\n\n\n@pytest.mark.parametrize(\n    \"input_array,expected_output\",\n    [\n        (\n            pd.Series([\"hello\", \"world\", \"foo\", \"bar\"], dtype=\"string\"),\n            pd.Series([np.nan, \"hello\", \"world\", \"foo\"], dtype=\"string\"),\n        ),\n        (\n            pd.Series([\"cow\", \"cow\", \"pig\", \"pig\"], dtype=\"category\"),\n            pd.Series([np.nan, \"cow\", \"cow\", \"pig\"], dtype=\"category\"),\n        ),\n        (\n            pd.Series([True, False, True, False], dtype=\"bool\"),\n            pd.Series([np.nan, True, False, True], dtype=\"object\"),\n        ),\n        (\n            pd.Series([True, False, True, False], dtype=\"boolean\"),\n            pd.Series([np.nan, True, False, True], dtype=\"boolean\"),\n        ),\n        (\n            pd.Series([1.23, 2.45, 3.56, 4.98], dtype=\"float\"),\n            pd.Series([np.nan, 1.23, 2.45, 3.56], dtype=\"float\"),\n        ),\n        (\n            pd.Series([1, 2, 3, 4], dtype=\"Int64\"),\n            pd.Series([np.nan, 1, 2, 3], dtype=\"Int64\"),\n        ),\n        (\n            pd.Series([1, 2, 3, 4], dtype=\"int64\"),\n            pd.Series([np.nan, 1, 2, 3], dtype=\"float64\"),\n        ),\n    ],\n)\ndef test_lag_with_different_dtypes(input_array, expected_output):\n    primitive_instance = Lag()\n    primitive_func = primitive_instance.get_function()\n    time_array = pd.Series(pd.date_range(start=\"2020-01-01\", periods=4, freq=\"D\"))\n    answer = pd.Series(primitive_func(input_array, time_array))\n    pd.testing.assert_series_equal(answer, expected_output)\n\n\ndef test_date_to_time_zone_primitive():\n    primitive_func = DateToTimeZone().get_function()\n    x = 
pd.Series(\n        [\n            datetime(2010, 1, 1, tzinfo=timezone(\"America/Los_Angeles\")),\n            datetime(2010, 1, 10, tzinfo=timezone(\"Singapore\")),\n            datetime(2020, 1, 1, tzinfo=timezone(\"UTC\")),\n            datetime(2010, 1, 1, tzinfo=timezone(\"Europe/London\")),\n        ],\n    )\n    answer = pd.Series([\"America/Los_Angeles\", \"Singapore\", \"UTC\", \"Europe/London\"])\n    pd.testing.assert_series_equal(primitive_func(x), answer)\n\n\ndef test_date_to_time_zone_datetime64():\n    primitive_func = DateToTimeZone().get_function()\n    x = pd.Series(\n        [\n            datetime(2010, 1, 1),\n            datetime(2010, 1, 10),\n            datetime(2020, 1, 1),\n        ],\n    ).astype(\"datetime64[ns]\")\n    x = x.dt.tz_localize(\"America/Los_Angeles\")\n    answer = pd.Series([\"America/Los_Angeles\"] * 3)\n    pd.testing.assert_series_equal(primitive_func(x), answer)\n\n\ndef test_date_to_time_zone_naive_dates():\n    primitive_func = DateToTimeZone().get_function()\n    x = pd.Series(\n        [\n            datetime(2010, 1, 1, tzinfo=timezone(\"America/Los_Angeles\")),\n            datetime(2010, 1, 1),\n            datetime(2010, 1, 2),\n        ],\n    )\n    answer = pd.Series([\"America/Los_Angeles\", np.nan, np.nan])\n    pd.testing.assert_series_equal(primitive_func(x), answer)\n\n\ndef test_date_to_time_zone_nan():\n    primitive_func = DateToTimeZone().get_function()\n    x = pd.Series(\n        [\n            datetime(2010, 1, 1, tzinfo=timezone(\"America/Los_Angeles\")),\n            pd.NaT,\n            np.nan,\n        ],\n    )\n    answer = pd.Series([\"America/Los_Angeles\", np.nan, np.nan])\n    pd.testing.assert_series_equal(primitive_func(x), answer)\n\n\ndef test_rate_of_change_primitive_regular_interval():\n    rate_of_change = RateOfChange()\n    times = pd.date_range(start=\"2019-01-01\", freq=\"2s\", periods=5)\n    values = [0, 30, 180, -90, 0]\n    expected = pd.Series([np.nan, 15, 75, -135, 
45])\n    actual = rate_of_change(values, times)\n    pd.testing.assert_series_equal(actual, expected)\n\n\ndef test_rate_of_change_primitive_uneven_interval():\n    rate_of_change = RateOfChange()\n    times = pd.to_datetime(\n        [\n            \"2019-01-01 00:00:00\",\n            \"2019-01-01 00:00:01\",\n            \"2019-01-01 00:00:03\",\n            \"2019-01-01 00:00:07\",\n            \"2019-01-01 00:00:08\",\n        ],\n    )\n    values = [0, 30, 180, -90, 0]\n    expected = pd.Series([np.nan, 30, 75, -67.5, 90])\n    actual = rate_of_change(values, times)\n    pd.testing.assert_series_equal(actual, expected)\n\n\ndef test_rate_of_change_primitive_with_nan():\n    rate_of_change = RateOfChange()\n    times = pd.date_range(start=\"2019-01-01\", freq=\"2s\", periods=5)\n    values = [0, 30, np.nan, -90, 0]\n    expected = pd.Series([np.nan, 15, np.nan, np.nan, 45])\n    actual = rate_of_change(values, times)\n    pd.testing.assert_series_equal(actual, expected)\n\n\nclass TestFileExtension(PrimitiveTestBase):\n    primitive = FileExtension\n\n    def test_filepaths(self):\n        primitive_func = FileExtension().get_function()\n        array = pd.Series(\n            [\n                \"doc.txt\",\n                \"~/documents/data.json\",\n                \"data.JSON\",\n                \"C:\\\\Projects\\\\apilibrary\\\\apilibrary.sln\",\n            ],\n            dtype=\"string\",\n        )\n        answer = pd.Series([\".txt\", \".json\", \".json\", \".sln\"], dtype=\"string\")\n        pd.testing.assert_series_equal(primitive_func(array), answer)\n\n    def test_invalid(self):\n        primitive_func = FileExtension().get_function()\n        array = pd.Series([\"doc.txt\", \"~/documents/data\", np.nan], dtype=\"string\")\n        answer = pd.Series([\".txt\", np.nan, np.nan], dtype=\"string\")\n        pd.testing.assert_series_equal(primitive_func(array), answer)\n\n    def test_with_featuretools(self, es):\n        transform, aggregation 
= find_applicable_primitives(self.primitive)\n        primitive_instance = self.primitive()\n        transform.append(primitive_instance)\n        valid_dfs(\n            es,\n            aggregation,\n            transform,\n            self.primitive,\n            target_dataframe_name=\"sessions\",\n        )\n\n\nclass TestIsFirstWeekOfMonth(PrimitiveTestBase):\n    primitive = IsFirstWeekOfMonth\n\n    def test_valid_dates(self):\n        primitive_func = self.primitive().get_function()\n        array = pd.Series(\n            [\n                pd.to_datetime(\"03/01/2019\"),\n                pd.to_datetime(\"03/03/2019\"),\n                pd.to_datetime(\"03/31/2019\"),\n                pd.to_datetime(\"03/30/2019\"),\n            ],\n        )\n        answers = primitive_func(array).tolist()\n        correct_answers = [True, False, False, False]\n        np.testing.assert_array_equal(answers, correct_answers)\n\n    def test_leap_year(self):\n        primitive_func = self.primitive().get_function()\n        array = pd.Series(\n            [\n                pd.to_datetime(\"03/01/2019\"),\n                pd.to_datetime(\"02/29/2016\"),\n                pd.to_datetime(\"03/31/2019\"),\n                pd.to_datetime(\"03/30/2019\"),\n            ],\n        )\n        answers = primitive_func(array).tolist()\n        correct_answers = [True, False, False, False]\n        np.testing.assert_array_equal(answers, correct_answers)\n\n    def test_year_before_1970(self):\n        primitive_func = self.primitive().get_function()\n        array = pd.Series(\n            [\n                pd.to_datetime(\"06/01/1965\"),\n                pd.to_datetime(\"03/02/2019\"),\n                pd.to_datetime(\"03/31/2019\"),\n                pd.to_datetime(\"03/30/2019\"),\n            ],\n        )\n        answers = primitive_func(array).tolist()\n        correct_answers = [True, True, False, False]\n        np.testing.assert_array_equal(answers, correct_answers)\n\n    
def test_year_after_2038(self):\n        primitive_func = self.primitive().get_function()\n        array = pd.Series(\n            [\n                pd.to_datetime(\"12/31/2040\"),\n                pd.to_datetime(\"01/01/2040\"),\n                pd.to_datetime(\"03/31/2019\"),\n                pd.to_datetime(\"03/30/2019\"),\n            ],\n        )\n        answers = primitive_func(array).tolist()\n        correct_answers = [False, True, False, False]\n        np.testing.assert_array_equal(answers, correct_answers)\n\n    def test_nan_input(self):\n        primitive_func = self.primitive().get_function()\n        array = pd.Series(\n            [\n                pd.to_datetime(\"03/01/2019\"),\n                np.nan,\n                np.datetime64(\"NaT\"),\n                pd.to_datetime(\"03/30/2019\"),\n            ],\n        )\n        answers = primitive_func(array).tolist()\n        correct_answers = [True, np.nan, np.nan, False]\n        np.testing.assert_array_equal(answers, correct_answers)\n\n    def test_with_featuretools(self, es):\n        transform, aggregation = find_applicable_primitives(self.primitive)\n        primitive_instance = self.primitive()\n        transform.append(primitive_instance)\n        valid_dfs(es, aggregation, transform, self.primitive)\n\n\nclass TestNthWeekOfMonth(PrimitiveTestBase):\n    primitive = NthWeekOfMonth\n\n    def test_valid_dates(self):\n        primitive_func = self.primitive().get_function()\n        array = pd.Series(\n            [\n                pd.to_datetime(\"03/01/2019\"),\n                pd.to_datetime(\"03/03/2019\"),\n                pd.to_datetime(\"03/31/2019\"),\n                pd.to_datetime(\"03/30/2019\"),\n                pd.to_datetime(\"09/01/2019\"),\n            ],\n        )\n        answers = primitive_func(array)\n        correct_answers = [1, 2, 6, 5, 1]\n        np.testing.assert_array_equal(answers, correct_answers)\n\n    def test_leap_year(self):\n        primitive_func = 
self.primitive().get_function()\n        array = pd.Series(\n            [\n                pd.to_datetime(\"03/01/2019\"),\n                pd.to_datetime(\"02/29/2016\"),\n                pd.to_datetime(\"03/31/2019\"),\n                pd.to_datetime(\"03/30/2019\"),\n            ],\n        )\n        answers = primitive_func(array)\n        correct_answers = [1, 5, 6, 5]\n        np.testing.assert_array_equal(answers, correct_answers)\n\n    def test_year_before_1970(self):\n        primitive_func = self.primitive().get_function()\n        array = pd.Series(\n            [\n                pd.to_datetime(\"06/06/1965\"),\n                pd.to_datetime(\"03/02/2019\"),\n                pd.to_datetime(\"03/31/2019\"),\n                pd.to_datetime(\"03/30/2019\"),\n            ],\n        )\n        answers = primitive_func(array)\n        correct_answers = [2, 1, 6, 5]\n        np.testing.assert_array_equal(answers, correct_answers)\n\n    def test_year_after_2038(self):\n        primitive_func = self.primitive().get_function()\n        array = pd.Series(\n            [\n                pd.to_datetime(\"12/31/2040\"),\n                pd.to_datetime(\"01/01/2001\"),\n                pd.to_datetime(\"03/31/2019\"),\n                pd.to_datetime(\"03/30/2019\"),\n            ],\n        )\n        answers = primitive_func(array)\n        correct_answers = [6, 1, 6, 5]\n        np.testing.assert_array_equal(answers, correct_answers)\n\n    def test_nan_input(self):\n        primitive_func = self.primitive().get_function()\n        array = pd.Series(\n            [\n                pd.to_datetime(\"03/01/2019\"),\n                np.nan,\n                np.datetime64(\"NaT\"),\n                pd.to_datetime(\"03/30/2019\"),\n            ],\n        )\n        answers = primitive_func(array)\n        correct_answers = [1, np.nan, np.nan, 5]\n        np.testing.assert_array_equal(answers, correct_answers)\n\n    def test_with_featuretools(self, es):\n        
transform, aggregation = find_applicable_primitives(self.primitive)\n        primitive_instance = self.primitive()\n        transform.append(primitive_instance)\n        valid_dfs(es, aggregation, transform, self.primitive)\n"
  },
  {
    "path": "featuretools/tests/primitive_tests/utils.py",
    "content": "from inspect import signature\n\nimport pytest\n\nfrom featuretools import (\n    FeatureBase,\n    calculate_feature_matrix,\n    dfs,\n    encode_features,\n    list_primitives,\n    load_features,\n    save_features,\n)\nfrom featuretools.primitives.base import AggregationPrimitive, PrimitiveBase\nfrom featuretools.tests.testing_utils import make_ecommerce_entityset\n\nPRIMITIVES = list_primitives()\n\n\ndef get_number_from_offset(offset):\n    \"\"\"Extract the numeric element of a potential offset string.\n\n    Args:\n        offset (int, str): If offset is an integer, that value is returned. If offset is a string,\n            it's assumed to be an offset string of the format nD where n is a single digit integer.\n\n    Note: This helper utility should only be used with offset strings that only have one numeric character.\n        Only the first character will be returned, so if an offset string 24H is used, it will incorrectly\n        return the integer 2. Additionally, any of the offset timespans (H for hourly, D for daily, etc.)\n        can be used here; however, care should be taken by the user to remember what that timespan is when\n        writing tests, as comparing 7 from 7D to 1 from 1W may not behave as expected.\n    \"\"\"\n    if isinstance(offset, str):\n        return int(offset[0])\n    else:\n        return offset\n\n\nclass PrimitiveTestBase:\n    primitive = None\n\n    @pytest.fixture()\n    def es(self):\n        es = make_ecommerce_entityset()\n        return es\n\n    def test_name_and_desc(self):\n        assert self.primitive.name is not None\n        assert self.primitive.__doc__ is not None\n        docstring = self.primitive.__doc__\n        short_description = docstring.splitlines()[0]\n        first_word = short_description.split(\" \", 1)[0]\n        valid_verbs = [\n            \"Calculates\",\n            \"Determines\",\n            \"Transforms\",\n            \"Computes\",\n            \"Shifts\",\n       
     \"Extracts\",\n            \"Applies\",\n        ]\n        assert any(s in first_word for s in valid_verbs)\n        assert self.primitive.input_types is not None\n\n    def test_name_in_primitive_list(self):\n        assert PRIMITIVES.name.eq(self.primitive.name).any()\n\n    def test_arg_init(self):\n        primitive_ = self.primitive()\n        # determine the optional arguments in the __init__\n        init_params = signature(self.primitive.__init__)\n        for name, parameter in init_params.parameters.items():\n            if parameter.default is not parameter.empty:\n                assert hasattr(primitive_, name)\n\n    def test_serialize(self, es, target_dataframe_name=\"log\"):\n        check_serialize(primitive=self.primitive, es=es, target_dataframe_name=\"log\")\n\n\ndef check_serialize(primitive, es, target_dataframe_name=\"log\"):\n    trans_primitives = []\n    agg_primitives = []\n    if issubclass(primitive, AggregationPrimitive):\n        agg_primitives = [primitive]\n    else:\n        trans_primitives = [primitive]\n    features = dfs(\n        entityset=es,\n        target_dataframe_name=target_dataframe_name,\n        agg_primitives=agg_primitives,\n        trans_primitives=trans_primitives,\n        max_features=-1,\n        max_depth=3,\n        features_only=True,\n        return_types=\"all\",\n    )\n\n    feat_to_serialize = None\n    for feature in features:\n        if feature.primitive.__class__ == primitive:\n            feat_to_serialize = feature\n            break\n        for base_feature in feature.get_dependencies(deep=True):\n            if base_feature.primitive.__class__ == primitive:\n                feat_to_serialize = base_feature\n                break\n    assert feat_to_serialize is not None\n\n    # Skip calculating feature matrix for long running primitives\n    skip_primitives = [\"elmo\"]\n\n    if primitive.name not in skip_primitives:\n        df1 = calculate_feature_matrix([feat_to_serialize], 
entityset=es)\n\n    new_feat = load_features(save_features([feat_to_serialize]))[0]\n    assert isinstance(new_feat, FeatureBase)\n\n    if primitive.name not in skip_primitives:\n        df2 = calculate_feature_matrix([new_feat], entityset=es)\n        assert df1.equals(df2)\n\n\ndef find_applicable_primitives(primitive):\n    from featuretools.primitives.utils import (\n        get_aggregation_primitives,\n        get_transform_primitives,\n    )\n\n    all_transform_primitives = list(get_transform_primitives().values())\n    all_aggregation_primitives = list(get_aggregation_primitives().values())\n    applicable_transforms = find_stackable_primitives(\n        all_transform_primitives,\n        primitive,\n    )\n    applicable_aggregations = find_stackable_primitives(\n        all_aggregation_primitives,\n        primitive,\n    )\n    return applicable_transforms, applicable_aggregations\n\n\ndef find_stackable_primitives(all_primitives, primitive):\n    applicable_primitives = []\n    for x in all_primitives:\n        if x.input_types == [primitive.return_type]:\n            applicable_primitives.append(x)\n    return applicable_primitives\n\n\ndef valid_dfs(\n    es,\n    aggregations,\n    transforms,\n    feature_substrings,\n    target_dataframe_name=\"log\",\n    multi_output=False,\n    max_depth=3,\n    max_features=-1,\n    instance_ids=[0, 1, 2, 3],\n):\n    if not isinstance(feature_substrings, list):\n        feature_substrings = [feature_substrings]\n\n    if any([issubclass(x, PrimitiveBase) for x in feature_substrings]):\n        feature_substrings = [x.name.upper() for x in feature_substrings]\n\n    features = dfs(\n        entityset=es,\n        target_dataframe_name=target_dataframe_name,\n        agg_primitives=aggregations,\n        trans_primitives=transforms,\n        max_features=max_features,\n        max_depth=max_depth,\n        features_only=True,\n    )\n    applicable_features = []\n    for feat in features:\n        
applicable_features += [\n            feat for x in feature_substrings if x in feat.get_name()\n        ]\n    if len(applicable_features) == 0:\n        raise ValueError(\n            \"No feature names with %s, verify the name attribute \"\n            \"is defined and/or generate_name() is defined to \"\n            \"return %s\"\n            % (feature_substrings, feature_substrings),\n        )\n    df = calculate_feature_matrix(\n        entityset=es,\n        features=applicable_features,\n        instance_ids=instance_ids,\n        n_jobs=1,\n    )\n\n    encode_features(df, applicable_features)\n\n    # TODO: check the multi_output shape by checking\n    # feature.number_output_features for each feature\n    # and comparing it with the matrix shape\n    if not multi_output:\n        assert len(applicable_features) == df.shape[1]\n    return\n"
  },
  {
    "path": "featuretools/tests/profiling/__init__.py",
    "content": ""
  },
  {
    "path": "featuretools/tests/profiling/dfs_profile.py",
    "content": "\"\"\"\ndfs_profile.py\n\nHelper module to allow profiling of the dfs operations.  At some point we may\nwant to use pstats to output the results to a log, but I'm anticipating that\nLookingGlass will provide the performance data we want.\n\nNotes:\n  - output currently goes to the root directory and is in dfs_profile.stats\n  - *.stats is gitignored\n  - it uses the demo customers dataset for testing\n  - max_depth > 2 is very slow (currently)\n  - stats output can be viewed online with https://nejc.saje.info/pstats-viewer.html\n\"\"\"\n\nimport cProfile\nfrom pathlib import Path\n\nimport featuretools as ft\nimport featuretools.demo as demo\nfrom featuretools.synthesis.dfs import dfs\n\nes = demo.load_retail()\n\nall_aggs = ft.primitives.get_aggregation_primitives()\nall_trans = ft.primitives.get_transform_primitives()\n\nprofiler = cProfile.Profile(builtins=False)\nprofiler.enable()\nfeature_defs = dfs(\n    entityset=es,\n    target_dataframe_name=\"customers\",\n    trans_primitives=all_trans,\n    agg_primitives=all_aggs,\n    max_depth=2,\n    features_only=True,\n)\nprofiler.disable()\nprofiler.dump_stats(Path.cwd() / \"dfs_profile.stats\")\n"
  },
  {
    "path": "featuretools/tests/requirement_files/latest_requirements.txt",
    "content": "cloudpickle==3.0.0\ndask==2024.6.2\ndask-expr==1.1.6\ndistributed==2024.6.2\nholidays==0.51\nnumpy==1.26.4\npandas==2.2.2\npsutil==6.0.0\nscipy==1.13.1\ntqdm==4.66.4\nwoodwork==0.31.0\n"
  },
  {
    "path": "featuretools/tests/requirement_files/minimum_core_requirements.txt",
    "content": "cloudpickle==1.5.0\nholidays==0.17\nnumpy==1.25.0\npackaging==20.0\npandas==2.0.0\npsutil==5.7.0\nscipy==1.10.0\ntqdm==4.66.3\nwoodwork==0.28.0\n"
  },
  {
    "path": "featuretools/tests/requirement_files/minimum_dask_requirements.txt",
    "content": "cloudpickle==1.5.0\ndask[dataframe]==2023.2.0\ndistributed==2023.2.0\nholidays==0.17\nnumpy==1.25.0\npackaging==20.0\npandas==2.0.0\npsutil==5.7.0\nscipy==1.10.0\ntqdm==4.66.3\nwoodwork==0.28.0\n"
  },
  {
    "path": "featuretools/tests/requirement_files/minimum_test_requirements.txt",
    "content": "boto3==1.34.32\ncloudpickle==1.5.0\ncomposeml==0.8.0\ngraphviz==0.8.4\nholidays==0.17\nmoto[all]==5.0.0\nnumpy==1.25.0\npackaging==20.0\npandas==2.0.0\npip==23.3.0\npsutil==5.7.0\npyarrow==14.0.1\npympler==0.8\npytest-cov==3.0.0\npytest-timeout==2.1.0\npytest-xdist==2.5.0\npytest==7.1.2\nscipy==1.10.0\nsmart-open==5.0.0\ntqdm==4.66.3\nurllib3==1.26.18\nwoodwork==0.28.0\n"
  },
  {
    "path": "featuretools/tests/selection/__init__.py",
    "content": ""
  },
  {
    "path": "featuretools/tests/selection/test_selection.py",
    "content": "import numpy as np\nimport pandas as pd\nimport pytest\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Boolean, BooleanNullable, NaturalLanguage\n\nfrom featuretools import EntitySet, Feature, dfs\nfrom featuretools.selection import (\n    remove_highly_correlated_features,\n    remove_highly_null_features,\n    remove_low_information_features,\n    remove_single_value_features,\n)\nfrom featuretools.tests.testing_utils import make_ecommerce_entityset\n\n\n@pytest.fixture\ndef feature_matrix():\n    feature_matrix = pd.DataFrame(\n        {\n            \"test\": [0, 1, 2],\n            \"no_null\": [np.nan, 0, 0],\n            \"some_null\": [np.nan, 0, 0],\n            \"all_null\": [np.nan, np.nan, np.nan],\n            \"many_value\": [1, 2, 3],\n            \"dup_value\": [1, 1, 2],\n            \"one_value\": [1, 1, 1],\n        },\n    )\n    return feature_matrix\n\n\n@pytest.fixture\ndef test_es(es, feature_matrix):\n    es.add_dataframe(dataframe_name=\"test\", dataframe=feature_matrix, index=\"test\")\n    return es\n\n\ndef test_remove_low_information_feature_names(feature_matrix):\n    feature_matrix = remove_low_information_features(feature_matrix)\n    assert feature_matrix.shape == (3, 5)\n    assert \"one_value\" not in feature_matrix.columns\n    assert \"all_null\" not in feature_matrix.columns\n\n\ndef test_remove_low_information_features(test_es, feature_matrix):\n    features = [Feature(test_es[\"test\"].ww[col]) for col in test_es[\"test\"].columns]\n    feature_matrix, features = remove_low_information_features(feature_matrix, features)\n    assert feature_matrix.shape == (3, 5)\n    assert len(features) == 5\n    for f in features:\n        assert f.get_name() in feature_matrix.columns\n    assert \"one_value\" not in feature_matrix.columns\n    assert \"all_null\" not in feature_matrix.columns\n\n\ndef test_remove_highly_null_features():\n    nulls_df = pd.DataFrame(\n        {\n         
   \"id\": [0, 1, 2, 3],\n            \"half_nulls\": [None, None, 88, 99],\n            \"all_nulls\": [None, None, None, None],\n            \"quarter\": [\"a\", \"b\", None, \"c\"],\n            \"vals\": [True, True, False, False],\n        },\n    )\n\n    es = EntitySet(\"data\", {\"nulls\": (nulls_df, \"id\")})\n    es[\"nulls\"].ww.set_types(\n        logical_types={\"all_nulls\": \"categorical\", \"quarter\": \"categorical\"},\n    )\n    fm, features = dfs(\n        entityset=es,\n        target_dataframe_name=\"nulls\",\n        trans_primitives=[\"is_null\"],\n        max_depth=1,\n    )\n\n    with pytest.raises(\n        ValueError,\n        match=\"pct_null_threshold must be a float between 0 and 1, inclusive.\",\n    ):\n        remove_highly_null_features(fm, pct_null_threshold=1.1)\n\n    with pytest.raises(\n        ValueError,\n        match=\"pct_null_threshold must be a float between 0 and 1, inclusive.\",\n    ):\n        remove_highly_null_features(fm, pct_null_threshold=-0.1)\n\n    no_thresh = remove_highly_null_features(fm)\n    no_thresh_cols = set(no_thresh.columns)\n    diff = set(fm.columns) - no_thresh_cols\n    assert len(diff) == 1\n    assert \"all_nulls\" not in no_thresh_cols\n\n    half = remove_highly_null_features(fm, pct_null_threshold=0.5)\n    half_cols = set(half.columns)\n    diff = set(fm.columns) - half_cols\n    assert len(diff) == 2\n    assert \"all_nulls\" not in half_cols\n    assert \"half_nulls\" not in half_cols\n\n    no_tolerance = remove_highly_null_features(fm, pct_null_threshold=0)\n    no_tolerance_cols = set(no_tolerance.columns)\n    diff = set(fm.columns) - no_tolerance_cols\n    assert len(diff) == 3\n    assert \"all_nulls\" not in no_tolerance_cols\n    assert \"half_nulls\" not in no_tolerance_cols\n    assert \"quarter\" not in no_tolerance_cols\n\n    (\n        with_features_param,\n        with_features_param_features,\n    ) = remove_highly_null_features(fm, features)\n    assert 
len(with_features_param_features) == len(no_thresh.columns)\n    for i in range(len(with_features_param_features)):\n        assert with_features_param_features[i].get_name() == no_thresh.columns[i]\n        assert with_features_param.columns[i] == no_thresh.columns[i]\n\n\ndef test_remove_single_value_features():\n    same_vals_df = pd.DataFrame(\n        {\n            \"id\": [0, 1, 2, 3],\n            \"all_numeric\": [88, 88, 88, 88],\n            \"with_nan\": [1, 1, None, 1],\n            \"all_nulls\": [None, None, None, None],\n            \"all_categorical\": [\"a\", \"a\", \"a\", \"a\"],\n            \"all_bools\": [True, True, True, True],\n            \"diff_vals\": [\"hi\", \"bye\", \"bye\", \"hi\"],\n        },\n    )\n\n    es = EntitySet(\"data\", {\"single_vals\": (same_vals_df, \"id\")})\n    es[\"single_vals\"].ww.set_types(\n        logical_types={\n            \"all_nulls\": \"categorical\",\n            \"all_categorical\": \"categorical\",\n            \"diff_vals\": \"categorical\",\n        },\n    )\n    fm, features = dfs(\n        entityset=es,\n        target_dataframe_name=\"single_vals\",\n        trans_primitives=[\"is_null\"],\n        max_depth=1,\n    )\n\n    no_params, no_params_features = remove_single_value_features(fm, features)\n    no_params_cols = set(no_params.columns)\n    assert len(no_params_features) == 2\n    assert \"IS_NULL(with_nan)\" in no_params_cols\n    assert \"diff_vals\" in no_params_cols\n\n    nan_as_value, nan_as_value_features = remove_single_value_features(\n        fm,\n        features,\n        count_nan_as_value=True,\n    )\n    nan_cols = set(nan_as_value.columns)\n    assert len(nan_as_value_features) == 3\n    assert \"IS_NULL(with_nan)\" in nan_cols\n    assert \"diff_vals\" in nan_cols\n    assert \"with_nan\" in nan_cols\n\n    without_features_param = remove_single_value_features(fm)\n    assert len(no_params.columns) == len(without_features_param.columns)\n    for i in 
range(len(no_params.columns)):\n        assert no_params.columns[i] == without_features_param.columns[i]\n        assert no_params_features[i].get_name() == without_features_param.columns[i]\n\n\ndef test_remove_highly_correlated_features():\n    correlated_df = pd.DataFrame(\n        {\n            \"id\": [0, 1, 2, 3],\n            \"diff_ints\": [34, 11, 29, 91],\n            \"words\": [\"test\", \"this is a short sentence\", \"foo bar\", \"baz\"],\n            \"corr_words\": [4, 24, 7, 3],\n            \"corr_1\": [99, 88, 77, 33],\n            \"corr_2\": [99, 88, 77, 33],\n        },\n    )\n\n    es = EntitySet(\n        \"data\",\n        {\"correlated\": (correlated_df, \"id\", None, {\"words\": NaturalLanguage})},\n    )\n    fm, _ = dfs(\n        entityset=es,\n        target_dataframe_name=\"correlated\",\n        trans_primitives=[\"num_characters\"],\n        max_depth=1,\n    )\n\n    with pytest.raises(\n        ValueError,\n        match=\"pct_corr_threshold must be a float between 0 and 1, inclusive.\",\n    ):\n        remove_highly_correlated_features(fm, pct_corr_threshold=1.1)\n\n    with pytest.raises(\n        ValueError,\n        match=\"pct_corr_threshold must be a float between 0 and 1, inclusive.\",\n    ):\n        remove_highly_correlated_features(fm, pct_corr_threshold=-0.1)\n\n    with pytest.raises(\n        AssertionError,\n        match=\"feature named not_a_feature is not in feature matrix\",\n    ):\n        remove_highly_correlated_features(fm, features_to_check=[\"not_a_feature\"])\n\n    to_check = remove_highly_correlated_features(\n        fm,\n        features_to_check=[\"corr_words\", \"NUM_CHARACTERS(words)\", \"diff_ints\"],\n    )\n    to_check_columns = set(to_check.columns)\n    assert len(to_check_columns) == 4\n    assert \"NUM_CHARACTERS(words)\" not in to_check_columns\n    assert \"corr_1\" in to_check_columns\n    assert \"corr_2\" in to_check_columns\n\n    to_keep = remove_highly_correlated_features(\n      
  fm,\n        features_to_keep=[\"NUM_CHARACTERS(words)\"],\n    )\n    to_keep_names = set(to_keep.columns)\n    assert len(to_keep_names) == 4\n    assert \"corr_words\" in to_keep_names\n    assert \"NUM_CHARACTERS(words)\" in to_keep_names\n    assert \"corr_2\" not in to_keep_names\n\n    new_fm = remove_highly_correlated_features(fm)\n    assert len(new_fm.columns) == 3\n    assert \"corr_2\" not in new_fm.columns\n    assert \"NUM_CHARACTERS(words)\" not in new_fm.columns\n\n    diff_threshold = remove_highly_correlated_features(fm, pct_corr_threshold=0.8)\n    diff_threshold_cols = diff_threshold.columns\n    assert len(diff_threshold_cols) == 2\n    assert \"corr_words\" in diff_threshold_cols\n    assert \"diff_ints\" in diff_threshold_cols\n\n\ndef test_remove_highly_correlated_features_init_woodwork():\n    correlated_df = pd.DataFrame(\n        {\n            \"id\": [0, 1, 2, 3],\n            \"diff_ints\": [34, 11, 29, 91],\n            \"words\": [\"test\", \"this is a short sentence\", \"foo bar\", \"baz\"],\n            \"corr_words\": [4, 24, 7, 3],\n            \"corr_1\": [99, 88, 77, 33],\n            \"corr_2\": [99, 88, 77, 33],\n        },\n    )\n\n    es = EntitySet(\n        \"data\",\n        {\"correlated\": (correlated_df, \"id\", None, {\"words\": NaturalLanguage})},\n    )\n    fm, _ = dfs(\n        entityset=es,\n        target_dataframe_name=\"correlated\",\n        trans_primitives=[\"num_characters\"],\n        max_depth=1,\n    )\n\n    no_ww_fm = fm.copy()\n    ww_fm = fm.copy()\n    ww_fm.ww.init()\n\n    new_no_ww_fm = remove_highly_correlated_features(no_ww_fm)\n    new_ww_fm = remove_highly_correlated_features(ww_fm)\n\n    pd.testing.assert_frame_equal(new_no_ww_fm, new_ww_fm)\n\n\ndef test_multi_output_selection():\n    df1 = pd.DataFrame({\"id\": [0, 1, 2, 3]})\n\n    df2 = pd.DataFrame(\n        {\n            \"index\": [0, 1, 2, 3],\n            \"first_id\": [0, 1, 1, 3],\n            \"all_nulls\": [None, None, 
None, None],\n            \"quarter\": [\"a\", \"b\", None, \"c\"],\n        },\n    )\n\n    dataframes = {\n        \"first\": (df1, \"id\"),\n        \"second\": (df2, \"index\"),\n    }\n\n    relationships = [(\"first\", \"id\", \"second\", \"first_id\")]\n    es = EntitySet(\"data\", dataframes, relationships=relationships)\n    es[\"second\"].ww.set_types(\n        logical_types={\"all_nulls\": \"categorical\", \"quarter\": \"categorical\"},\n    )\n\n    fm, features = dfs(\n        entityset=es,\n        target_dataframe_name=\"first\",\n        trans_primitives=[],\n        agg_primitives=[\"n_most_common\"],\n        max_depth=1,\n    )\n\n    multi_output, multi_output_features = remove_single_value_features(fm, features)\n    assert multi_output.columns == [\"N_MOST_COMMON(second.quarter)[0]\"]\n    assert len(multi_output_features) == 1\n    assert multi_output_features[0].get_name() == multi_output.columns[0]\n\n    es = make_ecommerce_entityset()\n    fm, features = dfs(\n        entityset=es,\n        target_dataframe_name=\"régions\",\n        trans_primitives=[],\n        agg_primitives=[\"n_most_common\"],\n        max_depth=2,\n    )\n\n    matrix_with_slices, unsliced_features = remove_highly_null_features(fm, features)\n    assert len(matrix_with_slices.columns) == 18\n    assert len(unsliced_features) == 14\n\n    matrix_columns = set(matrix_with_slices.columns)\n    for f in unsliced_features:\n        for f_name in f.get_feature_names():\n            assert f_name in matrix_columns\n\n\ndef test_remove_highly_correlated_features_on_boolean_cols():\n    correlated_df = pd.DataFrame(\n        {\n            \"id\": [0, 1, 2, 3],\n            \"diff_ints\": [34, 11, 29, 91],\n            \"corr_words\": [4, 24, 7, 3],\n            \"bools\": [True, True, False, True],\n        },\n    )\n\n    es = EntitySet(\n        \"data\",\n        {\"correlated\": (correlated_df, \"id\", None, {\"bools\": Boolean})},\n    )\n\n    feature_matrix, 
features = dfs(\n        entityset=es,\n        target_dataframe_name=\"correlated\",\n        trans_primitives=[\"equal\"],\n        agg_primitives=[],\n        max_depth=1,\n        return_types=[\n            ColumnSchema(logical_type=BooleanNullable),\n            ColumnSchema(logical_type=Boolean),\n        ],\n    )\n    # Confirm both boolean logical types are included so that we know we're checking the correct types\n    assert {\n        ltype.type_string for ltype in feature_matrix.ww.logical_types.values()\n    } == {Boolean.type_string, BooleanNullable.type_string}\n\n    to_keep = remove_highly_correlated_features(\n        feature_matrix=feature_matrix,\n        features=features,\n        pct_corr_threshold=0.3,\n    )\n    assert len(to_keep[0].columns) < len(feature_matrix.columns)\n"
  },
  {
    "path": "featuretools/tests/synthesis/__init__.py",
    "content": ""
  },
  {
    "path": "featuretools/tests/synthesis/test_deep_feature_synthesis.py",
    "content": "import copy\nimport re\n\nimport pandas as pd\nimport pytest\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import Datetime\n\nfrom featuretools import EntitySet, Feature, GroupByTransformFeature\nfrom featuretools.entityset.entityset import LTI_COLUMN_NAME\nfrom featuretools.feature_base import (\n    AggregationFeature,\n    DirectFeature,\n    IdentityFeature,\n    TransformFeature,\n)\nfrom featuretools.feature_base.utils import is_valid_input\nfrom featuretools.primitives import (\n    Absolute,\n    AddNumeric,\n    Count,\n    CumCount,\n    CumMean,\n    CumMin,\n    CumSum,\n    Day,\n    Diff,\n    Equal,\n    Hour,\n    IsIn,\n    IsNull,\n    Last,\n    Mean,\n    Mode,\n    Month,\n    Negate,\n    NMostCommon,\n    Not,\n    NotEqual,\n    NumCharacters,\n    NumTrue,\n    NumUnique,\n    RollingCount,\n    RollingMax,\n    RollingMean,\n    RollingMin,\n    RollingOutlierCount,\n    RollingSTD,\n    Sum,\n    TimeSincePrevious,\n    TransformPrimitive,\n    Trend,\n    Year,\n)\nfrom featuretools.synthesis import DeepFeatureSynthesis\nfrom featuretools.tests.testing_utils import (\n    feature_with_name,\n    make_ecommerce_entityset,\n    number_of_features_with_name_like,\n)\n\n\ndef test_makes_agg_features_from_str(es):\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"sessions\",\n        entityset=es,\n        agg_primitives=[\"sum\"],\n        trans_primitives=[],\n    )\n\n    features = dfs_obj.build_features()\n    assert feature_with_name(features, \"SUM(log.value)\")\n\n\ndef test_makes_agg_features_from_mixed_str(es):\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"sessions\",\n        entityset=es,\n        agg_primitives=[Count, \"sum\"],\n        trans_primitives=[],\n    )\n\n    features = dfs_obj.build_features()\n    assert feature_with_name(features, \"SUM(log.value)\")\n    assert feature_with_name(features, \"COUNT(log)\")\n\n\ndef 
test_makes_agg_features(es):\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"sessions\",\n        entityset=es,\n        agg_primitives=[Sum],\n        trans_primitives=[],\n    )\n\n    features = dfs_obj.build_features()\n    assert feature_with_name(features, \"SUM(log.value)\")\n\n\ndef test_only_makes_supplied_agg_feat(es):\n    kwargs = dict(\n        target_dataframe_name=\"customers\",\n        entityset=es,\n        max_depth=3,\n    )\n    dfs_obj = DeepFeatureSynthesis(agg_primitives=[Sum], **kwargs)\n\n    features = dfs_obj.build_features()\n\n    def find_other_agg_features(features):\n        return [\n            f\n            for f in features\n            if (isinstance(f, AggregationFeature) and not isinstance(f.primitive, Sum))\n            or len(\n                [\n                    g\n                    for g in f.base_features\n                    if isinstance(g, AggregationFeature)\n                    and not isinstance(g.primitive, Sum)\n                ],\n            )\n            > 0\n        ]\n\n    other_agg_features = find_other_agg_features(features)\n    assert len(other_agg_features) == 0\n\n\ndef test_error_for_missing_target_dataframe(es):\n    error_text = (\n        \"Provided target dataframe missing_dataframe does not exist in ecommerce\"\n    )\n    with pytest.raises(KeyError, match=error_text):\n        DeepFeatureSynthesis(\n            target_dataframe_name=\"missing_dataframe\",\n            entityset=es,\n            agg_primitives=[Last],\n            trans_primitives=[],\n            ignore_dataframes=[\"log\"],\n        )\n\n    es_without_id = EntitySet()\n    error_text = (\n        \"Provided target dataframe missing_dataframe does not exist in entity set\"\n    )\n    with pytest.raises(KeyError, match=error_text):\n        DeepFeatureSynthesis(\n            target_dataframe_name=\"missing_dataframe\",\n            entityset=es_without_id,\n            agg_primitives=[Last],\n     
       trans_primitives=[],\n            ignore_dataframes=[\"log\"],\n        )\n\n\ndef test_ignores_dataframes(es):\n    error_text = \"ignore_dataframes must be a list\"\n    with pytest.raises(TypeError, match=error_text):\n        DeepFeatureSynthesis(\n            target_dataframe_name=\"sessions\",\n            entityset=es,\n            agg_primitives=[Sum],\n            trans_primitives=[],\n            ignore_dataframes=\"log\",\n        )\n\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"sessions\",\n        entityset=es,\n        agg_primitives=[Sum],\n        trans_primitives=[],\n        ignore_dataframes=[\"log\"],\n    )\n\n    features = dfs_obj.build_features()\n    for f in features:\n        deps = f.get_dependencies(deep=True)\n        dataframes = [d.dataframe_name for d in deps]\n        assert \"log\" not in dataframes\n\n\ndef test_ignores_columns(es):\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"sessions\",\n        entityset=es,\n        agg_primitives=[Sum],\n        trans_primitives=[],\n        ignore_columns={\"log\": [\"value\"]},\n    )\n    features = dfs_obj.build_features()\n    for f in features:\n        deps = f.get_dependencies(deep=True)\n        identities = [d for d in deps if isinstance(d, IdentityFeature)]\n        columns = [d.column_name for d in identities if d.dataframe_name == \"log\"]\n        assert \"value\" not in columns\n\n\ndef test_ignore_columns_input_type(es):\n    error_msg = r\"ignore_columns should be dict\\[str -> list\\]\"  # need to use string literals to avoid regex params\n    wrong_input_type = {\"log\": \"value\"}\n    with pytest.raises(TypeError, match=error_msg):\n        DeepFeatureSynthesis(\n            target_dataframe_name=\"log\",\n            entityset=es,\n            ignore_columns=wrong_input_type,\n        )\n\n\ndef test_ignore_columns_with_nonstring_values(es):\n    error_msg = \"list in ignore_columns must only have string 
values\"\n    wrong_input_list = {\"log\": [\"a\", \"b\", 3]}\n    with pytest.raises(TypeError, match=error_msg):\n        DeepFeatureSynthesis(\n            target_dataframe_name=\"log\",\n            entityset=es,\n            ignore_columns=wrong_input_list,\n        )\n\n\ndef test_ignore_columns_with_nonstring_keys(es):\n    error_msg = r\"ignore_columns should be dict\\[str -> list\\]\"  # need to use string literals to avoid regex params\n    wrong_input_keys = {1: [\"a\", \"b\", \"c\"]}\n    with pytest.raises(TypeError, match=error_msg):\n        DeepFeatureSynthesis(\n            target_dataframe_name=\"log\",\n            entityset=es,\n            ignore_columns=wrong_input_keys,\n        )\n\n\ndef test_makes_dfeatures(es):\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"sessions\",\n        entityset=es,\n        agg_primitives=[],\n        trans_primitives=[],\n    )\n\n    features = dfs_obj.build_features()\n    assert feature_with_name(features, \"customers.age\")\n\n\ndef test_makes_trans_feat(es):\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"log\",\n        entityset=es,\n        agg_primitives=[],\n        trans_primitives=[Hour],\n    )\n\n    features = dfs_obj.build_features()\n    assert feature_with_name(features, \"HOUR(datetime)\")\n\n\ndef test_handles_diff_dataframe_groupby(es):\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"log\",\n        entityset=es,\n        agg_primitives=[],\n        groupby_trans_primitives=[Diff],\n    )\n\n    features = dfs_obj.build_features()\n    assert feature_with_name(features, \"DIFF(value) by session_id\")\n    assert feature_with_name(features, \"DIFF(value) by product_id\")\n\n\ndef test_handles_time_since_previous_dataframe_groupby(es):\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"log\",\n        entityset=es,\n        agg_primitives=[],\n        groupby_trans_primitives=[TimeSincePrevious],\n    )\n\n  
  features = dfs_obj.build_features()\n    assert feature_with_name(features, \"TIME_SINCE_PREVIOUS(datetime) by session_id\")\n\n\n# M TODO\n# def test_handles_cumsum_dataframe_groupby(es):\n#     dfs_obj = DeepFeatureSynthesis(target_dataframe_name='sessions',\n#                                    entityset=es,\n#                                    agg_primitives=[],\n#                                    trans_primitives=[CumMean])\n\n#     features = dfs_obj.build_features()\n#     assert (feature_with_name(features, u'customers.CUM_MEAN(age by région_id)'))\n\n\ndef test_only_makes_supplied_trans_feat(es):\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"log\",\n        entityset=es,\n        agg_primitives=[],\n        trans_primitives=[Hour],\n    )\n\n    features = dfs_obj.build_features()\n    other_trans_features = [\n        f\n        for f in features\n        if (isinstance(f, TransformFeature) and not isinstance(f.primitive, Hour))\n        or len(\n            [\n                g\n                for g in f.base_features\n                if isinstance(g, TransformFeature) and not isinstance(g.primitive, Hour)\n            ],\n        )\n        > 0\n    ]\n    assert len(other_trans_features) == 0\n\n\ndef test_makes_dfeatures_of_agg_primitives(es):\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"sessions\",\n        entityset=es,\n        agg_primitives=[\"max\"],\n        trans_primitives=[],\n    )\n    features = dfs_obj.build_features()\n\n    assert feature_with_name(features, \"customers.MAX(log.value)\")\n\n\ndef test_makes_agg_features_of_trans_primitives(es):\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"sessions\",\n        entityset=es,\n        agg_primitives=[Mean],\n        trans_primitives=[NumCharacters],\n    )\n\n    features = dfs_obj.build_features()\n    assert feature_with_name(features, \"MEAN(log.NUM_CHARACTERS(comments))\")\n\n\ndef 
test_makes_agg_features_with_where(es):\n    es.add_interesting_values()\n\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"sessions\",\n        entityset=es,\n        agg_primitives=[Count],\n        where_primitives=[Count],\n        trans_primitives=[],\n    )\n\n    features = dfs_obj.build_features()\n    assert feature_with_name(features, \"COUNT(log WHERE priority_level = 0)\")\n\n    # make sure they are made using direct features too\n    assert feature_with_name(features, \"COUNT(log WHERE products.department = food)\")\n\n\ndef test_make_groupby_features(es):\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"log\",\n        entityset=es,\n        agg_primitives=[],\n        trans_primitives=[],\n        groupby_trans_primitives=[\"cum_sum\"],\n    )\n    features = dfs_obj.build_features()\n    assert feature_with_name(features, \"CUM_SUM(value) by session_id\")\n\n\ndef test_make_indirect_groupby_features(es):\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"log\",\n        entityset=es,\n        agg_primitives=[],\n        trans_primitives=[],\n        groupby_trans_primitives=[\"cum_sum\"],\n    )\n    features = dfs_obj.build_features()\n    assert feature_with_name(features, \"CUM_SUM(products.rating) by session_id\")\n\n\ndef test_make_groupby_features_with_id(es):\n    # Need to convert customer_id to categorical column in order to build desired feature\n    es[\"sessions\"].ww.set_types(\n        logical_types={\"customer_id\": \"Categorical\"},\n        semantic_tags={\"customer_id\": \"foreign_key\"},\n    )\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"sessions\",\n        entityset=es,\n        agg_primitives=[],\n        trans_primitives=[],\n        groupby_trans_primitives=[\"cum_count\"],\n    )\n    features = dfs_obj.build_features()\n\n    assert feature_with_name(features, \"CUM_COUNT(customer_id) by customer_id\")\n\n\ndef 
test_make_groupby_features_with_diff_id(es):\n    # Need to convert cohort to categorical column in order to build desired feature\n    es[\"customers\"].ww.set_types(\n        logical_types={\"cohort\": \"Categorical\"},\n        semantic_tags={\"cohort\": \"foreign_key\"},\n    )\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"customers\",\n        entityset=es,\n        agg_primitives=[],\n        trans_primitives=[],\n        groupby_trans_primitives=[\"cum_count\"],\n    )\n    features = dfs_obj.build_features()\n\n    groupby_with_diff_id = \"CUM_COUNT(cohort) by région_id\"\n    assert feature_with_name(features, groupby_with_diff_id)\n\n\ndef test_make_groupby_features_with_agg(es):\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"cohorts\",\n        entityset=es,\n        agg_primitives=[\"sum\"],\n        trans_primitives=[],\n        groupby_trans_primitives=[\"cum_sum\"],\n    )\n    features = dfs_obj.build_features()\n    agg_on_groupby_name = \"SUM(customers.CUM_SUM(age) by région_id)\"\n    assert feature_with_name(features, agg_on_groupby_name)\n\n\ndef test_bad_groupby_feature(es):\n    msg = re.escape(\n        \"Unknown groupby transform primitive max. 
\"\n        \"Call ft.primitives.list_primitives() to get \"\n        \"a list of available primitives\",\n    )\n    with pytest.raises(ValueError, match=msg):\n        DeepFeatureSynthesis(\n            target_dataframe_name=\"customers\",\n            entityset=es,\n            agg_primitives=[\"sum\"],\n            trans_primitives=[],\n            groupby_trans_primitives=[\"Max\"],\n        )\n\n\n@pytest.mark.parametrize(\n    \"rolling_primitive\",\n    [\n        RollingMax,\n        RollingMean,\n        RollingMin,\n        RollingOutlierCount,\n        RollingSTD,\n    ],\n)\n@pytest.mark.parametrize(\n    \"window_length, gap\",\n    [\n        (7, 3),\n        (\"7d\", \"3d\"),\n    ],\n)\ndef test_make_rolling_features(window_length, gap, rolling_primitive, es):\n    rolling_primitive_obj = rolling_primitive(\n        window_length=window_length,\n        gap=gap,\n        min_periods=5,\n    )\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"log\",\n        entityset=es,\n        agg_primitives=[],\n        trans_primitives=[rolling_primitive_obj],\n    )\n    features = dfs_obj.build_features()\n    rolling_transform_name = f\"{rolling_primitive.name.upper()}(datetime, value_many_nans, window_length={window_length}, gap={gap}, min_periods=5)\"\n    assert feature_with_name(features, rolling_transform_name)\n\n\n@pytest.mark.parametrize(\n    \"window_length, gap\",\n    [\n        (7, 3),\n        (\"7d\", \"3d\"),\n    ],\n)\ndef test_make_rolling_count_off_datetime_feature(window_length, gap, es):\n    rolling_count = RollingCount(window_length=window_length, min_periods=gap)\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"log\",\n        entityset=es,\n        agg_primitives=[],\n        trans_primitives=[rolling_count],\n    )\n    features = dfs_obj.build_features()\n    rolling_transform_name = (\n        f\"ROLLING_COUNT(datetime, window_length={window_length}, min_periods={gap})\"\n    )\n    
assert feature_with_name(features, rolling_transform_name)\n\n\ndef test_abides_by_max_depth_param(es):\n    for i in [0, 1, 2, 3]:\n        dfs_obj = DeepFeatureSynthesis(\n            target_dataframe_name=\"sessions\",\n            entityset=es,\n            agg_primitives=[Sum],\n            trans_primitives=[],\n            max_depth=i,\n        )\n\n        features = dfs_obj.build_features()\n        for f in features:\n            assert f.get_depth() <= i\n\n\ndef test_max_depth_single_table(transform_es):\n    assert len(transform_es.dataframe_dict) == 1\n\n    def make_dfs_obj(max_depth):\n        dfs_obj = DeepFeatureSynthesis(\n            target_dataframe_name=\"first\",\n            entityset=transform_es,\n            trans_primitives=[AddNumeric],\n            max_depth=max_depth,\n        )\n        return dfs_obj\n\n    for i in [-1, 0, 1, 2]:\n        if i in [-1, 2]:\n            match = (\n                \"Only one dataframe in entityset, changing max_depth to 1 \"\n                \"since deeper features cannot be created\"\n            )\n            with pytest.warns(UserWarning, match=match):\n                dfs_obj = make_dfs_obj(i)\n        else:\n            dfs_obj = make_dfs_obj(i)\n\n        features = dfs_obj.build_features()\n        assert len(features) > 0\n        if i != 0:\n            # at least one depth 1 feature made\n            assert any([f.get_depth() == 1 for f in features])\n            # no depth 2 or higher even with max_depth=2\n            assert all([f.get_depth() <= 1 for f in features])\n        else:\n            # no depth 1 or higher features with max_depth=0\n            assert all([f.get_depth() == 0 for f in features])\n\n\ndef test_drop_contains(es):\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"sessions\",\n        entityset=es,\n        agg_primitives=[Sum],\n        trans_primitives=[],\n        max_depth=1,\n        seed_features=[],\n        drop_contains=[],\n    )\n    
features = dfs_obj.build_features()\n    to_drop = features[2]\n    partial_name = to_drop.get_name()[:5]\n\n    dfs_drop = DeepFeatureSynthesis(\n        target_dataframe_name=\"sessions\",\n        entityset=es,\n        agg_primitives=[Sum],\n        trans_primitives=[],\n        max_depth=1,\n        seed_features=[],\n        drop_contains=[partial_name],\n    )\n    features = dfs_drop.build_features()\n    assert to_drop.get_name() not in [f.get_name() for f in features]\n\n\ndef test_drop_exact(es):\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"sessions\",\n        entityset=es,\n        agg_primitives=[Sum],\n        trans_primitives=[],\n        max_depth=1,\n        seed_features=[],\n        drop_exact=[],\n    )\n    features = dfs_obj.build_features()\n    to_drop = features[2]\n    name = to_drop.get_name()\n    dfs_drop = DeepFeatureSynthesis(\n        target_dataframe_name=\"sessions\",\n        entityset=es,\n        agg_primitives=[Sum],\n        trans_primitives=[],\n        max_depth=1,\n        seed_features=[],\n        drop_exact=[name],\n    )\n    features = dfs_drop.build_features()\n    assert name not in [f.get_name() for f in features]\n\n\ndef test_seed_features(es):\n    seed_feature_sessions = (\n        Feature(es[\"log\"].ww[\"id\"], parent_dataframe_name=\"sessions\", primitive=Count)\n        > 2\n    )\n    seed_feature_log = Feature(es[\"log\"].ww[\"comments\"], primitive=NumCharacters)\n    session_agg = Feature(\n        seed_feature_log,\n        parent_dataframe_name=\"sessions\",\n        primitive=Mean,\n    )\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"sessions\",\n        entityset=es,\n        agg_primitives=[Mean],\n        trans_primitives=[],\n        max_depth=2,\n        seed_features=[seed_feature_sessions, seed_feature_log],\n    )\n    features = dfs_obj.build_features()\n    assert seed_feature_sessions.get_name() in [f.get_name() for f in features]\n    
assert session_agg.get_name() in [f.get_name() for f in features]\n\n\ndef test_does_not_make_agg_of_direct_of_target_dataframe(es):\n    count_sessions = Feature(\n        es[\"sessions\"].ww[\"id\"],\n        parent_dataframe_name=\"customers\",\n        primitive=Count,\n    )\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"customers\",\n        entityset=es,\n        agg_primitives=[Last],\n        trans_primitives=[],\n        max_depth=2,\n        seed_features=[count_sessions],\n    )\n    features = dfs_obj.build_features()\n    # this feature is meaningless because customers.COUNT(sessions) is already defined on\n    # the customers dataframe\n    assert not feature_with_name(features, \"LAST(sessions.customers.COUNT(sessions))\")\n    assert not feature_with_name(features, \"LAST(sessions.customers.age)\")\n\n\ndef test_dfs_builds_on_seed_features_more_than_max_depth(es):\n    seed_feature_sessions = Feature(\n        es[\"log\"].ww[\"id\"],\n        parent_dataframe_name=\"sessions\",\n        primitive=Count,\n    )\n    seed_feature_log = Feature(es[\"log\"].ww[\"datetime\"], primitive=Hour)\n    session_agg = Feature(\n        seed_feature_log,\n        parent_dataframe_name=\"sessions\",\n        primitive=Last,\n    )\n\n    # Depth of this feat is 2 relative to session_agg, the seed feature,\n    # which is greater than max_depth so it shouldn't be built\n    session_agg_trans = DirectFeature(\n        Feature(session_agg, parent_dataframe_name=\"customers\", primitive=Mode),\n        \"sessions\",\n    )\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"sessions\",\n        entityset=es,\n        agg_primitives=[Last, Count],\n        trans_primitives=[],\n        max_depth=1,\n        seed_features=[seed_feature_sessions, seed_feature_log],\n    )\n    features = dfs_obj.build_features()\n    assert seed_feature_sessions.get_name() in [f.get_name() for f in features]\n    assert session_agg.get_name() in 
[f.get_name() for f in features]\n    assert session_agg_trans.get_name() not in [f.get_name() for f in features]\n\n\ndef test_dfs_includes_seed_features_greater_than_max_depth(es):\n    session_agg = Feature(\n        es[\"log\"].ww[\"value\"],\n        parent_dataframe_name=\"sessions\",\n        primitive=Sum,\n    )\n    customer_agg = Feature(\n        session_agg,\n        parent_dataframe_name=\"customers\",\n        primitive=Mean,\n    )\n    assert customer_agg.get_depth() == 2\n\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"customers\",\n        entityset=es,\n        agg_primitives=[Mean],\n        trans_primitives=[],\n        max_depth=1,\n        seed_features=[customer_agg],\n    )\n    features = dfs_obj.build_features()\n    assert feature_with_name(features=features, name=customer_agg.get_name())\n\n\ndef test_allowed_paths(es):\n    kwargs = dict(\n        target_dataframe_name=\"customers\",\n        entityset=es,\n        agg_primitives=[Last],\n        trans_primitives=[],\n        max_depth=2,\n        seed_features=[],\n    )\n    dfs_unconstrained = DeepFeatureSynthesis(**kwargs)\n    features_unconstrained = dfs_unconstrained.build_features()\n\n    unconstrained_names = [f.get_name() for f in features_unconstrained]\n    customers_session_feat = Feature(\n        es[\"sessions\"].ww[\"device_type\"],\n        parent_dataframe_name=\"customers\",\n        primitive=Last,\n    )\n    customers_session_log_feat = Feature(\n        es[\"log\"].ww[\"value\"],\n        parent_dataframe_name=\"customers\",\n        primitive=Last,\n    )\n    assert customers_session_feat.get_name() in unconstrained_names\n    assert customers_session_log_feat.get_name() in unconstrained_names\n\n    dfs_constrained = DeepFeatureSynthesis(\n        allowed_paths=[[\"customers\", \"sessions\"]], **kwargs\n    )\n    features = dfs_constrained.build_features()\n    names = [f.get_name() for f in features]\n    assert 
customers_session_feat.get_name() in names\n    assert customers_session_log_feat.get_name() not in names\n\n\ndef test_max_features(es):\n    kwargs = dict(\n        target_dataframe_name=\"customers\",\n        entityset=es,\n        agg_primitives=[Sum],\n        trans_primitives=[],\n        max_depth=2,\n        seed_features=[],\n    )\n    dfs_unconstrained = DeepFeatureSynthesis(**kwargs)\n    features_unconstrained = dfs_unconstrained.build_features()\n    dfs_unconstrained_with_arg = DeepFeatureSynthesis(max_features=-1, **kwargs)\n    feats_unconstrained_with_arg = dfs_unconstrained_with_arg.build_features()\n    dfs_constrained = DeepFeatureSynthesis(max_features=1, **kwargs)\n    features = dfs_constrained.build_features()\n    assert len(features_unconstrained) == len(feats_unconstrained_with_arg)\n    assert len(features) == 1\n\n\ndef test_where_primitives(es):\n    es.add_interesting_values(dataframe_name=\"sessions\", values={\"device_type\": [0]})\n    kwargs = dict(\n        target_dataframe_name=\"customers\",\n        entityset=es,\n        agg_primitives=[Count, Sum],\n        trans_primitives=[Absolute],\n        max_depth=3,\n    )\n    dfs_unconstrained = DeepFeatureSynthesis(**kwargs)\n    dfs_constrained = DeepFeatureSynthesis(where_primitives=[\"sum\"], **kwargs)\n    features_unconstrained = dfs_unconstrained.build_features()\n    features = dfs_constrained.build_features()\n\n    where_feats_unconstrained = [\n        f\n        for f in features_unconstrained\n        if isinstance(f, AggregationFeature) and f.where is not None\n    ]\n    where_feats = [\n        f for f in features if isinstance(f, AggregationFeature) and f.where is not None\n    ]\n\n    assert len(where_feats_unconstrained) >= 1\n\n    assert (\n        len([f for f in where_feats_unconstrained if isinstance(f.primitive, Sum)]) == 0\n    )\n    assert (\n        len([f for f in where_feats_unconstrained if isinstance(f.primitive, Count)])\n        > 0\n    )\n\n  
  assert len([f for f in where_feats if isinstance(f.primitive, Sum)]) > 0\n    assert len([f for f in where_feats if isinstance(f.primitive, Count)]) == 0\n    assert (\n        len(\n            [\n                d\n                for f in where_feats\n                for d in f.get_dependencies(deep=True)\n                if isinstance(d.primitive, Absolute)\n            ],\n        )\n        > 0\n    )\n\n\ndef test_stacking_where_primitives(es):\n    es = copy.deepcopy(es)\n    es.add_interesting_values(dataframe_name=\"sessions\", values={\"device_type\": [0]})\n    es.add_interesting_values(\n        dataframe_name=\"log\",\n        values={\"product_id\": [\"coke_zero\"]},\n    )\n    kwargs = dict(\n        target_dataframe_name=\"customers\",\n        entityset=es,\n        agg_primitives=[Count, Last],\n        max_depth=3,\n    )\n    dfs_where_stack_limit_1 = DeepFeatureSynthesis(\n        where_primitives=[\"last\", Count], **kwargs\n    )\n    dfs_where_stack_limit_2 = DeepFeatureSynthesis(\n        where_primitives=[\"last\", Count], where_stacking_limit=2, **kwargs\n    )\n    stack_limit_1_features = dfs_where_stack_limit_1.build_features()\n    stack_limit_2_features = dfs_where_stack_limit_2.build_features()\n\n    where_stack_1_feats = [\n        f\n        for f in stack_limit_1_features\n        if isinstance(f, AggregationFeature) and f.where is not None\n    ]\n    where_stack_2_feats = [\n        f\n        for f in stack_limit_2_features\n        if isinstance(f, AggregationFeature) and f.where is not None\n    ]\n\n    assert len(where_stack_1_feats) >= 1\n    assert len(where_stack_2_feats) >= 1\n\n    assert len([f for f in where_stack_1_feats if isinstance(f.primitive, Last)]) > 0\n    assert len([f for f in where_stack_1_feats if isinstance(f.primitive, Count)]) > 0\n\n    assert len([f for f in where_stack_2_feats if isinstance(f.primitive, Last)]) > 0\n    assert len([f for f in where_stack_2_feats if isinstance(f.primitive, 
Count)]) > 0\n\n    stacked_where_limit_1_feats = []\n    stacked_where_limit_2_feats = []\n    where_double_where_tuples = [\n        (where_stack_1_feats, stacked_where_limit_1_feats),\n        (where_stack_2_feats, stacked_where_limit_2_feats),\n    ]\n    for where_list, double_where_list in where_double_where_tuples:\n        for feature in where_list:\n            for base_feat in feature.base_features:\n                if (\n                    isinstance(base_feat, AggregationFeature)\n                    and base_feat.where is not None\n                ):\n                    double_where_list.append(feature)\n\n    assert len(stacked_where_limit_1_feats) == 0\n    assert len(stacked_where_limit_2_feats) > 0\n\n\ndef test_where_different_base_feats(es):\n    es.add_interesting_values(dataframe_name=\"sessions\", values={\"device_type\": [0]})\n\n    kwargs = dict(\n        target_dataframe_name=\"customers\",\n        entityset=es,\n        agg_primitives=[Sum, Count],\n        where_primitives=[Sum, Count],\n        max_depth=3,\n    )\n    dfs_unconstrained = DeepFeatureSynthesis(**kwargs)\n    features = dfs_unconstrained.build_features()\n    where_feats = [\n        f.unique_name()\n        for f in features\n        if isinstance(f, AggregationFeature) and f.where is not None\n    ]\n    not_where_feats = [\n        f.unique_name()\n        for f in features\n        if isinstance(f, AggregationFeature) and f.where is None\n    ]\n    for name in not_where_feats:\n        assert name not in where_feats\n\n\ndef test_dfeats_where(es):\n    es.add_interesting_values()\n\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"sessions\",\n        entityset=es,\n        agg_primitives=[Count],\n        trans_primitives=[],\n    )\n\n    features = dfs_obj.build_features()\n\n    # test to make sure we build direct features of agg features with where clause\n    assert feature_with_name(features, \"customers.COUNT(log WHERE priority_level = 
0)\")\n\n    assert feature_with_name(\n        features,\n        \"COUNT(log WHERE products.department = electronics)\",\n    )\n\n\ndef test_commutative(es):\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"log\",\n        entityset=es,\n        agg_primitives=[Sum],\n        trans_primitives=[AddNumeric],\n        max_depth=3,\n    )\n    feats = dfs_obj.build_features()\n\n    add_feats = [f for f in feats if isinstance(f.primitive, AddNumeric)]\n\n    # Check that there are no two AddNumeric features with the same base\n    # features.\n    unordered_args = set()\n    for f in add_feats:\n        arg1, arg2 = f.base_features\n        args_set = frozenset({arg1.unique_name(), arg2.unique_name()})\n        unordered_args.add(args_set)\n\n    assert len(add_feats) == len(unordered_args)\n\n\ndef test_transform_consistency(transform_es):\n    # Generate features\n    transform_es[\"first\"].ww.set_types(\n        logical_types={\"b\": \"BooleanNullable\", \"b1\": \"BooleanNullable\"},\n    )\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"first\",\n        entityset=transform_es,\n        trans_primitives=[\"and\", \"add_numeric\", \"or\"],\n        max_depth=1,\n    )\n    feature_defs = dfs_obj.build_features()\n\n    # Check for correct ordering of features\n    assert feature_with_name(feature_defs, \"a\")\n    assert feature_with_name(feature_defs, \"b\")\n    assert feature_with_name(feature_defs, \"b1\")\n    assert feature_with_name(feature_defs, \"b12\")\n    assert feature_with_name(feature_defs, \"P\")\n\n    assert feature_with_name(feature_defs, \"AND(b, b1)\")\n    assert not feature_with_name(\n        feature_defs,\n        \"AND(b1, b)\",\n    )  # make sure it doesn't exist the other way\n    assert feature_with_name(feature_defs, \"a + P\")\n    assert feature_with_name(feature_defs, \"b12 + P\")\n    assert feature_with_name(feature_defs, \"a + b12\")\n    assert feature_with_name(feature_defs, 
\"OR(b, b1)\")\n\n\ndef test_transform_no_stack_agg(es):\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"customers\",\n        entityset=es,\n        agg_primitives=[NMostCommon],\n        trans_primitives=[NotEqual],\n        max_depth=3,\n    )\n    feature_defs = dfs_obj.build_features()\n\n    assert not feature_with_name(\n        feature_defs,\n        \"id != N_MOST_COMMON(sessions.device_type)\",\n    )\n\n\ndef test_initialized_trans_prim(es):\n    prim = IsIn(list_of_outputs=[\"coke zero\"])\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"log\",\n        entityset=es,\n        agg_primitives=[],\n        trans_primitives=[prim],\n    )\n\n    features = dfs_obj.build_features()\n\n    assert feature_with_name(features, \"product_id.isin(['coke zero'])\")\n\n\ndef test_initialized_agg_prim(es):\n    ThreeMost = NMostCommon(n=3)\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"sessions\",\n        entityset=es,\n        agg_primitives=[ThreeMost],\n        trans_primitives=[],\n    )\n    features = dfs_obj.build_features()\n\n    assert feature_with_name(features, \"N_MOST_COMMON(log.subregioncode)\")\n\n\ndef test_return_types(es):\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"sessions\",\n        entityset=es,\n        agg_primitives=[Count, NMostCommon],\n        trans_primitives=[Absolute, Hour, IsIn],\n    )\n\n    discrete = ColumnSchema(semantic_tags={\"category\"})\n    numeric = ColumnSchema(semantic_tags={\"numeric\"})\n    datetime = ColumnSchema(logical_type=Datetime)\n\n    f1 = dfs_obj.build_features(return_types=None)\n    f2 = dfs_obj.build_features(return_types=[discrete])\n    f3 = dfs_obj.build_features(return_types=\"all\")\n    f4 = dfs_obj.build_features(return_types=[datetime])\n\n    f1_types = [f.column_schema for f in f1]\n    f2_types = [f.column_schema for f in f2]\n    f3_types = [f.column_schema for f in f3]\n    f4_types = 
[f.column_schema for f in f4]\n\n    assert any([is_valid_input(schema, discrete) for schema in f1_types])\n    assert any([is_valid_input(schema, numeric) for schema in f1_types])\n    assert not any([is_valid_input(schema, datetime) for schema in f1_types])\n\n    assert any([is_valid_input(schema, discrete) for schema in f2_types])\n    assert not any([is_valid_input(schema, numeric) for schema in f2_types])\n    assert not any([is_valid_input(schema, datetime) for schema in f2_types])\n\n    assert any([is_valid_input(schema, discrete) for schema in f3_types])\n    assert any([is_valid_input(schema, numeric) for schema in f3_types])\n    assert any([is_valid_input(schema, datetime) for schema in f3_types])\n\n    assert not any([is_valid_input(schema, discrete) for schema in f4_types])\n    assert not any([is_valid_input(schema, numeric) for schema in f4_types])\n    assert any([is_valid_input(schema, datetime) for schema in f4_types])\n\n\ndef test_checks_primitives_correct_type(es):\n    error_text = (\n        \"Primitive <class \\\\'featuretools\\\\.primitives\\\\.standard\\\\.\"\n        \"transform\\\\.datetime\\\\.hour\\\\.Hour\\\\'> in \"\n        \"agg_primitives is not an aggregation primitive\"\n    )\n    with pytest.raises(ValueError, match=error_text):\n        DeepFeatureSynthesis(\n            target_dataframe_name=\"sessions\",\n            entityset=es,\n            agg_primitives=[Hour],\n            trans_primitives=[],\n        )\n\n    error_text = (\n        \"Primitive <class \\\\'featuretools\\\\.primitives\\\\.standard\\\\.\"\n        \"aggregation\\\\.sum_primitive\\\\.Sum\\\\'> in trans_primitives \"\n        \"is not a transform primitive\"\n    )\n    with pytest.raises(ValueError, match=error_text):\n        DeepFeatureSynthesis(\n            target_dataframe_name=\"sessions\",\n            entityset=es,\n            agg_primitives=[],\n            trans_primitives=[Sum],\n        )\n\n\ndef 
test_makes_agg_features_along_multiple_paths(diamond_es):\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"regions\",\n        entityset=diamond_es,\n        agg_primitives=[\"mean\"],\n        trans_primitives=[],\n    )\n\n    features = dfs_obj.build_features()\n    assert feature_with_name(features, \"MEAN(customers.transactions.amount)\")\n    assert feature_with_name(features, \"MEAN(stores.transactions.amount)\")\n\n\ndef test_makes_direct_features_through_multiple_relationships(games_es):\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"games\",\n        entityset=games_es,\n        agg_primitives=[\"mean\"],\n        trans_primitives=[],\n    )\n\n    features = dfs_obj.build_features()\n\n    teams = [\"home\", \"away\"]\n    for forward in teams:\n        for backward in teams:\n            for col in teams:\n                f = \"teams[%s_team_id].MEAN(games[%s_team_id].%s_team_score)\" % (\n                    forward,\n                    backward,\n                    col,\n                )\n                assert feature_with_name(features, f)\n\n\ndef test_stacks_multioutput_features(es):\n    class TestTime(TransformPrimitive):\n        name = \"test_time\"\n        input_types = [ColumnSchema(logical_type=Datetime)]\n        return_type = ColumnSchema(semantic_tags={\"numeric\"})\n        number_output_features = 6\n\n        def get_function(self):\n            def test_f(x):\n                times = pd.Series(x)\n                units = [\"year\", \"month\", \"day\", \"hour\", \"minute\", \"second\"]\n                return [times.apply(lambda x: getattr(x, unit)) for unit in units]\n\n            return test_f\n\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"customers\",\n        entityset=es,\n        agg_primitives=[NumUnique, NMostCommon(n=3)],\n        trans_primitives=[TestTime, Diff],\n        max_depth=4,\n    )\n    feat = dfs_obj.build_features()\n\n    for i in 
range(3):\n        f = \"NUM_UNIQUE(sessions.N_MOST_COMMON(log.countrycode)[%d])\" % i\n        assert feature_with_name(feat, f)\n\n\ndef test_seed_multi_output_feature_stacking(es):\n    threecommon = NMostCommon(3)\n    tc = Feature(\n        es[\"log\"].ww[\"product_id\"],\n        parent_dataframe_name=\"sessions\",\n        primitive=threecommon,\n    )\n\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"customers\",\n        entityset=es,\n        seed_features=[tc],\n        agg_primitives=[NumUnique],\n        trans_primitives=[],\n        max_depth=4,\n    )\n    feat = dfs_obj.build_features()\n\n    for i in range(3):\n        f = \"NUM_UNIQUE(sessions.N_MOST_COMMON(log.product_id)[%d])\" % i\n        assert feature_with_name(feat, f)\n\n\ndef test_makes_direct_features_along_multiple_paths(diamond_es):\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"transactions\",\n        entityset=diamond_es,\n        max_depth=3,\n        agg_primitives=[],\n        trans_primitives=[],\n    )\n\n    features = dfs_obj.build_features()\n    assert feature_with_name(features, \"customers.regions.name\")\n    assert feature_with_name(features, \"stores.regions.name\")\n\n\ndef test_does_not_make_trans_of_single_direct_feature(es):\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"sessions\",\n        entityset=es,\n        agg_primitives=[],\n        trans_primitives=[\"weekday\"],\n        max_depth=2,\n    )\n\n    features = dfs_obj.build_features()\n\n    assert not feature_with_name(features, \"WEEKDAY(customers.signup_date)\")\n    assert feature_with_name(features, \"customers.WEEKDAY(signup_date)\")\n\n\ndef test_makes_trans_of_multiple_direct_features(diamond_es):\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"transactions\",\n        entityset=diamond_es,\n        agg_primitives=[\"mean\"],\n        trans_primitives=[Equal],\n        max_depth=4,\n    )\n\n    features = 
dfs_obj.build_features()\n\n    # Make trans of direct and non-direct\n    assert feature_with_name(features, \"amount = stores.MEAN(transactions.amount)\")\n\n    # Make trans of direct features on different dataframes\n    assert feature_with_name(\n        features,\n        \"customers.MEAN(transactions.amount) = stores.square_ft\",\n    )\n\n    # Make trans of direct features on same dataframe with different paths.\n    assert feature_with_name(features, \"customers.regions.name = stores.regions.name\")\n\n    # Don't make trans of direct features with same path.\n    assert not feature_with_name(\n        features,\n        \"stores.square_ft = stores.MEAN(transactions.amount)\",\n    )\n    assert not feature_with_name(\n        features,\n        \"stores.MEAN(transactions.amount) = stores.square_ft\",\n    )\n\n    # The naming of the below is confusing but this is a direct feature of a transform.\n    assert feature_with_name(features, \"stores.MEAN(transactions.amount) = square_ft\")\n\n\ndef test_makes_direct_of_agg_of_trans_on_target(es):\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"log\",\n        entityset=es,\n        agg_primitives=[\"mean\"],\n        trans_primitives=[Absolute],\n        max_depth=3,\n    )\n\n    features = dfs_obj.build_features()\n    assert feature_with_name(features, \"sessions.MEAN(log.ABSOLUTE(value))\")\n\n\ndef test_primitive_options_errors(es):\n    wrong_key_options = {\"mean\": {\"ignore_dataframe\": [\"sessions\"]}}\n    wrong_type_list = {\"mean\": {\"ignore_dataframes\": \"sessions\"}}\n    wrong_type_dict = {\"mean\": {\"ignore_columns\": {\"sessions\": \"product_id\"}}}\n    conflicting_primitive_options = {\n        (\"count\", \"mean\"): {\"ignore_dataframes\": [\"sessions\"]},\n        \"mean\": {\"include_dataframes\": [\"sessions\"]},\n    }\n    invalid_dataframe = {\"mean\": {\"include_dataframes\": [\"invalid_dataframe\"]}}\n    invalid_column_dataframe = {\n        \"mean\": 
{\"include_columns\": {\"invalid_dataframe\": [\"product_id\"]}},\n    }\n    invalid_column = {\"mean\": {\"include_columns\": {\"sessions\": [\"invalid_column\"]}}}\n    key_error_text = \"Unrecognized primitive option 'ignore_dataframe' for mean\"\n    list_error_text = \"Incorrect type formatting for 'ignore_dataframes' for mean\"\n    dict_error_text = \"Incorrect type formatting for 'ignore_columns' for mean\"\n    conflicting_error_text = \"Multiple options found for primitive mean\"\n    invalid_dataframe_warning = \"Dataframe 'invalid_dataframe' not in entityset\"\n    invalid_column_warning = \"Column 'invalid_column' not in dataframe 'sessions'\"\n    with pytest.raises(KeyError, match=key_error_text):\n        DeepFeatureSynthesis(\n            target_dataframe_name=\"customers\",\n            entityset=es,\n            agg_primitives=[\"mean\"],\n            trans_primitives=[],\n            primitive_options=wrong_key_options,\n        )\n    with pytest.raises(TypeError, match=list_error_text):\n        DeepFeatureSynthesis(\n            target_dataframe_name=\"customers\",\n            entityset=es,\n            agg_primitives=[\"mean\"],\n            trans_primitives=[],\n            primitive_options=wrong_type_list,\n        )\n    with pytest.raises(TypeError, match=dict_error_text):\n        DeepFeatureSynthesis(\n            target_dataframe_name=\"customers\",\n            entityset=es,\n            agg_primitives=[\"mean\"],\n            trans_primitives=[],\n            primitive_options=wrong_type_dict,\n        )\n    with pytest.raises(KeyError, match=conflicting_error_text):\n        DeepFeatureSynthesis(\n            target_dataframe_name=\"customers\",\n            entityset=es,\n            agg_primitives=[\"mean\"],\n            trans_primitives=[],\n            primitive_options=conflicting_primitive_options,\n        )\n    with pytest.warns(UserWarning, match=invalid_dataframe_warning) as record:\n        DeepFeatureSynthesis(\n  
          target_dataframe_name=\"customers\",\n            entityset=es,\n            agg_primitives=[\"mean\"],\n            trans_primitives=[],\n            primitive_options=invalid_dataframe,\n        )\n    assert len(record) == 1\n    with pytest.warns(UserWarning, match=invalid_dataframe_warning) as record:\n        DeepFeatureSynthesis(\n            target_dataframe_name=\"customers\",\n            entityset=es,\n            agg_primitives=[\"mean\"],\n            trans_primitives=[],\n            primitive_options=invalid_column_dataframe,\n        )\n    assert len(record) == 1\n    with pytest.warns(UserWarning, match=invalid_column_warning) as record:\n        DeepFeatureSynthesis(\n            target_dataframe_name=\"customers\",\n            entityset=es,\n            agg_primitives=[\"mean\"],\n            trans_primitives=[],\n            primitive_options=invalid_column,\n        )\n    assert len(record) == 1\n\n\ndef test_primitive_options(es):\n    options = {\n        \"sum\": {\"include_columns\": {\"customers\": [\"age\"]}},\n        \"mean\": {\"include_dataframes\": [\"customers\"]},\n        \"mode\": {\"ignore_dataframes\": [\"sessions\"]},\n        \"num_unique\": {\"ignore_columns\": {\"customers\": [\"engagement_level\"]}},\n    }\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"cohorts\",\n        entityset=es,\n        primitive_options=options,\n    )\n    features = dfs_obj.build_features()\n\n    for f in features:\n        deps = f.get_dependencies(deep=True)\n        df_names = [d.dataframe_name for d in deps]\n        columns = [d for d in deps if isinstance(d, IdentityFeature)]\n        if isinstance(f.primitive, Sum):\n            for identity_base in columns:\n                if identity_base.dataframe_name == \"customers\":\n                    assert identity_base.get_name() == \"age\"\n        if isinstance(f.primitive, Mean):\n            assert all([df_name in [\"customers\"] for df_name in 
df_names])\n        if isinstance(f.primitive, Mode):\n            assert \"sessions\" not in df_names\n        if isinstance(f.primitive, NumUnique):\n            for identity_base in columns:\n                assert not (\n                    identity_base.dataframe_name == \"customers\"\n                    and identity_base.get_name() == \"engagement_level\"\n                )\n\n    options = {\n        \"month\": {\"ignore_columns\": {\"customers\": [\"birthday\"]}},\n        \"day\": {\"include_columns\": {\"customers\": [\"signup_date\", \"upgrade_date\"]}},\n        \"num_characters\": {\"ignore_dataframes\": [\"customers\"]},\n        \"year\": {\"include_dataframes\": [\"customers\"]},\n    }\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"customers\",\n        entityset=es,\n        agg_primitives=[],\n        ignore_dataframes=[\"cohort\"],\n        primitive_options=options,\n    )\n    features = dfs_obj.build_features()\n    assert not any([isinstance(f.primitive, NumCharacters) for f in features])\n    for f in features:\n        deps = f.get_dependencies(deep=True)\n        df_names = [d.dataframe_name for d in deps]\n        columns = [d for d in deps if isinstance(d, IdentityFeature)]\n        if isinstance(f.primitive, Month):\n            for identity_base in columns:\n                assert not (\n                    identity_base.dataframe_name == \"customers\"\n                    and identity_base.get_name() == \"birthday\"\n                )\n        if isinstance(f.primitive, Day):\n            for identity_base in columns:\n                if identity_base.dataframe_name == \"customers\":\n                    assert (\n                        identity_base.get_name() == \"signup_date\"\n                        or identity_base.get_name() == \"upgrade_date\"\n                    )\n        if isinstance(f.primitive, Year):\n            assert all([df_name in [\"customers\"] for df_name in df_names])\n\n\ndef 
test_primitive_options_with_globals(es):\n    # non-overlapping ignore_dataframes\n    options = {\"mode\": {\"ignore_dataframes\": [\"sessions\"]}}\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"cohorts\",\n        entityset=es,\n        ignore_dataframes=[\"régions\"],\n        primitive_options=options,\n    )\n    features = dfs_obj.build_features()\n    for f in features:\n        deps = f.get_dependencies(deep=True)\n        df_names = [d.dataframe_name for d in deps]\n        assert \"régions\" not in df_names\n        if isinstance(f.primitive, Mode):\n            assert \"sessions\" not in df_names\n\n    # non-overlapping ignore_columns\n    options = {\"num_unique\": {\"ignore_columns\": {\"customers\": [\"engagement_level\"]}}}\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"customers\",\n        entityset=es,\n        ignore_columns={\"customers\": [\"région_id\"]},\n        primitive_options=options,\n    )\n    features = dfs_obj.build_features()\n    for f in features:\n        deps = f.get_dependencies(deep=True)\n        columns = [d for d in deps if isinstance(d, IdentityFeature)]\n        for identity_base in columns:\n            assert not (\n                identity_base.dataframe_name == \"customers\"\n                and identity_base.get_name() == \"région_id\"\n            )\n        if isinstance(f.primitive, NumUnique):\n            for identity_base in columns:\n                assert not (\n                    identity_base.dataframe_name == \"customers\"\n                    and identity_base.get_name() == \"engagement_level\"\n                )\n\n    # Overlapping globals/options with ignore_dataframes\n    options = {\n        \"mode\": {\n            \"include_dataframes\": [\"sessions\", \"customers\"],\n            \"ignore_columns\": {\"customers\": [\"région_id\"]},\n        },\n        \"num_unique\": {\n            \"include_dataframes\": [\"sessions\", \"customers\"],\n          
  \"include_columns\": {\"sessions\": [\"device_type\"], \"customers\": [\"age\"]},\n        },\n        \"month\": {\"ignore_columns\": {\"cohorts\": [\"cohort_end\"]}},\n    }\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"cohorts\",\n        entityset=es,\n        ignore_dataframes=[\"sessions\"],\n        ignore_columns={\"customers\": [\"age\"]},\n        primitive_options=options,\n    )\n    features = dfs_obj.build_features()\n    for f in features:\n        assert f.primitive.name != \"month\"\n        # ignoring cohorts means no features are created\n        assert not isinstance(f.primitive, Month)\n\n        deps = f.get_dependencies(deep=True)\n        df_names = [d.dataframe_name for d in deps]\n        columns = [d for d in deps if isinstance(d, IdentityFeature)]\n        if isinstance(f.primitive, Mode):\n            assert [all([df_name in [\"sessions\", \"customers\"] for df_name in df_names])]\n            for identity_base in columns:\n                assert not (\n                    identity_base.dataframe_name == \"customers\"\n                    and (\n                        identity_base.get_name() == \"age\"\n                        or identity_base.get_name() == \"région_id\"\n                    )\n                )\n        elif isinstance(f.primitive, NumUnique):\n            assert [all([df_name in [\"sessions\", \"customers\"] for df_name in df_names])]\n            for identity_base in columns:\n                if identity_base.dataframe_name == \"sessions\":\n                    assert identity_base.get_name() == \"device_type\"\n        # All other primitives ignore 'sessions' and 'age'\n        else:\n            assert \"sessions\" not in df_names\n            for identity_base in columns:\n                assert not (\n                    identity_base.dataframe_name == \"customers\"\n                    and identity_base.get_name() == \"age\"\n                )\n\n\ndef 
test_primitive_options_groupbys(es):\n    options = {\n        \"cum_count\": {\"include_groupby_dataframes\": [\"log\", \"customers\"]},\n        \"cum_sum\": {\"ignore_groupby_dataframes\": [\"sessions\"]},\n        \"cum_mean\": {\n            \"ignore_groupby_columns\": {\n                \"customers\": [\"région_id\"],\n                \"log\": [\"session_id\"],\n            },\n        },\n        \"cum_min\": {\n            \"include_groupby_columns\": {\"sessions\": [\"customer_id\", \"device_type\"]},\n        },\n    }\n\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"log\",\n        entityset=es,\n        agg_primitives=[],\n        trans_primitives=[],\n        max_depth=3,\n        groupby_trans_primitives=[\"cum_sum\", \"cum_count\", \"cum_min\", \"cum_mean\"],\n        primitive_options=options,\n    )\n    features = dfs_obj.build_features()\n    for f in features:\n        if isinstance(f, GroupByTransformFeature):\n            deps = f.groupby.get_dependencies(deep=True)\n            df_names = [d.dataframe_name for d in deps] + [f.groupby.dataframe_name]\n            columns = [d for d in deps if isinstance(d, IdentityFeature)]\n            columns += [f.groupby] if isinstance(f.groupby, IdentityFeature) else []\n        if isinstance(f.primitive, CumMean):\n            for identity_groupby in columns:\n                assert not (\n                    identity_groupby.dataframe_name == \"customers\"\n                    and identity_groupby.get_name() == \"région_id\"\n                )\n                assert not (\n                    identity_groupby.dataframe_name == \"log\"\n                    and identity_groupby.get_name() == \"session_id\"\n                )\n        if isinstance(f.primitive, CumCount):\n            assert all([name in [\"log\", \"customers\"] for name in df_names])\n        if isinstance(f.primitive, CumSum):\n            assert \"sessions\" not in df_names\n        if isinstance(f.primitive, 
CumMin):\n            for identity_groupby in columns:\n                if identity_groupby.dataframe_name == \"sessions\":\n                    assert (\n                        identity_groupby.get_name() == \"customer_id\"\n                        or identity_groupby.get_name() == \"device_type\"\n                    )\n\n\ndef test_primitive_options_multiple_inputs(es):\n    too_many_options = {\n        \"mode\": [{\"include_dataframes\": [\"logs\"]}, {\"ignore_dataframes\": [\"sessions\"]}],\n    }\n    error_msg = \"Number of options does not match number of inputs for primitive mode\"\n    with pytest.raises(AssertionError, match=error_msg):\n        DeepFeatureSynthesis(\n            target_dataframe_name=\"customers\",\n            entityset=es,\n            agg_primitives=[\"mode\"],\n            trans_primitives=[],\n            primitive_options=too_many_options,\n        )\n\n    unknown_primitive = Trend()\n    unknown_primitive.name = \"unknown_primitive\"\n    unknown_primitive_option = {\n        \"unknown_primitive\": [\n            {\"include_dataframes\": [\"logs\"]},\n            {\"ignore_dataframes\": [\"sessions\"]},\n        ],\n    }\n    error_msg = \"Unknown primitive with name 'unknown_primitive'\"\n    with pytest.raises(ValueError, match=error_msg):\n        DeepFeatureSynthesis(\n            target_dataframe_name=\"customers\",\n            entityset=es,\n            agg_primitives=[unknown_primitive],\n            trans_primitives=[],\n            primitive_options=unknown_primitive_option,\n        )\n\n    options1 = {\n        \"trend\": [\n            {\"include_dataframes\": [\"log\"], \"ignore_columns\": {\"log\": [\"value\"]}},\n            {\"include_dataframes\": [\"log\"], \"include_columns\": {\"log\": [\"datetime\"]}},\n        ],\n    }\n    dfs_obj1 = DeepFeatureSynthesis(\n        target_dataframe_name=\"sessions\",\n        entityset=es,\n        agg_primitives=[\"trend\"],\n        trans_primitives=[],\n        
primitive_options=options1,\n    )\n    features1 = dfs_obj1.build_features()\n    for f in features1:\n        deps = f.get_dependencies()\n        df_names = [d.dataframe_name for d in deps]\n        columns = [d.get_name() for d in deps]\n        if f.primitive.name == \"trend\":\n            assert all([df_name in [\"log\"] for df_name in df_names])\n            assert \"datetime\" in columns\n            if len(columns) == 2:\n                assert \"value\" != columns[0]\n\n    options2 = {\n        Trend: [\n            {\"include_dataframes\": [\"log\"], \"ignore_columns\": {\"log\": [\"value\"]}},\n            {\"include_dataframes\": [\"log\"], \"include_columns\": {\"log\": [\"datetime\"]}},\n        ],\n    }\n    dfs_obj2 = DeepFeatureSynthesis(\n        target_dataframe_name=\"sessions\",\n        entityset=es,\n        agg_primitives=[\"trend\"],\n        trans_primitives=[],\n        primitive_options=options2,\n    )\n    features2 = dfs_obj2.build_features()\n\n    assert set(features2) == set(features1)\n\n\ndef test_primitive_options_class_names(es):\n    options1 = {\"mean\": {\"include_dataframes\": [\"customers\"]}}\n\n    options2 = {Mean: {\"include_dataframes\": [\"customers\"]}}\n\n    bad_options = {\n        \"mean\": {\"include_dataframes\": [\"customers\"]},\n        Mean: {\"ignore_dataframes\": [\"customers\"]},\n    }\n    conflicting_error_text = \"Multiple options found for primitive mean\"\n\n    primitives = [[\"mean\"], [Mean]]\n    options = [options1, options2]\n\n    features = []\n    for primitive in primitives:\n        with pytest.raises(KeyError, match=conflicting_error_text):\n            DeepFeatureSynthesis(\n                target_dataframe_name=\"cohorts\",\n                entityset=es,\n                agg_primitives=primitive,\n                trans_primitives=[],\n                primitive_options=bad_options,\n            )\n        for option in options:\n            dfs_obj = DeepFeatureSynthesis(\n        
        target_dataframe_name=\"cohorts\",\n                entityset=es,\n                agg_primitives=primitive,\n                trans_primitives=[],\n                primitive_options=option,\n            )\n            features.append(set(dfs_obj.build_features()))\n\n    for f in features[0]:\n        deps = f.get_dependencies(deep=True)\n        df_names = [d.dataframe_name for d in deps]\n        if isinstance(f.primitive, Mean):\n            assert all(df_name == \"customers\" for df_name in df_names)\n\n    assert features[0] == features[1] == features[2] == features[3]\n\n\ndef test_primitive_options_instantiated_primitive(es):\n    warning_msg = (\n        \"Options present for primitive instance and generic \"\n        \"primitive class \\\\(mean\\\\), primitive instance will not use generic \"\n        \"options\"\n    )\n\n    skipna_mean = Mean(skipna=False)\n    options = {\n        skipna_mean: {\"include_dataframes\": [\"stores\"]},\n        \"mean\": {\"ignore_dataframes\": [\"stores\"]},\n    }\n    with pytest.warns(UserWarning, match=warning_msg):\n        dfs_obj = DeepFeatureSynthesis(\n            target_dataframe_name=\"régions\",\n            entityset=es,\n            agg_primitives=[\"mean\", skipna_mean],\n            trans_primitives=[],\n            primitive_options=options,\n        )\n\n    features = dfs_obj.build_features()\n    for f in features:\n        deps = f.get_dependencies(deep=True)\n        df_names = [d.dataframe_name for d in deps]\n        if f.primitive == skipna_mean:\n            assert all(df_name == \"stores\" for df_name in df_names)\n        elif isinstance(f.primitive, Mean):\n            assert \"stores\" not in df_names\n\n\ndef test_primitive_options_commutative(es):\n    class AddThree(TransformPrimitive):\n        name = \"add_three\"\n        input_types = [\n            ColumnSchema(semantic_tags={\"numeric\"}),\n            ColumnSchema(semantic_tags={\"numeric\"}),\n            
ColumnSchema(semantic_tags={\"numeric\"}),\n        ]\n        return_type = ColumnSchema(semantic_tags={\"numeric\"})\n        commutative = True\n\n        def generate_name(self, base_feature_names):\n            return \"%s + %s + %s\" % (\n                base_feature_names[0],\n                base_feature_names[1],\n                base_feature_names[2],\n            )\n\n    options = {\n        \"add_numeric\": [\n            {\"include_columns\": {\"log\": [\"value_2\"]}},\n            {\"include_columns\": {\"log\": [\"value\"]}},\n        ],\n        AddThree: [\n            {\"include_columns\": {\"log\": [\"value_2\"]}},\n            {\"include_columns\": {\"log\": [\"value_many_nans\"]}},\n            {\"include_columns\": {\"log\": [\"value\"]}},\n        ],\n    }\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"log\",\n        entityset=es,\n        agg_primitives=[],\n        trans_primitives=[AddNumeric, AddThree],\n        primitive_options=options,\n        max_depth=1,\n    )\n    features = dfs_obj.build_features()\n    add_numeric = [f for f in features if isinstance(f.primitive, AddNumeric)]\n    assert len(add_numeric) == 1\n    deps = add_numeric[0].get_dependencies(deep=True)\n    assert deps[0].get_name() == \"value_2\" and deps[1].get_name() == \"value\"\n\n    add_three = [f for f in features if isinstance(f.primitive, AddThree)]\n    assert len(add_three) == 1\n    deps = add_three[0].get_dependencies(deep=True)\n    assert (\n        deps[0].get_name() == \"value_2\"\n        and deps[1].get_name() == \"value_many_nans\"\n        and deps[2].get_name() == \"value\"\n    )\n\n\ndef test_primitive_options_include_over_exclude(es):\n    options = {\n        \"mean\": {\"ignore_dataframes\": [\"stores\"], \"include_dataframes\": [\"stores\"]},\n    }\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"régions\",\n        entityset=es,\n        agg_primitives=[\"mean\"],\n        
trans_primitives=[],\n        primitive_options=options,\n    )\n\n    features = dfs_obj.build_features()\n    at_least_one_mean = False\n    for f in features:\n        deps = f.get_dependencies(deep=True)\n        dataframes = [d.dataframe_name for d in deps]\n        if isinstance(f.primitive, Mean):\n            at_least_one_mean = True\n            assert \"stores\" in dataframes\n    assert at_least_one_mean\n\n\ndef test_primitive_ordering():\n    # Test that the order of the input primitives impacts neither\n    # which features are created nor their order\n    es = make_ecommerce_entityset()\n\n    trans_prims = [AddNumeric, Absolute, \"divide_numeric\", NotEqual, \"is_null\"]\n    groupby_trans_prim = [\"cum_mean\", CumMin, CumSum]\n    agg_prims = [NMostCommon(n=3), Sum, Mean, Mean(skipna=False), \"min\", \"max\"]\n    where_prims = [\"count\", Sum]\n\n    seed_num_chars = Feature(\n        es[\"customers\"].ww[\"favorite_quote\"],\n        primitive=NumCharacters,\n    )\n    seed_is_null = Feature(es[\"customers\"].ww[\"age\"], primitive=IsNull)\n    seed_features = [seed_num_chars, seed_is_null]\n\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"customers\",\n        entityset=es,\n        trans_primitives=trans_prims,\n        groupby_trans_primitives=groupby_trans_prim,\n        agg_primitives=agg_prims,\n        where_primitives=where_prims,\n        seed_features=seed_features,\n        max_features=-1,\n        max_depth=2,\n    )\n    features1 = dfs_obj.build_features()\n\n    trans_prims.reverse()\n    groupby_trans_prim.reverse()\n    agg_prims.reverse()\n    where_prims.reverse()\n    seed_features.reverse()\n\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"customers\",\n        entityset=es,\n        trans_primitives=trans_prims,\n        groupby_trans_primitives=groupby_trans_prim,\n        agg_primitives=agg_prims,\n        where_primitives=where_prims,\n        seed_features=seed_features,\n    
    max_features=-1,\n        max_depth=2,\n    )\n    features2 = dfs_obj.build_features()\n\n    assert len(features1) == len(features2)\n\n    for i in range(len(features2)):\n        assert features1[i].unique_name() == features2[i].unique_name()\n\n\ndef test_no_transform_stacking():\n    df1 = pd.DataFrame({\"id\": [0, 1, 2, 3], \"A\": [0, 1, 2, 3]})\n    df2 = pd.DataFrame(\n        {\"index\": [0, 1, 2, 3], \"first_id\": [0, 1, 1, 3], \"B\": [99, 88, 77, 66]},\n    )\n\n    dataframes = {\"first\": (df1, \"id\"), \"second\": (df2, \"index\")}\n    relationships = [(\"first\", \"id\", \"second\", \"first_id\")]\n    es = EntitySet(\"data\", dataframes, relationships)\n\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"second\",\n        entityset=es,\n        trans_primitives=[\"negate\", \"add_numeric\"],\n        agg_primitives=[\"sum\"],\n        max_depth=4,\n    )\n    feature_defs = dfs_obj.build_features()\n\n    expected = [\n        \"first_id\",\n        \"B\",\n        \"-(B)\",\n        \"first.A\",\n        \"first.SUM(second.B)\",\n        \"first.-(A)\",\n        \"B + first.A\",\n        \"first.SUM(second.-(B))\",\n        \"first.A + SUM(second.B)\",\n        \"first.-(SUM(second.B))\",\n        \"B + first.SUM(second.B)\",\n        \"first.A + SUM(second.-(B))\",\n        \"first.SUM(second.-(B)) + SUM(second.B)\",\n        \"first.-(SUM(second.-(B)))\",\n        \"B + first.SUM(second.-(B))\",\n    ]\n\n    assert len(feature_defs) == len(expected)\n\n    for feature_name in expected:\n        assert feature_with_name(feature_defs, feature_name)\n\n\ndef test_builds_seed_features_on_foreign_key_col(es):\n    seed_feature_sessions = Feature(es[\"sessions\"].ww[\"customer_id\"], primitive=Negate)\n\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"sessions\",\n        entityset=es,\n        agg_primitives=[],\n        trans_primitives=[],\n        max_depth=2,\n        
seed_features=[seed_feature_sessions],\n    )\n\n    features = dfs_obj.build_features()\n    assert feature_with_name(features, \"-(customer_id)\")\n\n\ndef test_does_not_build_features_on_last_time_index_col(es):\n    es.add_last_time_indexes()\n\n    dfs_obj = DeepFeatureSynthesis(target_dataframe_name=\"log\", entityset=es)\n\n    features = dfs_obj.build_features()\n\n    for feature in features:\n        assert LTI_COLUMN_NAME not in feature.get_name()\n\n\ndef test_builds_features_using_all_input_types(es):\n    new_log_df = es[\"log\"]\n    new_log_df.ww[\"purchased_nullable\"] = es[\"log\"][\"purchased\"]\n    new_log_df.ww.set_types(logical_types={\"purchased_nullable\": \"boolean_nullable\"})\n    es.replace_dataframe(\"log\", new_log_df)\n\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"log\",\n        entityset=es,\n        trans_primitives=[Not],\n        max_depth=1,\n    )\n    trans_features = dfs_obj.build_features()\n    assert feature_with_name(trans_features, \"NOT(purchased)\")\n    assert feature_with_name(trans_features, \"NOT(purchased_nullable)\")\n\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"log\",\n        entityset=es,\n        groupby_trans_primitives=[Not],\n        max_depth=1,\n    )\n    groupby_trans_features = dfs_obj.build_features()\n    assert feature_with_name(groupby_trans_features, \"NOT(purchased) by session_id\")\n    assert feature_with_name(\n        groupby_trans_features,\n        \"NOT(purchased_nullable) by session_id\",\n    )\n\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"sessions\",\n        entityset=es,\n        trans_primitives=[],\n        agg_primitives=[NumTrue],\n    )\n    agg_features = dfs_obj.build_features()\n    assert feature_with_name(agg_features, \"NUM_TRUE(log.purchased)\")\n    assert feature_with_name(agg_features, \"NUM_TRUE(log.purchased_nullable)\")\n\n\ndef test_make_groupby_features_with_depth_none(es):\n    # If 
max_depth is set to -1, it sets it to None internally, so this\n    # test validates code paths that have a None max_depth\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"log\",\n        entityset=es,\n        agg_primitives=[],\n        trans_primitives=[],\n        groupby_trans_primitives=[\"cum_sum\"],\n        max_depth=-1,\n    )\n    features = dfs_obj.build_features()\n    assert feature_with_name(features, \"CUM_SUM(value) by session_id\")\n\n\ndef test_check_stacking_when_building_transform_features(es):\n    class NewMean(Mean):\n        name = \"NEW_MEAN\"\n        base_of_exclude = [Absolute]\n\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"log\",\n        entityset=es,\n        agg_primitives=[NewMean, \"mean\"],\n        trans_primitives=[\"absolute\"],\n        max_depth=-1,\n    )\n    features = dfs_obj.build_features()\n    assert number_of_features_with_name_like(features, \"ABSOLUTE(MEAN\") > 0\n    assert number_of_features_with_name_like(features, \"ABSOLUTE(NEW_MEAN\") == 0\n\n\ndef test_check_stacking_when_building_groupby_features(es):\n    class NewMean(Mean):\n        name = \"NEW_MEAN\"\n        base_of_exclude = [CumSum]\n\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"log\",\n        entityset=es,\n        agg_primitives=[NewMean, \"mean\"],\n        groupby_trans_primitives=[\"cum_sum\"],\n        max_depth=5,\n    )\n    features = dfs_obj.build_features()\n    assert number_of_features_with_name_like(features, \"CUM_SUM(MEAN\") > 0\n    assert number_of_features_with_name_like(features, \"CUM_SUM(NEW_MEAN\") == 0\n\n\ndef test_check_stacking_when_building_agg_features(es):\n    class NewAbsolute(Absolute):\n        name = \"NEW_ABSOLUTE\"\n        base_of_exclude = [Mean]\n\n    dfs_obj = DeepFeatureSynthesis(\n        target_dataframe_name=\"log\",\n        entityset=es,\n        agg_primitives=[\"mean\"],\n        trans_primitives=[NewAbsolute, \"absolute\"],\n   
     max_depth=5,\n    )\n    features = dfs_obj.build_features()\n    assert number_of_features_with_name_like(features, \"MEAN(log.ABSOLUTE\") > 0\n    assert number_of_features_with_name_like(features, \"MEAN(log.NEW_ABSOLUTE\") == 0\n"
  },
  {
    "path": "featuretools/tests/synthesis/test_dfs_method.py",
    "content": "import warnings\nfrom unittest.mock import patch\n\nimport composeml as cp\nimport numpy as np\nimport pandas as pd\nimport pytest\nfrom packaging.version import parse\nfrom woodwork.column_schema import ColumnSchema\nfrom woodwork.logical_types import NaturalLanguage\n\nfrom featuretools.computational_backends.calculate_feature_matrix import (\n    FEATURE_CALCULATION_PERCENTAGE,\n)\nfrom featuretools.entityset import EntitySet, Timedelta\nfrom featuretools.exceptions import UnusedPrimitiveWarning\nfrom featuretools.primitives import GreaterThanScalar, Max, Mean, Min, Sum\nfrom featuretools.primitives.base import AggregationPrimitive, TransformPrimitive\nfrom featuretools.synthesis import dfs\nfrom featuretools.synthesis.deep_feature_synthesis import DeepFeatureSynthesis\n\n\n@pytest.fixture\ndef datetime_es():\n    cards_df = pd.DataFrame({\"id\": [1, 2, 3, 4, 5]})\n    transactions_df = pd.DataFrame(\n        {\n            \"id\": [1, 2, 3, 4, 5],\n            \"card_id\": [1, 1, 5, 1, 5],\n            \"transaction_time\": pd.to_datetime(\n                [\n                    \"2011-2-28 04:00\",\n                    \"2012-2-28 05:00\",\n                    \"2012-2-29 06:00\",\n                    \"2012-3-1 08:00\",\n                    \"2014-4-1 10:00\",\n                ],\n            ),\n            \"fraud\": [True, False, False, False, True],\n        },\n    )\n\n    datetime_es = EntitySet(id=\"fraud_data\")\n    datetime_es = datetime_es.add_dataframe(\n        dataframe_name=\"transactions\",\n        dataframe=transactions_df,\n        index=\"id\",\n        time_index=\"transaction_time\",\n    )\n\n    datetime_es = datetime_es.add_dataframe(\n        dataframe_name=\"cards\",\n        dataframe=cards_df,\n        index=\"id\",\n    )\n\n    datetime_es = datetime_es.add_relationship(\"cards\", \"id\", \"transactions\", \"card_id\")\n    datetime_es.add_last_time_indexes()\n    return datetime_es\n\n\ndef 
test_dfs_empty_features():\n    error_text = \"No features can be generated from the specified primitives. Please make sure the primitives you are using are compatible with the variable types in your data.\"\n    teams = pd.DataFrame({\"id\": range(3), \"name\": [\"Breakers\", \"Spirit\", \"Thorns\"]})\n    games = pd.DataFrame(\n        {\n            \"id\": range(5),\n            \"home_team_id\": [2, 2, 1, 0, 1],\n            \"away_team_id\": [1, 0, 2, 1, 0],\n            \"home_team_score\": [3, 0, 1, 0, 4],\n            \"away_team_score\": [2, 1, 2, 0, 0],\n        },\n    )\n    dataframes = {\n        \"teams\": (teams, \"id\", None, {\"name\": \"natural_language\"}),\n        \"games\": (games, \"id\"),\n    }\n    relationships = [(\"teams\", \"id\", \"games\", \"home_team_id\")]\n    with patch.object(DeepFeatureSynthesis, \"build_features\", return_value=[]):\n        features = dfs(\n            dataframes,\n            relationships,\n            target_dataframe_name=\"teams\",\n            features_only=True,\n        )\n        assert features == []\n    with (\n        pytest.raises(AssertionError, match=error_text),\n        patch.object(\n            DeepFeatureSynthesis,\n            \"build_features\",\n            return_value=[],\n        ),\n    ):\n        dfs(\n            dataframes,\n            relationships,\n            target_dataframe_name=\"teams\",\n            features_only=False,\n        )\n\n\ndef test_passing_strings_to_logical_types_dfs():\n    teams = pd.DataFrame({\"id\": range(3), \"name\": [\"Breakers\", \"Spirit\", \"Thorns\"]})\n    games = pd.DataFrame(\n        {\n            \"id\": range(5),\n            \"home_team_id\": [2, 2, 1, 0, 1],\n            \"away_team_id\": [1, 0, 2, 1, 0],\n            \"home_team_score\": [3, 0, 1, 0, 4],\n            \"away_team_score\": [2, 1, 2, 0, 0],\n        },\n    )\n    dataframes = {\n        \"teams\": (teams, \"id\", None, {\"name\": \"natural_language\"}),\n        
\"games\": (games, \"id\"),\n    }\n    relationships = [(\"teams\", \"id\", \"games\", \"home_team_id\")]\n\n    features = dfs(\n        dataframes,\n        relationships,\n        target_dataframe_name=\"teams\",\n        features_only=True,\n    )\n\n    name_logical_type = features[0].dataframe[\"name\"].ww.logical_type\n    assert isinstance(name_logical_type, NaturalLanguage)\n\n\ndef test_accepts_cutoff_time_df(dataframes, relationships):\n    cutoff_times_df = pd.DataFrame({\"instance_id\": [1, 2, 3], \"time\": [10, 12, 15]})\n    feature_matrix, features = dfs(\n        dataframes=dataframes,\n        relationships=relationships,\n        target_dataframe_name=\"transactions\",\n        cutoff_time=cutoff_times_df,\n    )\n    feature_matrix = feature_matrix\n    assert len(feature_matrix.index) == 3\n    assert len(feature_matrix.columns) == len(features)\n\n\ndef test_accepts_cutoff_time_compose(dataframes, relationships):\n    def fraud_occured(df):\n        return df[\"fraud\"].any()\n\n    kwargs = {\n        \"time_index\": \"transaction_time\",\n        \"labeling_function\": fraud_occured,\n        \"window_size\": 1,\n    }\n    if parse(cp.__version__) >= parse(\"0.10.0\"):\n        kwargs[\"target_dataframe_index\"] = \"card_id\"\n    else:\n        kwargs[\"target_dataframe_name\"] = \"card_id\"  # pragma: no cover\n\n    lm = cp.LabelMaker(**kwargs)\n\n    transactions_df = dataframes[\"transactions\"][0]\n\n    labels = lm.search(transactions_df, num_examples_per_instance=-1)\n\n    labels[\"time\"] = pd.to_numeric(labels[\"time\"])\n    labels.rename({\"card_id\": \"id\"}, axis=1, inplace=True)\n\n    feature_matrix, features = dfs(\n        dataframes=dataframes,\n        relationships=relationships,\n        target_dataframe_name=\"cards\",\n        cutoff_time=labels,\n    )\n    assert len(feature_matrix.index) == 6\n    assert len(feature_matrix.columns) == len(features) + 1\n\n\ndef test_accepts_single_cutoff_time(dataframes, 
relationships):\n    feature_matrix, features = dfs(\n        dataframes=dataframes,\n        relationships=relationships,\n        target_dataframe_name=\"transactions\",\n        cutoff_time=20,\n    )\n    assert len(feature_matrix.index) == 5\n    assert len(feature_matrix.columns) == len(features)\n\n\ndef test_accepts_no_cutoff_time(dataframes, relationships):\n    feature_matrix, features = dfs(\n        dataframes=dataframes,\n        relationships=relationships,\n        target_dataframe_name=\"transactions\",\n        instance_ids=[1, 2, 3, 5, 6],\n    )\n    assert len(feature_matrix.index) == 5\n    assert len(feature_matrix.columns) == len(features)\n\n\ndef test_ignores_instance_ids_if_cutoff_df(dataframes, relationships):\n    cutoff_times_df = pd.DataFrame({\"instance_id\": [1, 2, 3], \"time\": [10, 12, 15]})\n    instance_ids = [1, 2, 3, 4, 5]\n    feature_matrix, features = dfs(\n        dataframes=dataframes,\n        relationships=relationships,\n        target_dataframe_name=\"transactions\",\n        cutoff_time=cutoff_times_df,\n        instance_ids=instance_ids,\n    )\n    assert len(feature_matrix.index) == 3\n    assert len(feature_matrix.columns) == len(features)\n\n\ndef test_approximate_features(dataframes, relationships):\n    cutoff_times_df = pd.DataFrame(\n        {\"instance_id\": [1, 3, 1, 5, 3, 6], \"time\": [11, 16, 16, 26, 17, 22]},\n    )\n    # force column to BooleanNullable\n    dataframes[\"transactions\"] += ({\"fraud\": \"BooleanNullable\"},)\n    feature_matrix, features = dfs(\n        dataframes=dataframes,\n        relationships=relationships,\n        target_dataframe_name=\"transactions\",\n        cutoff_time=cutoff_times_df,\n        approximate=5,\n        cutoff_time_in_index=True,\n    )\n    direct_agg_feat_name = \"cards.PERCENT_TRUE(transactions.fraud)\"\n    assert len(feature_matrix.index) == 6\n    assert len(feature_matrix.columns) == len(features)\n\n    truth_values = pd.Series(data=[1.0, 0.5, 0.5, 
1.0, 0.5, 1.0])\n\n    assert (feature_matrix[direct_agg_feat_name] == truth_values.values).all()\n\n\ndef test_all_columns(dataframes, relationships):\n    cutoff_times_df = pd.DataFrame({\"instance_id\": [1, 2, 3], \"time\": [10, 12, 15]})\n    feature_matrix, features = dfs(\n        dataframes=dataframes,\n        relationships=relationships,\n        target_dataframe_name=\"transactions\",\n        cutoff_time=cutoff_times_df,\n        agg_primitives=[Max, Mean, Min, Sum],\n        trans_primitives=[],\n        groupby_trans_primitives=[\"cum_sum\"],\n        max_depth=3,\n        allowed_paths=None,\n        ignore_dataframes=None,\n        ignore_columns=None,\n        seed_features=None,\n    )\n    assert len(feature_matrix.index) == 3\n    assert len(feature_matrix.columns) == len(features)\n\n\ndef test_features_only(dataframes, relationships):\n    if len(dataframes[\"transactions\"]) > 3:\n        dataframes[\"transactions\"][3][\"fraud\"] = \"BooleanNullable\"\n    else:\n        dataframes[\"transactions\"] += ({\"fraud\": \"BooleanNullable\"},)\n    features = dfs(\n        dataframes=dataframes,\n        relationships=relationships,\n        target_dataframe_name=\"transactions\",\n        features_only=True,\n    )\n\n    expected_features = 11\n    assert len(features) == expected_features\n\n\ndef test_accepts_relative_training_window(datetime_es):\n    feature_matrix, _ = dfs(entityset=datetime_es, target_dataframe_name=\"transactions\")\n\n    feature_matrix_2, _ = dfs(\n        entityset=datetime_es,\n        target_dataframe_name=\"transactions\",\n        cutoff_time=pd.Timestamp(\"2012-4-1 04:00\"),\n    )\n\n    feature_matrix_3, _ = dfs(\n        entityset=datetime_es,\n        target_dataframe_name=\"transactions\",\n        cutoff_time=pd.Timestamp(\"2012-4-1 04:00\"),\n        training_window=Timedelta(\"3 months\"),\n    )\n\n    feature_matrix_4, _ = dfs(\n        entityset=datetime_es,\n        
target_dataframe_name=\"transactions\",\n        cutoff_time=pd.Timestamp(\"2012-4-1 04:00\"),\n        training_window=\"3 months\",\n    )\n\n    assert (feature_matrix.index == [1, 2, 3, 4, 5]).all()\n    assert (feature_matrix_2.index == [1, 2, 3, 4]).all()\n    assert (feature_matrix_3.index == [2, 3, 4]).all()\n    assert (feature_matrix_4.index == [2, 3, 4]).all()\n\n    # Test case for leap years\n    feature_matrix_5, _ = dfs(\n        entityset=datetime_es,\n        target_dataframe_name=\"transactions\",\n        cutoff_time=pd.Timestamp(\"2012-2-29 04:00\"),\n        training_window=Timedelta(\"1 year\"),\n        include_cutoff_time=True,\n    )\n    assert (feature_matrix_5.index == [2]).all()\n\n    feature_matrix_5, _ = dfs(\n        entityset=datetime_es,\n        target_dataframe_name=\"transactions\",\n        cutoff_time=pd.Timestamp(\"2012-2-29 04:00\"),\n        training_window=Timedelta(\"1 year\"),\n        include_cutoff_time=False,\n    )\n    assert (feature_matrix_5.index == [1, 2]).all()\n\n\ndef test_accepts_pd_timedelta_training_window(datetime_es):\n    feature_matrix, _ = dfs(\n        entityset=datetime_es,\n        target_dataframe_name=\"transactions\",\n        cutoff_time=pd.Timestamp(\"2012-3-31 04:00\"),\n        training_window=pd.Timedelta(61, \"D\"),\n    )\n\n    assert (feature_matrix.index == [2, 3, 4]).all()\n\n\ndef test_accepts_pd_dateoffset_training_window(datetime_es):\n    feature_matrix, _ = dfs(\n        entityset=datetime_es,\n        target_dataframe_name=\"transactions\",\n        cutoff_time=pd.Timestamp(\"2012-3-31 04:00\"),\n        training_window=pd.DateOffset(months=2),\n    )\n\n    feature_matrix_2, _ = dfs(\n        entityset=datetime_es,\n        target_dataframe_name=\"transactions\",\n        cutoff_time=pd.Timestamp(\"2012-3-31 04:00\"),\n        training_window=pd.offsets.BDay(44),\n    )\n\n    assert (feature_matrix.index == [2, 3, 4]).all()\n    assert (feature_matrix.index == 
feature_matrix_2.index).all()\n\n\ndef test_accepts_datetime_and_string_offset(datetime_es):\n    feature_matrix, _ = dfs(\n        entityset=datetime_es,\n        target_dataframe_name=\"transactions\",\n        cutoff_time=pd.to_datetime(\"2012-3-31 04:00\"),\n        training_window=pd.DateOffset(months=2),\n    )\n\n    feature_matrix_2, _ = dfs(\n        entityset=datetime_es,\n        target_dataframe_name=\"transactions\",\n        cutoff_time=\"2012-3-31 04:00\",\n        training_window=pd.offsets.BDay(44),\n    )\n\n    assert (feature_matrix.index == [2, 3, 4]).all()\n    assert (feature_matrix.index == feature_matrix_2.index).all()\n\n\ndef test_handles_pandas_parser_error(datetime_es):\n    with pytest.raises(ValueError):\n        _, _ = dfs(\n            entityset=datetime_es,\n            target_dataframe_name=\"transactions\",\n            cutoff_time=\"2--012-----3-----31 04:00\",\n            training_window=pd.DateOffset(months=2),\n        )\n\n\ndef test_handles_pandas_overflow_error(datetime_es):\n    # pandas 1.5.0 raises ValueError, older versions raised OverflowError\n    with pytest.raises((OverflowError, ValueError)):\n        _, _ = dfs(\n            entityset=datetime_es,\n            target_dataframe_name=\"transactions\",\n            cutoff_time=\"200000000000000000000000000000000000000000000000000000000000000000-3-31 04:00\",\n            training_window=pd.DateOffset(months=2),\n        )\n\n\ndef test_warns_with_unused_primitives(es):\n    trans_primitives = [\"num_characters\", \"num_words\", \"add_numeric\"]\n    agg_primitives = [Max, \"min\"]\n\n    warning_text = (\n        \"Some specified primitives were not used during DFS:\\n\"\n        + \"  trans_primitives: ['add_numeric']\\n  agg_primitives: ['max', 'min']\\n\"\n        + \"This may be caused by a using a value of max_depth that is too small, not setting interesting values, \"\n        + \"or it may indicate no compatible columns for the primitive were found in the 
data. If the DFS call \"\n        + \"contained multiple instances of a primitive in the list above, none of them were used.\"\n    )\n\n    with pytest.warns(UnusedPrimitiveWarning) as record:\n        dfs(\n            entityset=es,\n            target_dataframe_name=\"customers\",\n            trans_primitives=trans_primitives,\n            agg_primitives=agg_primitives,\n            max_depth=1,\n            features_only=True,\n        )\n\n    assert record[0].message.args[0] == warning_text\n\n    # Should not raise a warning\n    with warnings.catch_warnings():\n        warnings.simplefilter(\"error\")\n        dfs(\n            entityset=es,\n            target_dataframe_name=\"customers\",\n            trans_primitives=trans_primitives,\n            agg_primitives=agg_primitives,\n            max_depth=2,\n            features_only=True,\n        )\n\n\ndef test_no_warns_with_camel_and_title_case(es):\n    for trans_primitive in [\"isNull\", \"IsNull\"]:\n        # Should not raise an UnusedPrimitiveWarning\n        with warnings.catch_warnings():\n            warnings.simplefilter(\"error\")\n            dfs(\n                entityset=es,\n                target_dataframe_name=\"customers\",\n                trans_primitives=[trans_primitive],\n                max_depth=1,\n                features_only=True,\n            )\n\n    for agg_primitive in [\"numUnique\", \"NumUnique\"]:\n        # Should not raise an UnusedPrimitiveWarning\n        with warnings.catch_warnings():\n            warnings.simplefilter(\"error\")\n            dfs(\n                entityset=es,\n                target_dataframe_name=\"customers\",\n                agg_primitives=[agg_primitive],\n                max_depth=2,\n                features_only=True,\n            )\n\n\ndef test_does_not_warn_with_stacking_feature(es):\n    with warnings.catch_warnings():\n        warnings.simplefilter(\"error\")\n        dfs(\n            entityset=es,\n            
target_dataframe_name=\"régions\",\n            agg_primitives=[\"percent_true\"],\n            trans_primitives=[GreaterThanScalar(5)],\n            primitive_options={\n                \"greater_than_scalar\": {\"include_dataframes\": [\"stores\"]},\n            },\n            features_only=True,\n        )\n\n\ndef test_warns_with_unused_where_primitives(es):\n    warning_text = (\n        \"Some specified primitives were not used during DFS:\\n\"\n        + \"  where_primitives: ['count', 'sum']\\n\"\n        + \"This may be caused by a using a value of max_depth that is too small, not setting interesting values, \"\n        + \"or it may indicate no compatible columns for the primitive were found in the data. If the DFS call \"\n        + \"contained multiple instances of a primitive in the list above, none of them were used.\"\n    )\n\n    with pytest.warns(UnusedPrimitiveWarning) as record:\n        dfs(\n            entityset=es,\n            target_dataframe_name=\"customers\",\n            agg_primitives=[\"count\"],\n            where_primitives=[\"sum\", \"count\"],\n            max_depth=1,\n            features_only=True,\n        )\n\n    assert record[0].message.args[0] == warning_text\n\n\ndef test_warns_with_unused_groupby_primitives(es):\n    warning_text = (\n        \"Some specified primitives were not used during DFS:\\n\"\n        + \"  groupby_trans_primitives: ['cum_sum']\\n\"\n        + \"This may be caused by a using a value of max_depth that is too small, not setting interesting values, \"\n        + \"or it may indicate no compatible columns for the primitive were found in the data. 
If the DFS call \"\n        + \"contained multiple instances of a primitive in the list above, none of them were used.\"\n    )\n\n    with pytest.warns(UnusedPrimitiveWarning) as record:\n        dfs(\n            entityset=es,\n            target_dataframe_name=\"sessions\",\n            groupby_trans_primitives=[\"cum_sum\"],\n            max_depth=1,\n            features_only=True,\n        )\n\n    assert record[0].message.args[0] == warning_text\n\n    # Should not raise a warning\n    with warnings.catch_warnings():\n        warnings.simplefilter(\"error\")\n        dfs(\n            entityset=es,\n            target_dataframe_name=\"customers\",\n            groupby_trans_primitives=[\"cum_sum\"],\n            max_depth=1,\n            features_only=True,\n        )\n\n\ndef test_warns_with_unused_custom_primitives(es):\n    class AboveTen(TransformPrimitive):\n        name = \"above_ten\"\n        input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n        return_type = ColumnSchema(semantic_tags={\"numeric\"})\n\n    trans_primitives = [AboveTen]\n\n    warning_text = (\n        \"Some specified primitives were not used during DFS:\\n\"\n        + \"  trans_primitives: ['above_ten']\\n\"\n        + \"This may be caused by a using a value of max_depth that is too small, not setting interesting values, \"\n        + \"or it may indicate no compatible columns for the primitive were found in the data. 
If the DFS call \"\n        + \"contained multiple instances of a primitive in the list above, none of them were used.\"\n    )\n\n    with pytest.warns(UnusedPrimitiveWarning) as record:\n        dfs(\n            entityset=es,\n            target_dataframe_name=\"sessions\",\n            trans_primitives=trans_primitives,\n            max_depth=1,\n            features_only=True,\n        )\n\n    assert record[0].message.args[0] == warning_text\n\n    # Should not raise a warning\n    with warnings.catch_warnings():\n        warnings.simplefilter(\"error\")\n        dfs(\n            entityset=es,\n            target_dataframe_name=\"customers\",\n            trans_primitives=trans_primitives,\n            max_depth=1,\n            features_only=True,\n        )\n\n    class MaxAboveTen(AggregationPrimitive):\n        name = \"max_above_ten\"\n        input_types = [ColumnSchema(semantic_tags={\"numeric\"})]\n        return_type = ColumnSchema(semantic_tags={\"numeric\"})\n\n    agg_primitives = [MaxAboveTen]\n\n    warning_text = (\n        \"Some specified primitives were not used during DFS:\\n\"\n        + \"  agg_primitives: ['max_above_ten']\\n\"\n        + \"This may be caused by a using a value of max_depth that is too small, not setting interesting values, \"\n        + \"or it may indicate no compatible columns for the primitive were found in the data. 
If the DFS call \"\n        + \"contained multiple instances of a primitive in the list above, none of them were used.\"\n    )\n\n    with pytest.warns(UnusedPrimitiveWarning) as record:\n        dfs(\n            entityset=es,\n            target_dataframe_name=\"stores\",\n            agg_primitives=agg_primitives,\n            max_depth=1,\n            features_only=True,\n        )\n\n    assert record[0].message.args[0] == warning_text\n\n    # Should not raise a warning\n    with warnings.catch_warnings():\n        warnings.simplefilter(\"error\")\n        dfs(\n            entityset=es,\n            target_dataframe_name=\"sessions\",\n            agg_primitives=agg_primitives,\n            max_depth=1,\n            features_only=True,\n        )\n\n\ndef test_calls_progress_callback(dataframes, relationships):\n    class MockProgressCallback:\n        def __init__(self):\n            self.progress_history = []\n            self.total_update = 0\n            self.total_progress_percent = 0\n\n        def __call__(self, update, progress_percent, time_elapsed):\n            self.total_update += update\n            self.total_progress_percent = progress_percent\n            self.progress_history.append(progress_percent)\n\n    mock_progress_callback = MockProgressCallback()\n\n    dfs(\n        dataframes=dataframes,\n        relationships=relationships,\n        target_dataframe_name=\"transactions\",\n        progress_callback=mock_progress_callback,\n    )\n\n    # second to last entry is the last update from feature calculation\n    assert np.isclose(\n        mock_progress_callback.progress_history[-2],\n        FEATURE_CALCULATION_PERCENTAGE * 100,\n    )\n    assert np.isclose(mock_progress_callback.total_update, 100.0)\n    assert np.isclose(mock_progress_callback.total_progress_percent, 100.0)\n\n\ndef test_calls_progress_callback_cluster(dataframes, relationships, dask_cluster):\n    class MockProgressCallback:\n        def __init__(self):\n          
  self.progress_history = []\n            self.total_update = 0\n            self.total_progress_percent = 0\n\n        def __call__(self, update, progress_percent, time_elapsed):\n            self.total_update += update\n            self.total_progress_percent = progress_percent\n            self.progress_history.append(progress_percent)\n\n    mock_progress_callback = MockProgressCallback()\n\n    dkwargs = {\"cluster\": dask_cluster.scheduler.address}\n    dfs(\n        dataframes=dataframes,\n        relationships=relationships,\n        target_dataframe_name=\"transactions\",\n        progress_callback=mock_progress_callback,\n        dask_kwargs=dkwargs,\n    )\n\n    assert np.isclose(mock_progress_callback.total_update, 100.0)\n    assert np.isclose(mock_progress_callback.total_progress_percent, 100.0)\n\n\ndef test_dask_kwargs(dataframes, relationships, dask_cluster):\n    cutoff_times_df = pd.DataFrame({\"instance_id\": [1, 2, 3], \"time\": [10, 12, 15]})\n    feature_matrix, features = dfs(\n        dataframes=dataframes,\n        relationships=relationships,\n        target_dataframe_name=\"transactions\",\n        cutoff_time=cutoff_times_df,\n    )\n\n    dask_kwargs = {\"cluster\": dask_cluster.scheduler.address}\n    feature_matrix_2, features_2 = dfs(\n        dataframes=dataframes,\n        relationships=relationships,\n        target_dataframe_name=\"transactions\",\n        cutoff_time=cutoff_times_df,\n        dask_kwargs=dask_kwargs,\n    )\n\n    assert all(\n        f1.unique_name() == f2.unique_name() for f1, f2 in zip(features, features_2)\n    )\n    for column in feature_matrix:\n        for x, y in zip(feature_matrix[column], feature_matrix_2[column]):\n            assert (pd.isnull(x) and pd.isnull(y)) or (x == y)\n"
  },
  {
    "path": "featuretools/tests/synthesis/test_encode_features.py",
    "content": "import pandas as pd\nimport pytest\n\nfrom featuretools import EntitySet, calculate_feature_matrix, dfs\nfrom featuretools.feature_base import Feature, IdentityFeature\nfrom featuretools.primitives import NMostCommon\nfrom featuretools.synthesis import encode_features\n\n\ndef test_encodes_features(es):\n    f1 = IdentityFeature(es[\"log\"].ww[\"product_id\"])\n    f2 = IdentityFeature(es[\"log\"].ww[\"purchased\"])\n    f3 = IdentityFeature(es[\"log\"].ww[\"value\"])\n\n    features = [f1, f2, f3]\n    feature_matrix = calculate_feature_matrix(\n        features,\n        es,\n        instance_ids=[0, 1, 2, 3, 4, 5],\n    )\n\n    _, features_encoded = encode_features(feature_matrix, features)\n    assert len(features_encoded) == 6\n\n    _, features_encoded = encode_features(feature_matrix, features, top_n=2)\n    assert len(features_encoded) == 5\n\n    _, features_encoded = encode_features(\n        feature_matrix,\n        features,\n        include_unknown=False,\n    )\n    assert len(features_encoded) == 5\n\n\ndef test_inplace_encodes_features(es):\n    f1 = IdentityFeature(es[\"log\"].ww[\"product_id\"])\n\n    features = [f1]\n    feature_matrix = calculate_feature_matrix(\n        features,\n        es,\n        instance_ids=[0, 1, 2, 3, 4, 5],\n    )\n\n    feature_matrix_shape = feature_matrix.shape\n    feature_matrix_encoded, _ = encode_features(feature_matrix, features)\n    assert feature_matrix_encoded.shape != feature_matrix_shape\n    assert feature_matrix.shape == feature_matrix_shape\n\n    # inplace they should be the same\n    feature_matrix_encoded, _ = encode_features(feature_matrix, features, inplace=True)\n    assert feature_matrix_encoded.shape == feature_matrix.shape\n\n\ndef test_to_encode_features(es):\n    f1 = IdentityFeature(es[\"log\"].ww[\"product_id\"])\n    f2 = IdentityFeature(es[\"log\"].ww[\"value\"])\n    f3 = IdentityFeature(es[\"log\"].ww[\"datetime\"])\n\n    features = [f1, f2, f3]\n    feature_matrix 
= calculate_feature_matrix(\n        features,\n        es,\n        instance_ids=[0, 1, 2, 3, 4, 5],\n    )\n\n    feature_matrix_encoded, _ = encode_features(feature_matrix, features)\n    feature_matrix_encoded_shape = feature_matrix_encoded.shape\n\n    # to_encode should keep product_id as a string and datetime as a date,\n    # and not have the same shape as previous encoded matrix due to fewer encoded features\n    to_encode = []\n    feature_matrix_encoded, _ = encode_features(\n        feature_matrix,\n        features,\n        to_encode=to_encode,\n    )\n    assert feature_matrix_encoded_shape != feature_matrix_encoded.shape\n    assert feature_matrix_encoded[\"datetime\"].dtype == \"datetime64[ns]\"\n    assert feature_matrix_encoded[\"product_id\"].dtype == \"category\"\n\n    to_encode = [\"value\"]\n    feature_matrix_encoded, _ = encode_features(\n        feature_matrix,\n        features,\n        to_encode=to_encode,\n    )\n    assert feature_matrix_encoded_shape != feature_matrix_encoded.shape\n    assert feature_matrix_encoded[\"datetime\"].dtype == \"datetime64[ns]\"\n    assert feature_matrix_encoded[\"product_id\"].dtype == \"category\"\n\n\ndef test_encode_features_handles_pass_columns(es):\n    f1 = IdentityFeature(es[\"log\"].ww[\"product_id\"])\n    f2 = IdentityFeature(es[\"log\"].ww[\"value\"])\n\n    features = [f1, f2]\n    cutoff_time = pd.DataFrame(\n        {\n            \"instance_id\": range(6),\n            \"time\": es[\"log\"][\"datetime\"][0:6],\n            \"label\": [i % 2 for i in range(6)],\n        },\n        columns=[\"instance_id\", \"time\", \"label\"],\n    )\n    feature_matrix = calculate_feature_matrix(features, es, cutoff_time)\n\n    assert \"label\" in feature_matrix.columns\n\n    feature_matrix_encoded, _ = encode_features(feature_matrix, features)\n    feature_matrix_encoded_shape = feature_matrix_encoded.shape\n\n    # to_encode should keep product_id as a string, and not create 3 additional columns\n  
  to_encode = []\n    feature_matrix_encoded, _ = encode_features(\n        feature_matrix,\n        features,\n        to_encode=to_encode,\n    )\n    assert feature_matrix_encoded_shape != feature_matrix_encoded.shape\n\n    to_encode = [\"value\"]\n    feature_matrix_encoded, _ = encode_features(\n        feature_matrix,\n        features,\n        to_encode=to_encode,\n    )\n    assert feature_matrix_encoded_shape != feature_matrix_encoded.shape\n\n    assert \"label\" in feature_matrix_encoded.columns\n\n\ndef test_encode_features_catches_features_mismatch(es):\n    f1 = IdentityFeature(es[\"log\"].ww[\"product_id\"])\n    f2 = IdentityFeature(es[\"log\"].ww[\"value\"])\n    f3 = IdentityFeature(es[\"log\"].ww[\"session_id\"])\n\n    features = [f1, f2]\n    cutoff_time = pd.DataFrame(\n        {\n            \"instance_id\": range(6),\n            \"time\": es[\"log\"][\"datetime\"][0:6],\n            \"label\": [i % 2 for i in range(6)],\n        },\n        columns=[\"instance_id\", \"time\", \"label\"],\n    )\n    feature_matrix = calculate_feature_matrix(features, es, cutoff_time)\n\n    assert \"label\" in feature_matrix.columns\n\n    error_text = \"Feature session_id not found in feature matrix\"\n    with pytest.raises(AssertionError, match=error_text):\n        encode_features(feature_matrix, [f1, f3])\n\n\ndef test_encode_unknown_features():\n    # Dataframe with categorical column with \"unknown\" string\n    df = pd.DataFrame({\"category\": [\"unknown\", \"b\", \"c\", \"d\", \"e\"]}).astype(\n        {\"category\": \"category\"},\n    )\n\n    es = EntitySet(\"test\")\n    es.add_dataframe(\n        dataframe_name=\"a\",\n        dataframe=df,\n        index=\"index\",\n        make_index=True,\n    )\n    features, feature_defs = dfs(\n        entityset=es,\n        target_dataframe_name=\"a\",\n        max_depth=1,\n    )\n\n    # Specify unknown token for replacement\n    features_enc, _ = encode_features(features, feature_defs, 
include_unknown=True)\n    assert list(features_enc.columns) == [\n        \"category = unknown\",\n        \"category = e\",\n        \"category = d\",\n        \"category = c\",\n        \"category = b\",\n        \"category is unknown\",\n    ]\n\n\ndef test_encode_features_topn(es):\n    topn = Feature(\n        Feature(es[\"log\"].ww[\"product_id\"]),\n        parent_dataframe_name=\"customers\",\n        primitive=NMostCommon(n=3),\n    )\n    features, feature_defs = dfs(\n        entityset=es,\n        instance_ids=[0, 1, 2],\n        target_dataframe_name=\"customers\",\n        agg_primitives=[NMostCommon(n=3)],\n    )\n    features_enc, feature_defs_enc = encode_features(\n        features,\n        feature_defs,\n        include_unknown=True,\n    )\n    assert topn.unique_name() in [feat.unique_name() for feat in feature_defs_enc]\n    for name in topn.get_feature_names():\n        assert name in features_enc.columns\n        assert features_enc.columns.tolist().count(name) == 1\n\n\ndef test_encode_features_drop_first():\n    df = pd.DataFrame({\"category\": [\"ao\", \"b\", \"c\", \"d\", \"e\"]}).astype(\n        {\"category\": \"category\"},\n    )\n    es = EntitySet(\"test\")\n    es.add_dataframe(\n        dataframe_name=\"a\",\n        dataframe=df,\n        index=\"index\",\n        make_index=True,\n    )\n    features, feature_defs = dfs(\n        entityset=es,\n        target_dataframe_name=\"a\",\n        max_depth=1,\n    )\n    features_enc, _ = encode_features(\n        features,\n        feature_defs,\n        drop_first=True,\n        include_unknown=False,\n    )\n    assert len(features_enc.columns) == 4\n\n    features_enc, feature_defs = encode_features(\n        features,\n        feature_defs,\n        top_n=3,\n        drop_first=True,\n        include_unknown=False,\n    )\n\n    assert len(features_enc.columns) == 2\n\n\ndef test_encode_features_handles_dictionary_input(es):\n    f1 = 
IdentityFeature(es[\"log\"].ww[\"product_id\"])\n    f2 = IdentityFeature(es[\"log\"].ww[\"purchased\"])\n    f3 = IdentityFeature(es[\"log\"].ww[\"session_id\"])\n\n    features = [f1, f2, f3]\n    feature_matrix = calculate_feature_matrix(features, es, instance_ids=range(16))\n    feature_matrix_encoded, features_encoded = encode_features(feature_matrix, features)\n    true_values = [\n        \"product_id = coke zero\",\n        \"product_id = toothpaste\",\n        \"product_id = car\",\n        \"product_id = brown bag\",\n        \"product_id = taco clock\",\n        \"product_id = Haribo sugar-free gummy bears\",\n        \"product_id is unknown\",\n        \"purchased\",\n        \"session_id = 0\",\n        \"session_id = 1\",\n        \"session_id = 4\",\n        \"session_id = 3\",\n        \"session_id = 5\",\n        \"session_id = 2\",\n        \"session_id is unknown\",\n    ]\n    assert len(features_encoded) == 15\n    for col in true_values:\n        assert col in list(feature_matrix_encoded.columns)\n\n    top_n_dict = {}\n    feature_matrix_encoded, features_encoded = encode_features(\n        feature_matrix,\n        features,\n        top_n=top_n_dict,\n    )\n    assert len(features_encoded) == 15\n    for col in true_values:\n        assert col in list(feature_matrix_encoded.columns)\n\n    top_n_dict = {f1.get_name(): 4, f3.get_name(): 3}\n    feature_matrix_encoded, features_encoded = encode_features(\n        feature_matrix,\n        features,\n        top_n=top_n_dict,\n    )\n    assert len(features_encoded) == 10\n    true_values = [\n        \"product_id = coke zero\",\n        \"product_id = toothpaste\",\n        \"product_id = car\",\n        \"product_id = brown bag\",\n        \"product_id is unknown\",\n        \"purchased\",\n        \"session_id = 0\",\n        \"session_id = 1\",\n        \"session_id = 4\",\n        \"session_id is unknown\",\n    ]\n    for col in true_values:\n        assert col in 
list(feature_matrix_encoded.columns)\n\n    feature_matrix_encoded, features_encoded = encode_features(\n        feature_matrix,\n        features,\n        top_n=top_n_dict,\n        include_unknown=False,\n    )\n    true_values = [\n        \"product_id = coke zero\",\n        \"product_id = toothpaste\",\n        \"product_id = car\",\n        \"product_id = brown bag\",\n        \"purchased\",\n        \"session_id = 0\",\n        \"session_id = 1\",\n        \"session_id = 4\",\n    ]\n    assert len(features_encoded) == 8\n    for col in true_values:\n        assert col in list(feature_matrix_encoded.columns)\n\n\ndef test_encode_features_matches_calculate_feature_matrix():\n    df = pd.DataFrame({\"category\": [\"b\", \"c\", \"d\", \"e\"]}).astype(\n        {\"category\": \"category\"},\n    )\n\n    es = EntitySet(\"test\")\n    es.add_dataframe(\n        dataframe_name=\"a\",\n        dataframe=df,\n        index=\"index\",\n        make_index=True,\n    )\n    features, feature_defs = dfs(\n        entityset=es,\n        target_dataframe_name=\"a\",\n        max_depth=1,\n    )\n\n    features_enc, feature_defs_enc = encode_features(\n        features,\n        feature_defs,\n        to_encode=[\"category\"],\n    )\n\n    features_calc = calculate_feature_matrix(feature_defs_enc, entityset=es)\n\n    pd.testing.assert_frame_equal(features_enc, features_calc)\n    assert features_calc.ww._schema == features_enc.ww._schema\n"
  },
  {
    "path": "featuretools/tests/synthesis/test_get_valid_primitives.py",
    "content": "import pytest\nfrom woodwork.column_schema import ColumnSchema\n\nfrom featuretools.primitives import (\n    AggregationPrimitive,\n    Count,\n    Hour,\n    IsIn,\n    Not,\n    TimeSincePrevious,\n    TransformPrimitive,\n)\nfrom featuretools.synthesis.get_valid_primitives import get_valid_primitives\n\n\ndef test_get_valid_primitives_selected_primitives(es):\n    agg_prims, trans_prims = get_valid_primitives(\n        es,\n        \"log\",\n        selected_primitives=[Hour, Count],\n    )\n    assert set(agg_prims) == set([Count])\n    assert set(trans_prims) == set([Hour])\n\n    agg_prims, trans_prims = get_valid_primitives(\n        es,\n        \"products\",\n        selected_primitives=[Hour],\n        max_depth=1,\n    )\n    assert set(agg_prims) == set()\n    assert set(trans_prims) == set()\n\n\ndef test_get_valid_primitives_selected_primitives_strings(es):\n    agg_prims, trans_prims = get_valid_primitives(\n        es,\n        \"log\",\n        selected_primitives=[\"hour\", \"count\"],\n    )\n    assert set(agg_prims) == set([Count])\n    assert set(trans_prims) == set([Hour])\n\n    agg_prims, trans_prims = get_valid_primitives(\n        es,\n        \"products\",\n        selected_primitives=[\"hour\"],\n        max_depth=1,\n    )\n    assert set(agg_prims) == set()\n    assert set(trans_prims) == set()\n\n\ndef test_invalid_primitive(es):\n    with pytest.raises(ValueError, match=\"'foobar' is not a recognized primitive name\"):\n        get_valid_primitives(\n            es,\n            target_dataframe_name=\"log\",\n            selected_primitives=[\"foobar\"],\n        )\n\n    msg = (\n        \"Selected primitive <class 'woodwork.column_schema.ColumnSchema'> \"\n        \"is not an AggregationPrimitive, TransformPrimitive, or str\"\n    )\n    with pytest.raises(ValueError, match=msg):\n        get_valid_primitives(\n            es,\n            target_dataframe_name=\"log\",\n            
selected_primitives=[ColumnSchema],\n        )\n\n\ndef test_primitive_compatibility(es):\n    _, trans_prims = get_valid_primitives(\n        es,\n        \"customers\",\n        selected_primitives=[TimeSincePrevious],\n    )\n    assert len(trans_prims) == 1\n\n\ndef test_get_valid_primitives_custom_primitives(es):\n    class ThreeMostCommonCat(AggregationPrimitive):\n        name = \"n_most_common_categorical\"\n        input_types = [ColumnSchema(semantic_tags={\"category\"})]\n        return_type = ColumnSchema(semantic_tags={\"category\"})\n        number_output_features = 3\n\n    class AddThree(TransformPrimitive):\n        name = \"add_three\"\n        input_types = [\n            ColumnSchema(semantic_tags=\"numeric\"),\n            ColumnSchema(semantic_tags=\"numeric\"),\n            ColumnSchema(semantic_tags=\"numeric\"),\n        ]\n        return_type = ColumnSchema(semantic_tags=\"numeric\")\n        commutative = True\n\n    agg_prims, trans_prims = get_valid_primitives(es, \"log\")\n    assert ThreeMostCommonCat not in agg_prims\n    assert AddThree not in trans_prims\n\n    with pytest.raises(\n        ValueError,\n        match=\"'add_three' is not a recognized primitive name\",\n    ):\n        agg_prims, trans_prims = get_valid_primitives(\n            es,\n            \"log\",\n            2,\n            [ThreeMostCommonCat, \"add_three\"],\n        )\n\n\ndef test_get_valid_primitives_all_primitives(es):\n    agg_prims, trans_prims = get_valid_primitives(es, \"customers\")\n    assert Count in agg_prims\n    assert Hour in trans_prims\n\n\ndef test_get_valid_primitives_single_table(transform_es):\n    msg = \"Only one dataframe in entityset, changing max_depth to 1 since deeper features cannot be created\"\n    with pytest.warns(UserWarning, match=msg):\n        agg_prims, trans_prims = get_valid_primitives(transform_es, \"first\")\n\n    assert set(agg_prims) == set()\n    assert IsIn in trans_prims\n\n\ndef 
test_get_valid_primitives_with_dfs_kwargs(es):\n    agg_prims, trans_prims = get_valid_primitives(\n        es,\n        \"customers\",\n        selected_primitives=[Hour, Count, Not],\n    )\n    assert set(agg_prims) == set([Count])\n    assert set(trans_prims) == set([Hour, Not])\n\n    # Can use other dfs parameters and they get applied\n    agg_prims, trans_prims = get_valid_primitives(\n        es,\n        \"customers\",\n        selected_primitives=[Hour, Count, Not],\n        ignore_columns={\"customers\": [\"loves_ice_cream\"]},\n    )\n    assert set(agg_prims) == set([Count])\n    assert set(trans_prims) == set([Hour])\n\n    agg_prims, trans_prims = get_valid_primitives(\n        es,\n        \"products\",\n        selected_primitives=[Hour, Count],\n        ignore_dataframes=[\"log\"],\n    )\n    assert set(agg_prims) == set()\n    assert set(trans_prims) == set()\n"
  },
  {
    "path": "featuretools/tests/test_version.py",
    "content": "from featuretools import __version__\n\n\ndef test_version():\n    assert __version__ == \"1.31.0\"\n"
  },
  {
    "path": "featuretools/tests/testing_utils/__init__.py",
    "content": "# flake8: noqa\nfrom featuretools.tests.testing_utils.cluster import (\n    MockClient,\n    mock_cluster,\n    get_mock_client_cluster,\n)\nfrom featuretools.tests.testing_utils.es_utils import get_df_tags\nfrom featuretools.tests.testing_utils.features import (\n    feature_with_name,\n    number_of_features_with_name_like,\n    backward_path,\n    forward_path,\n    check_rename,\n    check_names,\n)\nfrom featuretools.tests.testing_utils.mock_ds import make_ecommerce_entityset\n"
  },
  {
    "path": "featuretools/tests/testing_utils/cluster.py",
    "content": "from psutil import virtual_memory\n\n\ndef mock_cluster(\n    n_workers=1,\n    threads_per_worker=1,\n    diagnostics_port=8787,\n    memory_limit=None,\n    **dask_kwarg,\n):\n    return (n_workers, threads_per_worker, diagnostics_port, memory_limit)\n\n\nclass MockClient:\n    def __init__(self, cluster):\n        self.cluster = cluster\n\n    def scheduler_info(self):\n        return {\"workers\": {\"worker 1\": {\"memory_limit\": virtual_memory().total}}}\n\n\ndef get_mock_client_cluster():\n    return MockClient, mock_cluster\n"
  },
  {
    "path": "featuretools/tests/testing_utils/es_utils.py",
    "content": "def get_df_tags(df):\n    \"\"\"Gets a DataFrame's semantic tags without index or time index tags for Woodwork init\"\"\"\n    semantic_tags = {}\n    for col_name in df.columns:\n        semantic_tags[col_name] = df.ww.semantic_tags[col_name] - {\n            \"time_index\",\n            \"index\",\n        }\n\n    return semantic_tags\n"
  },
  {
    "path": "featuretools/tests/testing_utils/features.py",
    "content": "import re\n\nfrom featuretools.entityset.relationship import RelationshipPath\n\n\ndef feature_with_name(features, name):\n    for f in features:\n        if f.get_name() == name:\n            return True\n    return False\n\n\ndef number_of_features_with_name_like(features, pattern):\n    \"\"\"Returns number of features with names that match the provided regex pattern\"\"\"\n    pattern = re.compile(re.escape(pattern))\n    names = [f.get_name() for f in features]\n    return len([name for name in names if pattern.search(name)])\n\n\ndef backward_path(es, dataframe_ids):\n    \"\"\"\n    Create a backward RelationshipPath through the given dataframes. Assumes only\n    one such path is possible.\n    \"\"\"\n\n    def _get_relationship(child, parent):\n        return next(\n            r\n            for r in es.get_forward_relationships(child)\n            if r._parent_dataframe_name == parent\n        )\n\n    relationships = [\n        _get_relationship(child, parent)\n        for parent, child in zip(dataframe_ids[:-1], dataframe_ids[1:])\n    ]\n\n    return RelationshipPath([(False, r) for r in relationships])\n\n\ndef forward_path(es, dataframe_ids):\n    \"\"\"\n    Create a forward RelationshipPath through the given dataframes. 
Assumes only\n    one such path is possible.\n    \"\"\"\n\n    def _get_relationship(child, parent):\n        return next(\n            r\n            for r in es.get_forward_relationships(child)\n            if r._parent_dataframe_name == parent\n        )\n\n    relationships = [\n        _get_relationship(child, parent)\n        for child, parent in zip(dataframe_ids[:-1], dataframe_ids[1:])\n    ]\n\n    return RelationshipPath([(True, r) for r in relationships])\n\n\ndef check_rename(feat, new_name, new_names):\n    copy_feat = feat.rename(new_name)\n    assert feat.unique_name() != copy_feat.unique_name()\n    assert feat.get_name() != copy_feat.get_name()\n    assert (\n        feat.base_features[0].generate_name()\n        == copy_feat.base_features[0].generate_name()\n    )\n    assert feat.dataframe_name == copy_feat.dataframe_name\n    assert feat.get_feature_names() != copy_feat.get_feature_names()\n    check_names(copy_feat, new_name, new_names)\n\n\ndef check_names(feat, new_name, new_names):\n    assert feat.get_name() == new_name\n    assert feat.get_feature_names() == new_names\n"
  },
  {
    "path": "featuretools/tests/testing_utils/generate_fake_dataframe.py",
    "content": "import random\nfrom datetime import datetime as dt\n\nimport pandas as pd\nimport woodwork.type_sys.type_system as ww_type_system\nfrom woodwork import logical_types\n\nfrom featuretools.feature_discovery.utils import flatten_list\n\nlogical_type_mapping = {\n    logical_types.Boolean.__name__: [True, False],\n    logical_types.BooleanNullable.__name__: [True, False, pd.NA],\n    logical_types.Categorical.__name__: [\"A\", \"B\", \"C\"],\n    logical_types.Datetime.__name__: [\n        dt(2020, 1, 1, 12, 0, 0),\n        dt(2020, 6, 1, 12, 0, 0),\n    ],\n    logical_types.Double.__name__: [1.2, 2.3, 3.4],\n    logical_types.Integer.__name__: [1, 2, 3],\n    logical_types.IntegerNullable.__name__: [1, 2, 3, pd.NA],\n    logical_types.EmailAddress.__name__: [\n        \"john.smith@example.com\",\n        \"sally.jones@example.com\",\n    ],\n    logical_types.LatLong.__name__: [(1, 2), (3, 4)],\n    logical_types.NaturalLanguage.__name__: [\n        \"This is sentence 1\",\n        \"This is sentence 2\",\n    ],\n    logical_types.Ordinal.__name__: [1, 2, 3],\n    logical_types.URL.__name__: [\"https://www.example.com\", \"https://www.example2.com\"],\n    logical_types.PostalCode.__name__: [\"60018\", \"60018-0123\"],\n}\n\n\ndef generate_fake_dataframe(\n    col_defs=[(\"f_1\", \"Numeric\"), (\"f_2\", \"Datetime\", \"time_index\")],\n    n_rows=10,\n    df_name=\"df\",\n):\n    def randomize(values_):\n        random.seed(10)\n        values = values_.copy()\n        random.shuffle(values)\n        return values\n\n    def gen_series(values):\n        values = [values] * n_rows\n        if isinstance(values, list):\n            values = flatten_list(values)\n\n        return randomize(values)[:n_rows]\n\n    def get_tags(lt, tags=set()):\n        inferred_tags = ww_type_system.str_to_logical_type(lt).standard_tags\n        assert isinstance(inferred_tags, set)\n        return inferred_tags.union(tags) - {\"index\", \"time_index\"}\n\n    
other_kwargs = {}\n\n    df = pd.DataFrame()\n    lt_dict = {}\n    tags_dict = {}\n    for name, lt_name, *rest in col_defs:\n        if lt_name in logical_type_mapping:\n            values = logical_type_mapping[lt_name]\n            if lt_name == logical_types.Ordinal.__name__:\n                lt = logical_types.Ordinal(order=values)\n            else:\n                lt = lt_name\n            values = gen_series(values)\n        else:\n            raise Exception(f\"Unknown logical type {lt_name}\")\n\n        lt_dict[name] = lt\n\n        if len(rest):\n            tags = rest[0]\n            if \"index\" in tags:\n                other_kwargs[\"index\"] = name\n                values = range(n_rows)\n            if \"time_index\" in tags:\n                other_kwargs[\"time_index\"] = name\n                values = pd.date_range(\"2000-01-01\", periods=n_rows)\n            tags_dict[name] = get_tags(lt_name, tags)\n        else:\n            tags_dict[name] = get_tags(lt_name)\n\n        s = pd.Series(values, name=name)\n        df = pd.concat([df, s], axis=1)\n\n    df.ww.init(\n        name=df_name,\n        logical_types=lt_dict,\n        semantic_tags=tags_dict,\n        **other_kwargs,\n    )\n\n    return df\n"
  },
  {
    "path": "featuretools/tests/testing_utils/mock_ds.py",
    "content": "from datetime import datetime\n\nimport numpy as np\nimport pandas as pd\nfrom woodwork.logical_types import (\n    URL,\n    Boolean,\n    Categorical,\n    CountryCode,\n    Datetime,\n    Double,\n    EmailAddress,\n    Filepath,\n    Integer,\n    IPAddress,\n    LatLong,\n    NaturalLanguage,\n    Ordinal,\n    PersonFullName,\n    PhoneNumber,\n    PostalCode,\n    SubRegionCode,\n)\n\nfrom featuretools.entityset import EntitySet\n\n\ndef make_ecommerce_entityset(with_integer_time_index=False):\n    \"\"\"Makes a entityset with the following shape:\n\n      R         Régions\n     / \\\\       .\n    S   C       Stores, Customers\n        |       .\n        S   P   Sessions, Products\n         \\\\ /   .\n          L     Log\n    \"\"\"\n    dataframes = make_ecommerce_dataframes(\n        with_integer_time_index=with_integer_time_index,\n    )\n    dataframe_names = dataframes.keys()\n    es_id = \"ecommerce\"\n    if with_integer_time_index:\n        es_id += \"_int_time_index\"\n\n    logical_types = make_logical_types(with_integer_time_index=with_integer_time_index)\n    semantic_tags = make_semantic_tags()\n    time_indexes = make_time_indexes(with_integer_time_index=with_integer_time_index)\n\n    es = EntitySet(id=es_id)\n\n    for df_name in dataframe_names:\n        time_index = time_indexes.get(df_name, None)\n        ti_name = None\n        secondary = None\n        if time_index is not None:\n            ti_name = time_index[\"name\"]\n            secondary = time_index[\"secondary\"]\n        df = dataframes[df_name]\n        es.add_dataframe(\n            df,\n            dataframe_name=df_name,\n            index=\"id\",\n            logical_types=logical_types[df_name],\n            semantic_tags=semantic_tags[df_name],\n            time_index=ti_name,\n            secondary_time_index=secondary,\n        )\n\n    es.normalize_dataframe(\n        \"customers\",\n        \"cohorts\",\n        \"cohort\",\n        
additional_columns=[\"cohort_name\"],\n        make_time_index=True,\n        new_dataframe_time_index=\"cohort_end\",\n    )\n\n    es.add_relationships(\n        [\n            (\"régions\", \"id\", \"customers\", \"région_id\"),\n            (\"régions\", \"id\", \"stores\", \"région_id\"),\n            (\"customers\", \"id\", \"sessions\", \"customer_id\"),\n            (\"sessions\", \"id\", \"log\", \"session_id\"),\n            (\"products\", \"id\", \"log\", \"product_id\"),\n        ],\n    )\n\n    return es\n\n\ndef make_ecommerce_dataframes(with_integer_time_index=False):\n    region_df = pd.DataFrame(\n        {\"id\": [\"United States\", \"Mexico\"], \"language\": [\"en\", \"sp\"]},\n    )\n\n    store_df = pd.DataFrame(\n        {\n            \"id\": range(6),\n            \"région_id\": [\"United States\"] * 3 + [\"Mexico\"] * 2 + [np.nan],\n            \"num_square_feet\": list(range(30000, 60000, 6000)) + [np.nan],\n        },\n    )\n\n    product_df = pd.DataFrame(\n        {\n            \"id\": [\n                \"Haribo sugar-free gummy bears\",\n                \"car\",\n                \"toothpaste\",\n                \"brown bag\",\n                \"coke zero\",\n                \"taco clock\",\n            ],\n            \"department\": [\n                \"food\",\n                \"electronics\",\n                \"health\",\n                \"food\",\n                \"food\",\n                \"electronics\",\n            ],\n            \"rating\": [3.5, 4.0, 4.5, 1.5, 5.0, 5.0],\n            \"url\": [\n                \"google.com\",\n                \"https://www.featuretools.com/\",\n                \"amazon.com\",\n                \"www.featuretools.com\",\n                \"bit.ly\",\n                \"featuretools.com/demos/\",\n            ],\n        },\n    )\n    customer_times = {\n        \"signup_date\": [\n            datetime(2011, 4, 8),\n            datetime(2011, 4, 9),\n            datetime(2011, 4, 6),\n      
  ],\n        # some point after signup date\n        \"upgrade_date\": [\n            datetime(2011, 4, 10),\n            datetime(2011, 4, 11),\n            datetime(2011, 4, 7),\n        ],\n        \"cancel_date\": [\n            datetime(2011, 6, 8),\n            datetime(2011, 10, 9),\n            datetime(2012, 1, 6),\n        ],\n        \"birthday\": [datetime(1993, 3, 8), datetime(1926, 8, 2), datetime(1993, 4, 20)],\n    }\n    if with_integer_time_index:\n        customer_times[\"signup_date\"] = [6, 7, 4]\n        customer_times[\"upgrade_date\"] = [18, 26, 5]\n        customer_times[\"cancel_date\"] = [27, 28, 29]\n        customer_times[\"birthday\"] = [2, 1, 3]\n\n    customer_df = pd.DataFrame(\n        {\n            \"id\": pd.Categorical([0, 1, 2]),\n            \"age\": [33, 25, 56],\n            \"région_id\": [\"United States\"] * 3,\n            \"cohort\": [0, 1, 0],\n            \"cohort_name\": [\"Early Adopters\", \"Late Adopters\", \"Early Adopters\"],\n            \"loves_ice_cream\": [True, False, True],\n            \"favorite_quote\": [\n                \"The proletariat have nothing to lose but their chains\",\n                \"Capitalism deprives us all of self-determination\",\n                \"All members of the working classes must seize the \"\n                \"means of production.\",\n            ],\n            \"signup_date\": customer_times[\"signup_date\"],\n            # some point after signup date\n            \"upgrade_date\": customer_times[\"upgrade_date\"],\n            \"cancel_date\": customer_times[\"cancel_date\"],\n            \"cancel_reason\": [\"reason_1\", \"reason_2\", \"reason_1\"],\n            \"engagement_level\": [1, 3, 2],\n            \"full_name\": [\"Mr. John Doe\", \"Doe, Mrs. 
Jane\", \"James Brown\"],\n            \"email\": [\"john.smith@example.com\", np.nan, \"team@featuretools.com\"],\n            \"phone_number\": [\"555-555-5555\", \"555-555-5555\", \"1-(555)-555-5555\"],\n            \"birthday\": customer_times[\"birthday\"],\n        },\n    )\n\n    ips = [\n        \"192.168.0.1\",\n        \"2001:4860:4860::8888\",\n        \"0.0.0.0\",\n        \"192.168.1.1:2869\",\n        np.nan,\n        np.nan,\n    ]\n    filepaths = [\n        \"/home/user/docs/Letter.txt\",\n        \"./inthisdir\",\n        \"C:\\\\user\\\\docs\\\\Letter.txt\",\n        \"~/.rcinfo\",\n        \"../../greatgrandparent\",\n        \"data.json\",\n    ]\n\n    session_df = pd.DataFrame(\n        {\n            \"id\": [0, 1, 2, 3, 4, 5],\n            \"customer_id\": pd.Categorical([0, 0, 0, 1, 1, 2]),\n            \"device_type\": [0, 1, 1, 0, 0, 1],\n            \"device_name\": [\"PC\", \"Mobile\", \"Mobile\", \"PC\", \"PC\", \"Mobile\"],\n            \"ip\": ips,\n            \"filepath\": filepaths,\n        },\n    )\n\n    times = list(\n        [datetime(2011, 4, 9, 10, 30, i * 6) for i in range(5)]\n        + [datetime(2011, 4, 9, 10, 31, i * 9) for i in range(4)]\n        + [datetime(2011, 4, 9, 10, 40, 0)]\n        + [datetime(2011, 4, 10, 10, 40, i) for i in range(2)]\n        + [datetime(2011, 4, 10, 10, 41, i * 3) for i in range(3)]\n        + [datetime(2011, 4, 10, 11, 10, i * 3) for i in range(2)],\n    )\n    if with_integer_time_index:\n        times = list(range(8, 18)) + list(range(19, 26))\n\n    values = list(\n        [i * 5 for i in range(5)]\n        + [i * 1 for i in range(4)]\n        + [0]\n        + [i * 5 for i in range(2)]\n        + [i * 7 for i in range(3)]\n        + [np.nan] * 2,\n    )\n\n    values_2 = list(\n        [i * 2 for i in range(5)]\n        + [i * 1 for i in range(4)]\n        + [0]\n        + [i * 2 for i in range(2)]\n        + [i * 3 for i in range(3)]\n        + [np.nan] * 2,\n    )\n\n    
values_many_nans = list(\n        [np.nan] * 5\n        + [i * 1 for i in range(4)]\n        + [0]\n        + [np.nan] * 2\n        + [i * 3 for i in range(3)]\n        + [np.nan] * 2,\n    )\n\n    latlong = list([(values[i], values_2[i]) for i, _ in enumerate(values)])\n    latlong2 = list([(values_2[i], -values[i]) for i, _ in enumerate(values)])\n    zipcodes = list(\n        [\"02116\"] * 5\n        + [\"02116-3899\"] * 4\n        + [\"0\"]\n        + [\"1234567890\"] * 2\n        + [\"12345-6789\"] * 2\n        + [np.nan] * 3,\n    )\n    countrycodes = list([\"US\"] * 5 + [\"AL\"] * 4 + [np.nan] * 5 + [\"ALB\"] * 2 + [\"USA\"])\n    subregioncodes = list(\n        [\"US-AZ\"] * 5 + [\"US-MT\"] * 4 + [np.nan] * 3 + [\"UG-219\"] * 2 + [\"ZM-06\"] * 3,\n    )\n    log_df = pd.DataFrame(\n        {\n            \"id\": range(17),\n            \"session_id\": [0] * 5 + [1] * 4 + [2] * 1 + [3] * 2 + [4] * 3 + [5] * 2,\n            \"product_id\": [\"coke zero\"] * 3\n            + [\"car\"] * 2\n            + [\"toothpaste\"] * 3\n            + [\"brown bag\"] * 2\n            + [\"Haribo sugar-free gummy bears\"]\n            + [\"coke zero\"] * 4\n            + [\"taco clock\"] * 2,\n            \"datetime\": times,\n            \"value\": values,\n            \"value_2\": values_2,\n            \"latlong\": latlong,\n            \"latlong2\": latlong2,\n            \"zipcode\": zipcodes,\n            \"countrycode\": countrycodes,\n            \"subregioncode\": subregioncodes,\n            \"value_many_nans\": values_many_nans,\n            \"priority_level\": [0] * 2 + [1] * 5 + [0] * 6 + [2] * 2 + [1] * 2,\n            \"purchased\": [True] * 11 + [False] * 4 + [True, False],\n            \"url\": [\"https://www.featuretools.com/\"] * 2\n            + [\"amazon.com\"] * 2\n            + [\n                \"www.featuretools.com\",\n                \"bit.ly\",\n                \"featuretools.com/demos/\",\n                \"www.google.co.in/\" 
\"http://lplay.google.co.in\",\n                \" \",\n                \"invalid_url\",\n                \"an\",\n                \"microsoft.com/search/\",\n            ]\n            + [np.nan] * 5,\n            \"email_address\": [\"john.smith@example.com\", np.nan, \"team@featuretools.com\"]\n            * 5\n            + [\" prefix@space.com\", \"suffix@space.com \"],\n            \"comments\": [coke_zero_review()]\n            + [\"I loved it\"] * 2\n            + car_reviews()\n            + toothpaste_reviews()\n            + brown_bag_reviews()\n            + [gummy_review()]\n            + [\"I loved it\"] * 4\n            + taco_clock_reviews(),\n        },\n    )\n\n    return {\n        \"régions\": region_df,\n        \"stores\": store_df,\n        \"products\": product_df,\n        \"customers\": customer_df,\n        \"sessions\": session_df,\n        \"log\": log_df,\n    }\n\n\ndef make_semantic_tags():\n    store_semantic_tags = {\"région_id\": \"foreign_key\"}\n\n    customer_semantic_tags = {\"région_id\": \"foreign_key\", \"birthday\": \"date_of_birth\"}\n\n    session_semantic_tags = {\"customer_id\": \"foreign_key\"}\n\n    log_semantic_tags = {\"session_id\": \"foreign_key\"}\n\n    return {\n        \"customers\": customer_semantic_tags,\n        \"sessions\": session_semantic_tags,\n        \"log\": log_semantic_tags,\n        \"products\": {},\n        \"stores\": store_semantic_tags,\n        \"régions\": {},\n    }\n\n\ndef make_logical_types(with_integer_time_index=False):\n    region_logical_types = {\"id\": Categorical, \"language\": Categorical}\n\n    store_logical_types = {\n        \"id\": Integer,\n        \"région_id\": Categorical,\n        \"num_square_feet\": Double,\n    }\n\n    product_logical_types = {\n        \"id\": Categorical,\n        \"rating\": Double,\n        \"department\": Categorical,\n        \"url\": URL,\n    }\n\n    customer_logical_types = {\n        \"id\": Integer,\n        \"age\": Integer,\n     
   \"région_id\": Categorical,\n        \"loves_ice_cream\": Boolean,\n        \"favorite_quote\": NaturalLanguage,\n        \"signup_date\": Datetime(datetime_format=\"%Y-%m-%d\"),\n        \"upgrade_date\": Datetime(datetime_format=\"%Y-%m-%d\"),\n        \"cancel_date\": Datetime(datetime_format=\"%Y-%m-%d\"),\n        \"cancel_reason\": Categorical,\n        \"engagement_level\": Ordinal(order=[1, 2, 3]),\n        \"full_name\": PersonFullName,\n        \"email\": EmailAddress,\n        \"phone_number\": PhoneNumber,\n        \"birthday\": Datetime(datetime_format=\"%Y-%m-%d\"),\n        \"cohort_name\": Categorical,\n        \"cohort\": Integer,\n    }\n\n    session_logical_types = {\n        \"id\": Integer,\n        \"customer_id\": Integer,\n        \"device_type\": Categorical,\n        \"device_name\": Categorical,\n        \"ip\": IPAddress,\n        \"filepath\": Filepath,\n    }\n\n    log_logical_types = {\n        \"id\": Integer,\n        \"session_id\": Integer,\n        \"product_id\": Categorical,\n        \"datetime\": Datetime(datetime_format=\"%Y-%m-%d\"),\n        \"value\": Double,\n        \"value_2\": Double,\n        \"latlong\": LatLong,\n        \"latlong2\": LatLong,\n        \"zipcode\": PostalCode,\n        \"countrycode\": CountryCode,\n        \"subregioncode\": SubRegionCode,\n        \"value_many_nans\": Double,\n        \"priority_level\": Ordinal(order=[0, 1, 2]),\n        \"purchased\": Boolean,\n        \"url\": URL,\n        \"email_address\": EmailAddress,\n        \"comments\": NaturalLanguage,\n    }\n    if with_integer_time_index:\n        log_logical_types[\"datetime\"] = Integer\n        customer_logical_types[\"signup_date\"] = Integer\n        customer_logical_types[\"upgrade_date\"] = Integer\n        customer_logical_types[\"cancel_date\"] = Integer\n        customer_logical_types[\"birthday\"] = Integer\n\n    return {\n        \"customers\": customer_logical_types,\n        \"sessions\": 
session_logical_types,\n        \"log\": log_logical_types,\n        \"products\": product_logical_types,\n        \"stores\": store_logical_types,\n        \"régions\": region_logical_types,\n    }\n\n\ndef make_time_indexes(with_integer_time_index=False):\n    return {\n        \"customers\": {\n            \"name\": \"signup_date\",\n            \"secondary\": {\"cancel_date\": [\"cancel_reason\"]},\n        },\n        \"log\": {\"name\": \"datetime\", \"secondary\": None},\n    }\n\n\ndef coke_zero_review():\n    return \"\"\"\nWhen it comes to Coca-Cola products, people tend to be die-hard fans. Many of us know someone who can't go a day without a Diet Coke (or two or three). And while Diet Coke has been a leading sugar-free soft drink since it was first released in 1982, it came to light that young adult males shied away from this beverage — identifying diet cola as a woman's drink. The company's answer to that predicament came in 2005 - in the form of a shiny black can - with the release of Coca-Cola Zero.\n\nWhile Diet Coke was created with its own flavor profile and not as a sugar-free version of the original, Coca-Cola Zero aims to taste just like the \"real Coke flavor.\" Despite their polar opposite advertising campaigns, the contents and nutritional information of the two sugar-free colas is nearly identical. With that information in hand we at HuffPost Taste needed to know: Which of these two artificially-sweetened Coca-Cola beverages actually tastes better? And can you even tell the difference between them?\n\nBefore we get to the results of our taste test, here are the facts:\n\n\nDiet Coke\n\nMotto: Always Great Tast\nNutritional Information: Many say that a can of Diet Coke actually contains somewhere between 1-4 calories, but if a serving size contains fewer than 5 calories a company is not obligated to note it in its nutritional information. 
Diet Coke's nutritional information reads 0 Calories, 0g Fat, 40mg Sodium, 0g Total Carbs, 0g Protein.\n\nIngredients: Carbonated water, caramel color, aspartame, phosphoric acid, potassium benzonate, natural flavors, citric acid, caffeine.\n\nArtificial sweetener: Aspartame\n\n\nCoca-Cola Zero\nMotto: Real Coca-Cola Taste AND Zero Calories\n\nNutritional Information: While the label clearly advertises this beverage as a zero calorie cola, we are not entirely certain that its minimal calorie content is simply not required to be noted in the nutritional information. Coca-Cola Zero's nutritional information reads 0 Calories, 0g Fat, 40mg Sodium, 0g Total Carbs, 0g Protein.\n\nArtificial sweetener: Aspartame and acesulfame potassium\n\nIngredients: Carbonated water, caramel color, phosphoric acid, aspartame, potassium benzonate, natural flavors, potassium citrate, acesulfame potassium, caffeine.\n\nThe Verdict:\nTwenty-four editors blind-tasted the two cokes, side by side, and...\n\n54 percent of our tasters were able to distinguish Diet Coke from Coca-Cola Zero\n50 percent of our tasters preferred Diet Coke to Coca-Cola Zero, and vice versa\nHere’s what our tasters thought of the two sugar-free soft drinks:\n\nDiet Coke: \"Tastes fake right away.\" \"Much fresher brighter, crisper.\" \"Has the wonderful flavors of Diet Coke’s artificial sweeteners.\"\n\nCoca-Cola Zero: \"Has more of a sharply sweet aftertaste I associate with diet sodas.\" \"Tastes more like regular coke, less like fake sweetener.\" \"Has an odd taste.\" \"Tastes more like regular.\" \"Very sweet.\"\n\nOverall comments: \"That was a lot more difficult than I though it would be.\" \"Both equally palatable.\" A few people said Diet Coke tasted much better ... unbeknownst to them, they were actually referring to Coca-Cola Zero.\n\nIN SUMMARY: It is a real toss up. There is not one artificially-sweetened Coca-Cola beverage that outshines the other. So how do people choose between one or the other? 
It is either a matter of personal taste, or maybe the marketing campaigns will influence their choice.\n\"\"\"\n\n\ndef gummy_review():\n    return \"\"\"\nThe place: BMO Harris Bradley Center\nThe event: Bucks VS Spurs\nThe snack: Satan's Diarrhea Hate Bears made by Haribo\n\nI recently took my 4 year old son to his first NBA game. He was very excited to go to the game, and I was excited because we had fantastic seats. Row C center court to be exact. I've never sat that close before. I've never had to go DOWN stairs to get to my seats. 24 stairs to get to my seats to be exact.\n\nHis favorite candy is Skittles. Mine are anything gummy. I snuck in a bag of skittles for my son, and grabbed a handful of gummy bears for myself, to be later known as Satan's Diarrhea Hate Bears, that I received for Christmas in bulk from my parents, and put them in a zip lock bag.\n\nAfter the excitement of the 1st quarter has ended I take my son out to get him a bottled water and myself a beer. We return to our seats to enjoy our candy and drinks.\n\n..............fast forward until 1 minute before half time...........\n\nI have begun to sweat a sweat that is only meant for a man on mile 19 of a marathon. I have kicked out my legs out so straight that I am violently pushing the gentleman wearing a suit seat in front of me forward. He is not happy, I do not care. My hands are on the side of my seat not unlike that of a gymnast on a pommel horse, lifting me off my chair. My son is oblivious to what is happening next to him, after all, there is a mascot running around somewhere and he is eating candy.\n\nI realize that at some point in the very near to immediate future I am going to have to allow this lava from Satan to forcefully expel itself from my innards. I also realize that I have to walk up 24 stairs just to get to level ground in hopes to make it to the bathroom. I’ll just have to sit here stiff as a board for a few moments waiting for the pain to subside. 
About 30 seconds later there is a slight calm in the storm of the violent hurricane that is going on in my lower intestine. I muster the courage to gently relax every muscle in my lower half and stand up. My son stands up next to me and we start to ascend up the stairs. I take a very careful and calculated step up the first stair. Then a very loud horn sounds. Halftime. Great. It’s going to be crowded. The horn also seems to have awaken the Satan's Diarrhea Hate Bears that are having a mosh pit in my stomach. It literally felt like an avalanche went down my stomach and I again have to tighten every muscle and stand straight up and focus all my energy on my poor sphincter to tighten up and perform like it has never performed before. Taking another step would be the worst idea possible, the flood gates would open. Don’t worry, Daddy has a plan. I some how mumble the question, “want to play a game?” to my son, he of course says “yes”. My idea is to hop on both feet allllll the way up the stairs, using the center railing to propel me up each stair. My son is always up for a good hopping game, so he complies and joins in on the “fun”. Some old lady 4 steps up thinks its cute that we are doing this, obviously she wasn’t looking at the panic on my face. 3 rows behind her a man about the same age as me, who must have had similar situations, notices the fear/panic/desperation on my face understands the danger that I along with my pants and anyone within a 5 yard radius spray zone are in. He just mouths the words “good luck man” to me and I press on. Half way up and there is no leakage, but my legs are getting tired and my sphincter has never endured this amount of pressure for this long of time. 
16 steps/hops later…….4 steps to go…….My son trips and falls on the stairs, I have two options: keep going knowing he will catch up or bend down to pick him up relieving my sphincter of all the pressure and commotion while ruining the day of roughly the 50 people that are now watching a grown man hop up stairs while sweating profusely next to a 4 year old boy.\n\nLuckily he gets right back up and we make it to the top of the stairs. Good, the hard part was over. Or so I thought. I managed to waddle like a penguin, or someone who is about to poop their pants in 2.5 seconds, to the men's room only to find that every stall is being used. EVERY STALL. It's halftime, of course everyone has to poop at that moment. I don't know if I can wait any longer, do I go ahead and fulfil the dream of every high school boy and poop in the urinal? What kind of an example would that set for my son? On the other hand, what kind of an example would it be for his father to fill his pants with a substance that probably will be unrecognizable to man. Suddenly a stall door opens, and I think I manage to actually levitate over to the stall. I my son follows me in, luckily it was the handicap stall so there was room for him to be out of the way. I get my pants off and start to sit. I know what taking a giant poo feels like. I also know what vomiting feels like. I can now successfully say that I know what it is like to vomit out my butt. I wasn't pooping, those Satan's Diarrhea Hate Bears did something to my insides that made my sphincter vomit our the madness.\n\nI am now conscious of my surroundings. Other than the war that the bottom half of my body is currently having with this porcelain chair, it is quiet as a pin drop in the bathroom. The other men in there can sense that something isn't right, no one has heard anyone ever poop vomit before.\n\nI can sense that the worst part is over. But its not stopping, nor can I physically stop it at this point, I am leaking..it's horrible. 
I call out \"does anyone have a diaper?\" hoping that some gentleman was changing a baby. Nothing. No one said a word. I know people are in there, I can see the toes of shoes pointed in my direction under the stall.. \"DOES ANYONE HAVE A DIAPER!?!\" I am screaming, my son is now crying, he thinks he is witnessing the death of his father. I can't even assure him that I will make it.\n\nNot a word was said, but a diaper was thrown over the stall. I catch it, line my underwear with it, put my pants back on, and walk out of that bathroom like a champ. We go straight to our seats, grab out coats and go home. As we are walking out, the gentleman that wished me good luck earlier simply put his fist out, and I happily bumped it.\n\nMy son asks me, \"Daddy, why are we leaving early?\"\n\"Well son, I need to change my diaper\"\n\"\"\"\n\n\ndef taco_clock_reviews():\n    return [\n        \"\"\"\nThis timer does what it is supposed to do. Setup is elementary. Replacing the old one (after 12 years) was relatively easy. It has performed flawlessly since. I'm delighted I could find an esoteric product like this at Amazon. Their service, and the customer reviews, are just excellent.\n\"\"\",\n        \"\"\"\nFunny, cute clock. A little spendy for how light the clock is, but its hard to find a taco clock.\n\"\"\",\n    ]\n\n\ndef brown_bag_reviews():\n    return [\n        \"\"\"\nThese bags looked exactly like I'd hoped, however, the handles broke off of almost every single bag as soon as items were placed in them! I used these as gift bags for out-of-town guests at my wedding, so imagine my embarassment as the handles broke off as I was handing them out. I would not recommend purchaing these bags unless you plan to fill them with nothing but paper! Anything heavier will cause the handles to snap right off.\n\"\"\",\n        \"\"\"\nI purchased these in August 2014 from Big Blue Supplies. 
I have no problem with the seller, these arrived new condition, fine shape.\n\nI do have a slight problem with the bags. In case someone might want to know, the handles on these bags are set inside against the top. Then a piece of Kraft type packing tape is placed over the handles to hold them in place. On some of the bags, the tape is already starting to peel off. I would be really hesitant about using these bags unless I reinforced the current tape with a different adhesive.\n\nI will keep the bags, and make a tape of a holiday or decorative theme and place over in order to make certain the handles stay in place.\n\nAlso in case anybody is wondering, the label on the plastic packaging bag states these are from ORIENTAL TRADING COMPANY. On the bottom of each bag it is stamped MADE IN CHINA. Again, I will be placing a sticker over that.\n\nEven the dollar store bags I normally purchase do not have that stamped on the bottom in such prominent lettering. I purchased these because they were plain and I wanted to decorate them.\n\nI do not think I would purchase again for all the reasons stated above.\n\nAnother thing for those still wanting to purchase, the ones I received were: 12 3/4 inches high not including handle, 10 1/4 inches wide and a 5 1/4 inch depth.\n\"\"\",\n    ]\n\n\ndef car_reviews():\n    return [\n        \"\"\"\nThe full-size pickup truck and the V-8 engine were supposed to be inseparable, like the internet and cat videos. You can’t have one without the other—or so we thought.\n\nIn America’s most popular vehicle, the Ford F-150, two turbocharged six-cylinder engines marketed under the EcoBoost name have dethroned the naturally aspirated V-8. Ford’s new 2.7-liter twin-turbo V-6 is the popular choice, while the 3.5-liter twin-turbo V-6 is the top performer. The larger six allows for greater hauling capacity, accelerates the truck more quickly, and swills less gas in EPA testing than the V-8 alternative. 
It’s enough to make even old-school truck buyers acknowledge that there actually is a replacement for displacement.\n\nAnd yet a V-8 in a big pickup truck still feels so natural, so right. In the F-150, the Coyote 5.0-liter V-8 is tuned for torque more so than power, yet it still revs with an enthusiastic giddy-up that reminds us that this engine’s other job is powering the Mustang. The response follows the throttle pedal faithfully while the six-speed automatic clicks through gears smoothly and easily. Together they pull this 5220-pound F-150 to 60 mph in 6.3 seconds, which is 0.4 second quicker than the 5.3-liter Chevrolet Silverado with the six-speed automatic and 0.9 second quicker than the 5.3 Silverado with the new eight-speed auto. The 3.5-liter EcoBoost, though, can do the deed another half-second quicker, but its synthetic soundtrack doesn’t have the rich, multilayered tone of the V-8.\n\nIt wasn’t until we saddled our test truck with a 6400-pound trailer (well under its 9000-pound rating) that we fully understood the case for upgrading to the 3.5-liter EcoBoost. The twin-turbo engine offers an extra 2500 pounds of towing capability and handles lighter tasks with considerably less strain. The 5.0-liter truck needs more revs and a wider throttle opening to accelerate its load, so we were often coaxed into pressing the throttle to the floor for even modest acceleration. The torquier EcoBoost engine offers a heartier response at part throttle.\n\nIn real-world, non-towing situations, the twin-turbo 3.5-liter doesn’t deliver on its promise of increased fuel economy, with both the 5.0-liter V-8 and that V-6 returning 16 mpg in our hands. But given the 3.5-liter’s virtues, we can forgive it that trespass.\n\nTrucks Are the New Luxury\n\nPickups once were working-class transportation. Today, they’re proxy luxury vehicles—or at least that’s how they’re priced. 
If you think our test truck’s $57,240 window sticker is steep, consider that our model, the Lariat, is merely a mid-spec trim. There are three additional grades—King Ranch, Platinum, and Limited—positioned and priced above it, plus the 3.5-liter EcoBoost that costs an extra $400 as well as a plethora of options to inflate the price past 60 grand. Squint and you can almost see the six-figure trucks of the future on the horizon.\n\nFor the most part, though, the equipment in this particular Lariat lives up to the price tag. The driver and passenger seats are heated and cooled, with 10-way power adjustability and supple leather. The technology includes blind-spot monitoring, navigation, and a 110-volt AC outlet. Nods to utility include spotlights built into the side mirrors and Ford’s Pro Trailer Backup Assist, which makes reversing with a trailer as easy as turning a tiny knob on the dashboard.\n\nMiddle-Child Syndrome\n\nIn the F-150, Ford has a trifecta of engines (the fourth, a naturally aspirated 3.5-liter V-6, is best left to the fleet operators). The 2.7-liter twin-turbo V-6 delivers remarkable performance at an affordable price. The 3.5-liter twin-turbo V-6 is the workhorse, with power, torque, and hauling capability to spare. Compared with those two logical options, the middle-child 5.0-liter V-8 is the right-brain choice. Its strongest selling points may be its silky power delivery and the familiar V-8 rumble. That’s a flimsy argument when it comes to rationalizing a $50,000-plus purchase, though, so perhaps it’s no surprise that today’s boosted six-cylinders are now the engines of choice in the F-150.\n\"\"\",\n        \"\"\"\nTHE GOOD\nThe Tesla Model S 90D's electric drivetrain is substantially more efficient than any internal combustion engine, and gives the car smooth and quick acceleration. All-wheel drive comes courtesy of a smart dual motor system. 
The new Autopilot feature eases the stress of stop-and-go traffic and long road trips.\n\nTHE BAD\nEven at Tesla's Supercharger stations, recharging the battery takes significantly longer than refilling an internal combustion engine car's gas tank, limiting where you can drive. Tesla hasn't improved its infotainment system much from the Model S' launch.\n\nTHE BOTTOM LINE\nAmong the different flavors of Tesla Model S, the 90D is the one to get, exhibiting the best range and all-wheel drive, while offering an uncomplicated, next-generation driving experience that shows very well against equally priced competitors.\n\n\nREVIEW  SPECIFICATIONS  PHOTOS\nRoadshow Automobiles Tesla 2016 Tesla Model S\nHaving tested driver assistance systems in many cars, and even ridden in fully self-driving cars, I should have been ready for Tesla's new Autopilot feature. But engaging it while cruising the freeway in the Model S 90D, I kept my foot hovering over the brake.\n\nMy trepidation didn't come so much from the adaptive cruise control, which kept the Model S following traffic ahead at a set distance, but from the self-steering, this part of Autopilot managing to keep the Model S well-centered in its lane with no help from me. Over many miles, I built up more trust in the system, letting the car do the steering in situations from bumper-to-bumper traffic and a winding road through the hills.\n\n2016 Tesla Model S 90DEnlarge Image\nAlthough the middle of the Model S range, the 90D offers the best range and a wealth of useful tech, such as Autopilot self-driving.\nWayne Cunningham/Roadshow\nTesla added Autopilot to its Model S line as an option last year, along with all-wheel-drive. More recently, the high-tech automaker improved its batteries, upgrading its cars from their former 65 and 85 kilowatt-hour capacity to 70 and 90 kilowatt-hour. 
The example I drove, the 90D, represents all these advances.\n\nMore importantly, the 90D is the current range-leader among the Model S line, boasting 288 miles on a full battery charge.\n\nThe Model S' improvements fall outside of typical automotive industry product cycles, fulfilling Tesla's promise of acting more like a technology company, constantly building and deploying new features. Tesla accomplishes that goal partially through over-the-air software updates, improving existing cars, but the 90D presents significant hardware updates over the original Model S launched four years ago.\n\nSit and go\nOf course, this Model S exhibited the ease of use of the original. Walking up to the car with the key fob in my pocket, it automatically unlocked. When I got in the car, it powered up without me having to push a start button, so I only needed to put it in drive to get on the road.\n\nLikewise, the design hasn't changed, its sleek, hatchback four-door body offering excellent cargo room, both front and back, and seating space. The cabin feels less cramped than most cars due to the lack of a transmission tunnel and a dashboard bare of buttons or dials.\n\n2016 Tesla Model S 90DEnlarge Image\nThe flat floor in the Model S' cabin makes for enhanced passenger room.\nWayne Cunningham/Roadshow\nThe big, 17-inch touchscreen in the center of the dashboard shows navigation, stereo, phone, energy consumption and car settings. I easily went from full-screen to a split-screen view, the windows showing each appearing instantly. A built-in 4G/LTE data connection powers Google maps and Internet-based audio. 
The LCD instrument panel in front of me showed my speed, energy usage, remaining range, and intelligently swapped audio information for turn-by-turn directions when started navigation.\n\nThe instrument panel actually made the experience of driving under Autopilot more comfortable, reassuring me with graphics that showed when the Model S' sensors were detecting the lane lines and the traffic around me. Impressively, the sensors could differentiate, as shown on the screen's graphics, a passenger car from a big truck.\n\nAt speed on the freeway, Autopilot smoothly maintained the car's position in its lane, and when I took my hands off the wheel for too long, it flashed a warning on the instrument panel. In stop-and-go traffic approaching a toll booth, the car did an even better job of self-driving, recognizing traffic around it and maintaining appropriate distances.\n\nHandling surprise\nTaking over the driving myself, the ride quality proved as comfortable as any sport-luxury car, as this Model S had its optional air suspension. The electric power steering is well-tuned, turning the wheels with a quiet, natural feel and good heft.\n\nAudi S7 vs Tesla Model S\nShootout: Audi S7 vs. Tesla Model S\nWayne Cunningham/Roadshow\nThe biggest surprise came when I spent the day doing laps at the Thunderhill Raceway, negotiating a series of tight, technical turns in competition with an Audi S7. I expected the Model S to get out-of-shape in the turns, but instead it proved steady and solid. The Model S' 4,647-pound curb weight made it less than ideal for a track test, but much of that weight is in the battery pack, mounted low in the chassis. That low center of gravity helped limit body roll, ensuring good grip from all four tires. In the turns, the Model S felt nicely balanced, although not entirely nimble.\n\nHelping its grip was its native all-wheel drive, gained from having motors driving each set of wheels. 
The combined output of the motors comes to 417 horsepower and 485 pound-feet of torque, those numbers expressed in 0-to-60 mph times of well under 5 seconds. That thrust made for fast runs down the race track's straightaways, or simply giving me the ability to take advantage of gaps in traffic on public roads.\n\n288 miles is more than enough for most people's daily driving needs, and if you plug in every night, you will wake up to a fully charged car every morning. The Model S makes for a far different experience than driving an internal combustion car, where you need to go to a gas station to refuel. However, longer trips in the Model S require some planning, such as scheduling stops at Tesla's free Supercharger stations.\n\n\nCharging times are much lengthier than refilling a tank with gasoline. From a Level 2, 240-volt station, you get 29 miles added every hour. Tesla's Supercharger, a Level 3 charger, takes 75 minutes to fully recharge the Model S 90D's battery.\n\n2016 Tesla Model S 90DEnlarge Image\nDespite its high initial price, the Model S 90D costs less to run on a daily basis than a combustion engine car.\nWayne Cunningham/Roadshow\n\nLow maintenance\nThe 2016 Tesla Model S 90D adds features to keep it competitive against the internal combustion cars in its sport luxury set. More importantly, it remains very easy to live with. In fact, the electric drivetrain should mean greatly decreased maintenance, as there are fewer moving parts. The EPA estimates that annual electricity costs for the Model S 90D should run $650, much less than buying gasoline for an equivalent internal combustion car.\n\nLengthy charging times mean longer trips are either out of the question or require more planning than with an internal combustion car. And while the infotainment system responds quickly to touch inputs and offers useful screens, it hasn't changed much in four years. Most notably, Tesla hasn't added any music apps beyond the ones it launched with. 
Along with new, useful apps, it would be nice to have some themes or other aesthetic changes to the infotainment interface.\n\nThe Model S 90D's base price of $88,000 puts it out of reach of the average buyer, and the model I drove was optioned up to around $95,000. Against its Audi, BMW and Mercedes-Benz competition, however, it makes a compelling argument, especially for its uncomplicated nature.\n\"\"\",\n    ]\n\n\ndef toothpaste_reviews():\n    return [\n        \"\"\"\nToothpaste can do more harm than good\n\nThe next time a patient innocently asks me, “What’s the best toothpaste to use?” I’m going to unleash a whole Chunky Soup can of “You Want The Truth? You CAN’T HANDLE THE TRUTH!!!” Gosh, that’s such an overused movie quote. Sorry about that, but still.\n\nIf you’re a dental professional, isn’t this the most annoying question you get, day after day? Do you even care which toothpaste your patients use?\n\nNo. You don’t. Asking a dentist what toothpaste to use is like asking your physician which bar of soap or body scrub you should use to clean your skin. Your dentist and dental hygienist have never seen a tube of toothpaste that singlehandedly improves the health of all patients in their practice, and the reason is simple:\n\nToothpaste is a cosmetic.\n\nWe brush our teeth so that out mouths no longer taste like… mouth. Mouth tastes gross, right? It tastes like putrefied skin. It tastes like tongue cheese. It tastes like Cream of Barf.\n\nOn the other hand, toothpaste has been exquisitely designed to bring you a brisk rush of York Peppermint Patty, or Triple Cinnamon Heaven, or whatever flavor that drives those tubes off of the shelves in the confusing dental aisle of your local supermarket or drugstore.\n\n\nToothpaste definitely tastes better than Cream of Barf. And that’s why you use it. Not because it’s good for you. 
You use toothpaste because it tastes good, and because it makes you accept your mouth as part of your face again.\n\nFrom a marketing perspective, all of the other things that are in your toothpaste are in there to give it additional perceived value. So let’s deconstruct these ingredients, shall we?\n\n\n1. Fluoride.\n\nThis was probably the first additive to toothpaste that brought it under the jurisdiction of the Food & Drug Administration and made toothpaste part drug, part cosmetic. Over time, a fluoride toothpaste can improve the strength of teeth, but the fluoride itself does nothing to make teeth cleaner. Some people are scared of fluoride so they don’t use it. Their choice. Professionally speaking, I know that the benefits of a fluoride additive far outweigh the risks.\n\n2. Foam.\n\nSodium Lauryl Sulfate is soap. Soap has a creamy, thick texture that American tongues especially like and equate to the feeling of cleanliness. There’s not enough surfactant, though, in toothpaste foam to break up the goo that grows on your teeth. If these bubbles scrubbed, you’d better believe that they would also scrub your delicate gum tissues into a bloody pulp.\n\n3. Abrasive particles.\n\nMost toothpastes use hydrated silica as the grit that polishes teeth. You’re probably most familiar with it as the clear beady stuff in the “Do Not Eat” packets. Depending on the size and shape of the particles, silica is the whitening ingredient in most whitening toothpastes. But whitening toothpaste cannot get your teeth any whiter than a professional dental cleaning, because it only cleans the surface. Two weeks to a whiter smile? How about 30 minutes with your hygienist? It’s much more efficient and less harsh.\n\n4. Desensitizers.\n\nTeeth that are sensitive to hot, cold, sweets, or a combination can benefit from the addition of potassium nitrate or stannous fluoride to a toothpaste. This is more of a palliative treatment, when the pain is the problem. 
Good old Time will usually make teeth feel better, too, unless the pain is coming from a cavity. Yeah, I’m talking to you, the person who is trying to heal the hole in their tooth with Sensodyne.\n\n5. Tartar control.\n\nIt burns! It burns! If your toothpaste has a particular biting flavor, it might contain tetrasodium pyrophosphate, an ingredient that is supposed to keep calcium phosphate salts (tartar, or calculus) from fossilizing on the back of your lower front teeth. A little tartar on your teeth doesn’t harm you unless it gets really thick and you can no longer keep it clean. One problem with tartar control toothpastes is that in order for the active ingredient to work, it has to be dissolved in a stronger detergent than usual, which can affect people that are sensitive to a high pH.\n\n6. Triclosan.\n\nThis antimicrobial is supposed to reduce infections between the gum and tooth. However, if you just keep the germs off of your teeth in the first place it’s pretty much a waste of an extra ingredient. Its safety has been questioned but, like fluoride, the bulk of the scientific research easily demonstrates that the addition of triclosan in toothpaste does much more good than harm.\n\nWhy toothpaste can be bad for you.\n\nLet’s just say it’s not the toothpaste’s fault. It’s yours. The toothpaste is just the co-dependent enabler. You’re the one with the problem.\n\nRemember, toothpaste is a cosmetic, first and foremost. It doesn’t clean your teeth by itself. Just in case you think I’m making this up I’ve included clinical studies in the references at the end of this article that show how ineffective toothpaste really is.\n\npeasized\n\n• You’re using too much.\n\nDon’t be so suggestible! Toothpaste ads show you how to use up the tube more quickly. Just use 1/3 as much, the size of a pea. It will still taste good, I promise! 
And too much foam can make you lose track of where your teeth actually are located.\n\n• You’re not taking enough time.\n\nAt least two minutes. Any less and you’re missing spots. Just ’cause it tastes better doesn’t mean you did a good job.\n\n• You’re not paying attention.\n\nI’ve seen people brush the same four spots for two minutes and miss the other 60% of their mouth.brushguide The toothbrush needs to touch every crevice of every tooth, not just where it lands when you go into autopilot and start thinking about what you’re going to wear that day. It’s the toothbrush friction that cleans your teeth, not the cleaning product. Plaque is a growth, like the pink or grey mildew that grows around the edges of your shower. You’ve gotta rub it off to get it off. No tooth cleaning liquid, paste, creme, gel, or powder is going to make as much of a difference as your attention to detail will.\n\nThe solution.\n\nUse what you like. It’s that simple. If it tastes good and feels clean to you, you’ll use it more often, brush longer, feel better, be healthier.\n\nYou can use baking soda, or coconut oil, or your favorite toothpaste, or even just plain water. The key is to have a good technique and to brush often. A music video makes this demonstration a little more fun than your usual lecture at the dental office, although, in my opinion you really still need to feel what it is like to MASH THE BRISTLES OF A SOFT TOOTHBRUSH INTO YOUR GUMS:\n\n\n\n\n\nA little more serious video from my pal Dr. Mark Burhenne where he demonstrates how to be careful with your toothbrush bristles:\n\n\nFinal word.\n\n♬ It’s all about that Bass, ’bout that Bass, no bubbles. ♬ Heh, dentistry in-joke there.\n\nSeriously, though, the bottom line is that your paste will mask brushing technique issues, so don’t put so much faith in the power of toothpaste.\n\nAlso you may have heard that some toothpastes contain decorative plastic that can get swallowed. 
Yeah, that was a DentalBuzz report I wrote that went viral earlier this year. And while I can’t claim total victory on that front, at least the company in question has promised that the plastic will no longer be added to their toothpaste lines very soon due to the overwhelming amount of letters, emails, and phone calls that they received as a result of people reading that article and making a difference.\n\nBut now I’m tired of talking about toothpaste.\n\nNext topic?\n\nI’m bringing pyorrhea back.\n    \"\"\",\n        \"\"\"\nI’ve been a user of Colgate Total Whitening Toothpaste for many years because I’ve always tried to maintain a healthy smile (I’m a receptionist so I need a white smile). But because I drink coffee at least twice a day (sometimes more!) and a lot of herbal teas, I’ve found that using just this toothpaste alone doesn’t really get my teeth white...\n\nThe best way to get white teeth is to really try some professional products specifically for tooth whitening. I’ve tried a few products, like Crest White Strips and found that the strips are really not as good as the trays. Although the Crest White Strips are easy to use, they really DO NOT cover your teeth perfectly like some other professional dental whitening kits. This Product did cover my teeth well however because of their custom heat trays, and whitening my teeth A LOT. I would say if you really want white teeth, use the Colgate Toothpaste and least 2 times a day, along side a professional Gel product like Shine Whitening.\n    \"\"\",\n        \"\"\"\nThe first feature is the price, and it is right.\n\nNext, I consider whether it will be neat to use. It is. Sometimes when I buy those new hard plastic containers, they actually get messy. Also I cannot get all the toothpaste out. It is easy to get the paste out of Colgate Total Whitening Paste without spraying it all over the cabinet.\n\nIf it does not taste good, I won't use it. 
Some toothpaste burns my mouth so bad that brushing my teeth is a painful experience. This one doesn't burn. It tastes simply the way toothpaste is supposed to taste.\n\nWhitening is important. This one is supposed ot whiten. After spending money to whiten my teeth, I need a product to help ward off the bad effects of coffee and tea.\n\nAvoiding all kinds of oral pathology is a major consideration. This toothpaste claims that it can help fight cavities, gingivitis, plaque, tartar, and bad breath.\n\nI hope this product stays on the market a long time and does not change.\n    \"\"\",\n    ]\n"
  },
  {
    "path": "featuretools/tests/utils_tests/__init__.py",
    "content": ""
  },
  {
    "path": "featuretools/tests/utils_tests/test_config.py",
    "content": "import logging\nimport os\n\nfrom featuretools.config_init import initialize_logging\n\nlogging_env_vars = {\n    \"FEATURETOOLS_LOG_LEVEL\": \"debug\",\n    \"FEATURETOOLS_ES_LOG_LEVEL\": \"critical\",\n    \"FEATURETOOLS_BACKEND_LOG_LEVEL\": \"error\",\n}\n\n\ndef test_logging_defaults():\n    old_env_vars = {}\n    for env_var in logging_env_vars:\n        old_env_vars[env_var] = os.environ.get(env_var, None)\n        if old_env_vars[env_var] is not None:\n            del os.environ[env_var]\n\n    initialize_logging()\n    main_logger = logging.getLogger(\"featuretools\")\n    assert main_logger.getEffectiveLevel() == logging.INFO\n    es_logger = logging.getLogger(\"featuretools.entityset\")\n    assert es_logger.getEffectiveLevel() == logging.INFO\n    backend_logger = logging.getLogger(\"featuretools.computation_backend\")\n    assert backend_logger.getEffectiveLevel() == logging.INFO\n\n    for env_var, value in old_env_vars.items():\n        if value is not None:\n            os.environ[env_var] = value\n\n\ndef test_logging_set_via_env():\n    old_env_vars = {}\n    for env_var, value in logging_env_vars.items():\n        old_env_vars[env_var] = os.environ.get(env_var, None)\n        os.environ[env_var] = value\n\n    initialize_logging()\n    main_logger = logging.getLogger(\"featuretools\")\n    assert main_logger.getEffectiveLevel() == logging.DEBUG\n    es_logger = logging.getLogger(\"featuretools.entityset\")\n    assert es_logger.getEffectiveLevel() == logging.CRITICAL\n    backend_logger = logging.getLogger(\"featuretools.computation_backend\")\n    assert backend_logger.getEffectiveLevel() == logging.ERROR\n\n    for env_var, value in old_env_vars.items():\n        if value is not None:\n            os.environ[env_var] = value\n"
  },
  {
    "path": "featuretools/tests/utils_tests/test_description_utils.py",
    "content": "from featuretools.utils.description_utils import convert_to_nth\n\n\ndef test_first():\n    assert convert_to_nth(1) == \"1st\"\n    assert convert_to_nth(21) == \"21st\"\n    assert convert_to_nth(131) == \"131st\"\n\n\ndef test_second():\n    assert convert_to_nth(2) == \"2nd\"\n    assert convert_to_nth(22) == \"22nd\"\n    assert convert_to_nth(232) == \"232nd\"\n\n\ndef test_third():\n    assert convert_to_nth(3) == \"3rd\"\n    assert convert_to_nth(23) == \"23rd\"\n    assert convert_to_nth(133) == \"133rd\"\n\n\ndef test_nth():\n    assert convert_to_nth(4) == \"4th\"\n    assert convert_to_nth(11) == \"11th\"\n    assert convert_to_nth(12) == \"12th\"\n    assert convert_to_nth(13) == \"13th\"\n    assert convert_to_nth(111) == \"111th\"\n    assert convert_to_nth(112) == \"112th\"\n    assert convert_to_nth(113) == \"113th\"\n"
  },
  {
    "path": "featuretools/tests/utils_tests/test_entry_point.py",
    "content": "import pandas as pd\nimport pytest\n\nfrom featuretools import dfs\n\n\n@pytest.fixture\ndef entry_points_dfs():\n    cards_df = pd.DataFrame({\"id\": [1, 2, 3, 4, 5]})\n    transactions_df = pd.DataFrame(\n        {\n            \"id\": [1, 2, 3, 4, 5, 6],\n            \"card_id\": [1, 2, 1, 3, 4, 5],\n            \"transaction_time\": [10, 12, 13, 20, 21, 20],\n            \"fraud\": [True, False, True, False, True, True],\n        },\n    )\n    return cards_df, transactions_df\n\n\nclass MockEntryPoint(object):\n    def on_call(self, kwargs):\n        self.kwargs = kwargs\n\n    def on_error(self, error, runtime):\n        self.error = error\n\n    def on_return(self, return_value, runtime):\n        self.return_value = return_value\n\n    def load(self):\n        return self\n\n    def __call__(self):\n        return self\n\n\nclass MockPkgResources(object):\n    def __init__(self, entry_point):\n        self.entry_point = entry_point\n\n    def iter_entry_points(self, name):\n        return [self.entry_point]\n\n\ndef test_entry_point(es, monkeypatch):\n    entry_point = MockEntryPoint()\n    # overrides a module used in the entry_point decorator for dfs\n    # so the decorator will use this mock entry point\n    monkeypatch.setitem(\n        dfs.__globals__[\"entry_point\"].__globals__,\n        \"pkg_resources\",\n        MockPkgResources(entry_point),\n    )\n    fm, fl = dfs(entityset=es, target_dataframe_name=\"customers\")\n    assert \"entityset\" in entry_point.kwargs.keys()\n    assert \"target_dataframe_name\" in entry_point.kwargs.keys()\n    assert (fm, fl) == entry_point.return_value\n\n\ndef test_entry_point_error(es, monkeypatch):\n    entry_point = MockEntryPoint()\n    monkeypatch.setitem(\n        dfs.__globals__[\"entry_point\"].__globals__,\n        \"pkg_resources\",\n        MockPkgResources(entry_point),\n    )\n    with pytest.raises(KeyError):\n        dfs(entityset=es, target_dataframe_name=\"missing_dataframe\")\n\n  
  assert isinstance(entry_point.error, KeyError)\n\n\ndef test_entry_point_detect_arg(monkeypatch, entry_points_dfs):\n    cards_df, transactions_df = entry_points_dfs\n    dataframes = {\n        \"cards\": (cards_df, \"id\"),\n        \"transactions\": (transactions_df, \"id\", \"transaction_time\"),\n    }\n    relationships = [(\"cards\", \"id\", \"transactions\", \"card_id\")]\n    entry_point = MockEntryPoint()\n    monkeypatch.setitem(\n        dfs.__globals__[\"entry_point\"].__globals__,\n        \"pkg_resources\",\n        MockPkgResources(entry_point),\n    )\n    fm, fl = dfs(dataframes, relationships, target_dataframe_name=\"cards\")\n    assert \"dataframes\" in entry_point.kwargs.keys()\n    assert \"relationships\" in entry_point.kwargs.keys()\n    assert \"target_dataframe_name\" in entry_point.kwargs.keys()\n"
  },
  {
    "path": "featuretools/tests/utils_tests/test_gen_utils.py",
    "content": "import pandas as pd\nimport pytest\nfrom woodwork import list_logical_types as ww_list_logical_types\nfrom woodwork import list_semantic_tags as ww_list_semantic_tags\n\nfrom featuretools import list_logical_types, list_semantic_tags\nfrom featuretools.utils.gen_utils import (\n    camel_and_title_to_snake,\n    import_or_none,\n    import_or_raise,\n)\n\n\ndef test_import_or_raise_errors():\n    with pytest.raises(ImportError, match=\"error message\"):\n        import_or_raise(\"_featuretools\", \"error message\")\n\n\ndef test_import_or_raise_imports():\n    math = import_or_raise(\"math\", \"error message\")\n    assert math.ceil(0.1) == 1\n\n\ndef test_import_or_none():\n    math = import_or_none(\"math\")\n    assert math.ceil(0.1) == 1\n\n    bad_lib = import_or_none(\"_featuretools\")\n    assert bad_lib is None\n\n\n@pytest.fixture\ndef df():\n    return pd.DataFrame({\"id\": range(5)})\n\n\ndef test_list_logical_types():\n    ft_ltypes = list_logical_types()\n    ww_ltypes = ww_list_logical_types()\n    assert ft_ltypes.equals(ww_ltypes)\n\n\ndef test_list_semantic_tags():\n    ft_semantic_tags = list_semantic_tags()\n    ww_semantic_tags = ww_list_semantic_tags()\n    assert ft_semantic_tags.equals(ww_semantic_tags)\n\n\ndef test_camel_and_title_to_snake():\n    assert camel_and_title_to_snake(\"Top3Words\") == \"top_3_words\"\n    assert camel_and_title_to_snake(\"top3Words\") == \"top_3_words\"\n    assert camel_and_title_to_snake(\"Top100Words\") == \"top_100_words\"\n    assert camel_and_title_to_snake(\"top100Words\") == \"top_100_words\"\n    assert camel_and_title_to_snake(\"Top41\") == \"top_41\"\n    assert camel_and_title_to_snake(\"top41\") == \"top_41\"\n    assert camel_and_title_to_snake(\"41TopWords\") == \"41_top_words\"\n    assert camel_and_title_to_snake(\"TopThreeWords\") == \"top_three_words\"\n    assert camel_and_title_to_snake(\"topThreeWords\") == \"top_three_words\"\n    assert 
camel_and_title_to_snake(\"top_three_words\") == \"top_three_words\"\n    assert camel_and_title_to_snake(\"over_65\") == \"over_65\"\n    assert camel_and_title_to_snake(\"65_and_over\") == \"65_and_over\"\n    assert camel_and_title_to_snake(\"USDValue\") == \"usd_value\"\n"
  },
  {
    "path": "featuretools/tests/utils_tests/test_recommend_primitives.py",
    "content": "import logging\n\nimport pandas as pd\nimport pytest\nfrom woodwork.logical_types import NaturalLanguage\nfrom woodwork.table_schema import ColumnSchema\n\nfrom featuretools import EntitySet\nfrom featuretools.primitives import Day, TransformPrimitive\nfrom featuretools.utils.recommend_primitives import (\n    DEFAULT_EXCLUDED_PRIMITIVES,\n    TIME_SERIES_PRIMITIVES,\n    _recommend_non_numeric_primitives,\n    _recommend_skew_numeric_primitives,\n    get_recommended_primitives,\n)\n\n\n@pytest.fixture\ndef moderate_right_skewed_df():\n    return pd.DataFrame(\n        {\"moderately right skewed\": [2, 3, 4, 4, 4, 5, 5, 7, 9, 11, 12, 13, 15]},\n    )\n\n\n@pytest.fixture\ndef heavy_right_skewed_df():\n    return pd.DataFrame(\n        {\"heavy right skewed\": [1, 1, 1, 1, 2, 2, 3, 3, 4, 5, 9, 11, 13]},\n    )\n\n\n@pytest.fixture\ndef left_skewed_df():\n    return pd.DataFrame(\n        {\"left skewed\": [2, 3, 4, 5, 7, 9, 11, 11, 11, 12, 12, 12, 13, 15]},\n    )\n\n\n@pytest.fixture\ndef skewed_df_zeros():\n    return pd.DataFrame({\"zeros\": [-1, 0, 0, 1, 2, 2, 3, 4, 5, 7, 9]})\n\n\n@pytest.fixture\ndef normal_df():\n    return pd.DataFrame({\"normal\": [2, 3, 4, 5, 5, 6, 6, 7, 7, 8, 9, 10, 11]})\n\n\n@pytest.fixture\ndef right_skew_moderate_and_heavy_df(moderate_right_skewed_df, heavy_right_skewed_df):\n    return pd.concat([moderate_right_skewed_df, heavy_right_skewed_df], axis=1)\n\n\n@pytest.fixture\ndef es_with_skewed_dfs(\n    moderate_right_skewed_df,\n    heavy_right_skewed_df,\n    left_skewed_df,\n    skewed_df_zeros,\n    normal_df,\n    right_skew_moderate_and_heavy_df,\n):\n    es = EntitySet()\n    es.add_dataframe(moderate_right_skewed_df, \"moderate_right_skewed_df\", \"id\")\n    es.add_dataframe(heavy_right_skewed_df, \"heavy_right_skewed_df\", \"id\")\n    es.add_dataframe(left_skewed_df, \"left_skewed_df\", \"id\")\n    es.add_dataframe(skewed_df_zeros, \"skewed_df_zeros\", \"id\")\n    es.add_dataframe(normal_df, 
\"normal_df\", \"id\")\n    es.add_dataframe(\n        right_skew_moderate_and_heavy_df,\n        \"right_skew_moderate_and_heavy_df\",\n        \"id\",\n    )\n    return es\n\n\ndef test_recommend_skew_numeric_primitives(es_with_skewed_dfs):\n    valid_skew_primtives = set([\"square_root\", \"natural_logarithm\"])\n    valid_prims = [\n        \"cosine\",\n        \"square_root\",\n        \"natural_logarithm\",\n        \"sine\",\n    ]\n    assert _recommend_skew_numeric_primitives(\n        es_with_skewed_dfs,\n        \"moderate_right_skewed_df\",\n        valid_prims,\n    ) == set([\"square_root\"])\n    assert _recommend_skew_numeric_primitives(\n        es_with_skewed_dfs,\n        \"heavy_right_skewed_df\",\n        valid_skew_primtives,\n    ) == set([\"natural_logarithm\"])\n    assert (\n        _recommend_skew_numeric_primitives(\n            es_with_skewed_dfs,\n            \"left_skewed_df\",\n            valid_skew_primtives,\n        )\n        == set()\n    )\n    assert (\n        _recommend_skew_numeric_primitives(\n            es_with_skewed_dfs,\n            \"skewed_df_zeros\",\n            valid_skew_primtives,\n        )\n        == set()\n    )\n    assert (\n        _recommend_skew_numeric_primitives(\n            es_with_skewed_dfs,\n            \"normal_df\",\n            valid_skew_primtives,\n        )\n        == set()\n    )\n    assert (\n        _recommend_skew_numeric_primitives(\n            es_with_skewed_dfs,\n            \"right_skew_moderate_and_heavy_df\",\n            valid_skew_primtives,\n        )\n        == valid_skew_primtives\n    )\n\n\ndef test_recommend_non_numeric_primitives(make_es):\n    ecom_es_customers = EntitySet()\n    ecom_es_customers.add_dataframe(make_es[\"customers\"])\n    valid_primitives = [\n        \"day\",\n        \"num_characters\",\n        \"natural_logarithm\",\n        \"sine\",\n    ]\n    actual_recommendations = _recommend_non_numeric_primitives(\n        ecom_es_customers,\n        
\"customers\",\n        valid_primitives,\n    )\n    expected_recommendations = set(\n        [\n            \"day\",\n            \"num_characters\",\n        ],\n    )\n    assert expected_recommendations == actual_recommendations\n\n\ndef test_recommend_skew_numeric_primitives_exception(make_es, caplog):\n    class MockExceptionPrimitive(TransformPrimitive):\n        \"\"\"Count the number of times the string value occurs.\"\"\"\n\n        name = \"mock_primitive_with_exception\"\n        input_types = [ColumnSchema(logical_type=NaturalLanguage)]\n        return_type = ColumnSchema(semantic_tags={\"numeric\"})\n\n        def get_function(self):\n            def make_exception(column):\n                raise Exception(\"this primitive has an exception\")\n\n            return make_exception\n\n    ecom_es_customers = EntitySet()\n    ecom_es_customers.add_dataframe(make_es[\"customers\"])\n    valid_primitives = [MockExceptionPrimitive(), Day()]\n    logger = logging.getLogger(\"featuretools\")\n    logger.propagate = True\n    actual_recommendations = _recommend_non_numeric_primitives(\n        ecom_es_customers,\n        \"customers\",\n        valid_primitives,\n    )\n    logger.propagate = False\n    expected_recommendations = set([\"day\"])\n    assert expected_recommendations == actual_recommendations\n    assert (\n        \"Exception with feature MOCK_PRIMITIVE_WITH_EXCEPTION(favorite_quote) with primitive mock_primitive_with_exception: this primitive has an exception\"\n        in caplog.text\n    )\n\n\ndef test_get_recommended_primitives_time_series(make_es):\n    ecom_es_log = EntitySet()\n    ecom_es_log.add_dataframe(make_es[\"log\"])\n    ecom_es_log[\"log\"].ww.set_time_index(\"datetime\")\n    actual_recommendations_ts = get_recommended_primitives(\n        ecom_es_log,\n        True,\n    )\n    for ts_prim in TIME_SERIES_PRIMITIVES:\n        assert ts_prim in actual_recommendations_ts\n\n\ndef test_get_recommended_primitives(make_es):\n    
ecom_es_customers = EntitySet()\n    ecom_es_customers.add_dataframe(make_es[\"customers\"])\n    actual_recommendations = get_recommended_primitives(\n        ecom_es_customers,\n        False,\n    )\n    expected_recommendations = [\n        \"day\",\n        \"num_characters\",\n        \"natural_logarithm\",\n        \"punctuation_count\",\n        \"mean_characters_per_word\",\n        \"is_weekend\",\n        \"whitespace_count\",\n        \"median_word_length\",\n        \"month\",\n        \"total_word_length\",\n        \"weekday\",\n        \"day_of_year\",\n        \"week\",\n        \"quarter\",\n        \"email_address_to_domain\",\n        \"number_of_common_words\",\n        \"num_words\",\n        \"num_unique_separators\",\n        \"age\",\n        \"year\",\n        \"is_leap_year\",\n        \"days_in_month\",\n        \"is_free_email_domain\",\n        \"number_of_unique_words\",\n    ]\n    for prim in expected_recommendations:\n        assert prim in actual_recommendations\n\n    for ts_prim in TIME_SERIES_PRIMITIVES:\n        assert ts_prim not in actual_recommendations\n\n\ndef test_get_recommended_primitives_exclude(make_es):\n    ecom_es_customers = EntitySet()\n    ecom_es_customers.add_dataframe(make_es[\"customers\"])\n    extra_exclude = [\"num_characters\", \"natural_logarithm\"]\n    prims_to_exclude = DEFAULT_EXCLUDED_PRIMITIVES + extra_exclude\n    actual_recommendations = get_recommended_primitives(\n        ecom_es_customers,\n        False,\n        prims_to_exclude,\n    )\n\n    for ex_prim in extra_exclude:\n        assert ex_prim not in actual_recommendations\n\n\ndef test_get_recommended_primitives_empty_es_error():\n    error_msg = \"No DataFrame in EntitySet found. 
Please add a DataFrame.\"\n    empty_es = EntitySet()\n    with pytest.raises(IndexError, match=error_msg):\n        get_recommended_primitives(\n            empty_es,\n            False,\n        )\n\n\ndef test_get_recommended_primitives_multi_table_es_error(make_es):\n    error_msg = \"Multi-table EntitySets are currently not supported. Please only use a single table EntitySet.\"\n    with pytest.raises(IndexError, match=error_msg):\n        get_recommended_primitives(\n            make_es,\n            False,\n        )\n"
  },
  {
    "path": "featuretools/tests/utils_tests/test_time_utils.py",
    "content": "from datetime import datetime, timedelta\nfrom itertools import chain\n\nimport numpy as np\nimport pandas as pd\nimport pytest\n\nfrom featuretools.utils import convert_time_units, make_temporal_cutoffs\nfrom featuretools.utils.time_utils import (\n    calculate_trend,\n    convert_datetime_to_floats,\n    convert_timedelta_to_floats,\n)\n\n\ndef test_make_temporal_cutoffs():\n    instance_ids = pd.Series(range(10))\n    cutoffs = pd.date_range(start=\"1/2/2015\", periods=10, freq=\"1d\")\n    temporal_cutoffs_by_nwindows = make_temporal_cutoffs(\n        instance_ids,\n        cutoffs,\n        window_size=\"1h\",\n        num_windows=2,\n    )\n\n    assert temporal_cutoffs_by_nwindows.shape[0] == 20\n    actual_instances = chain.from_iterable([[i, i] for i in range(10)])\n    actual_times = [\n        \"1/1/2015 23:00:00\",\n        \"1/2/2015 00:00:00\",\n        \"1/2/2015 23:00:00\",\n        \"1/3/2015 00:00:00\",\n        \"1/3/2015 23:00:00\",\n        \"1/4/2015 00:00:00\",\n        \"1/4/2015 23:00:00\",\n        \"1/5/2015 00:00:00\",\n        \"1/5/2015 23:00:00\",\n        \"1/6/2015 00:00:00\",\n        \"1/6/2015 23:00:00\",\n        \"1/7/2015 00:00:00\",\n        \"1/7/2015 23:00:00\",\n        \"1/8/2015 00:00:00\",\n        \"1/8/2015 23:00:00\",\n        \"1/9/2015 00:00:00\",\n        \"1/9/2015 23:00:00\",\n        \"1/10/2015 00:00:00\",\n        \"1/10/2015 23:00:00\",\n        \"1/11/2015 00:00:00\",\n        \"1/11/2015 23:00:00\",\n    ]\n    actual_times = [pd.Timestamp(c) for c in actual_times]\n\n    for computed, actual in zip(\n        temporal_cutoffs_by_nwindows[\"instance_id\"],\n        actual_instances,\n    ):\n        assert computed == actual\n    for computed, actual in zip(temporal_cutoffs_by_nwindows[\"time\"], actual_times):\n        assert computed == actual\n\n    cutoffs = [pd.Timestamp(\"1/2/2015\")] * 9 + [pd.Timestamp(\"1/3/2015\")]\n    starts = [pd.Timestamp(\"1/1/2015\")] * 9 + 
[pd.Timestamp(\"1/2/2015\")]\n    actual_times = [\"1/1/2015 00:00:00\", \"1/2/2015 00:00:00\"] * 9\n    actual_times += [\"1/2/2015 00:00:00\", \"1/3/2015 00:00:00\"]\n    actual_times = [pd.Timestamp(c) for c in actual_times]\n    temporal_cutoffs_by_wsz_start = make_temporal_cutoffs(\n        instance_ids,\n        cutoffs,\n        window_size=\"1d\",\n        start=starts,\n    )\n\n    for computed, actual in zip(\n        temporal_cutoffs_by_wsz_start[\"instance_id\"],\n        actual_instances,\n    ):\n        assert computed == actual\n    for computed, actual in zip(temporal_cutoffs_by_wsz_start[\"time\"], actual_times):\n        assert computed == actual\n\n    cutoffs = [pd.Timestamp(\"1/2/2015\")] * 9 + [pd.Timestamp(\"1/3/2015\")]\n    starts = [pd.Timestamp(\"1/1/2015\")] * 10\n    actual_times = [\"1/1/2015 00:00:00\", \"1/2/2015 00:00:00\"] * 9\n    actual_times += [\"1/1/2015 00:00:00\", \"1/3/2015 00:00:00\"]\n    actual_times = [pd.Timestamp(c) for c in actual_times]\n    temporal_cutoffs_by_nw_start = make_temporal_cutoffs(\n        instance_ids,\n        cutoffs,\n        num_windows=2,\n        start=starts,\n    )\n\n    for computed, actual in zip(\n        temporal_cutoffs_by_nw_start[\"instance_id\"],\n        actual_instances,\n    ):\n        assert computed == actual\n    for computed, actual in zip(temporal_cutoffs_by_nw_start[\"time\"], actual_times):\n        assert computed == actual\n\n\ndef test_convert_time_units():\n    units = {\n        \"years\": 31540000,\n        \"months\": 2628000,\n        \"days\": 86400,\n        \"hours\": 3600,\n        \"minutes\": 60,\n        \"seconds\": 1,\n        \"milliseconds\": 0.001,\n        \"nanoseconds\": 0.000000001,\n    }\n    for each in units:\n        assert convert_time_units(units[each] * 2, each) == 2\n        assert np.isclose(convert_time_units(float(units[each] * 2), each), 2)\n\n    error_text = \"Invalid unit given, make sure it is plural\"\n    with 
pytest.raises(ValueError, match=error_text):\n        convert_time_units(\"jnkwjgn\", 10)\n\n\n@pytest.mark.parametrize(\n    \"dt, expected_floats\",\n    [\n        (\n            pd.Series(\n                [\n                    datetime(2010, 1, 1, 11, 45, 0),\n                    datetime(2010, 1, 1, 12, 55, 15),\n                    datetime(2010, 1, 1, 11, 57, 30),\n                    datetime(2010, 1, 1, 11, 12),\n                    datetime(2010, 1, 1, 11, 12, 15),\n                ],\n            ),\n            pd.Series([21039105.0, 21039175.25, 21039117.5, 21039072.0, 21039072.25]),\n        ),\n        (\n            pd.Series(\n                list(pd.date_range(start=\"2017-01-01\", freq=\"1d\", periods=3))\n                + list(pd.date_range(start=\"2017-01-10\", freq=\"2d\", periods=4))\n                + list(pd.date_range(start=\"2017-01-22\", freq=\"1d\", periods=7)),\n            ),\n            pd.Series(\n                [\n                    17167.0,\n                    17168.0,\n                    17169.0,\n                    17176.0,\n                    17178.0,\n                    17180.0,\n                    17182.0,\n                    17188.0,\n                    17189.0,\n                    17190.0,\n                    17191.0,\n                    17192.0,\n                    17193.0,\n                    17194.0,\n                ],\n            ),\n        ),\n    ],\n)\ndef test_convert_datetime_floats(dt, expected_floats):\n    actual_floats = convert_datetime_to_floats(dt)\n    pd.testing.assert_series_equal(pd.Series(actual_floats), expected_floats)\n\n\n@pytest.mark.parametrize(\n    \"td, expected_floats\",\n    [\n        (\n            pd.Series(\n                [\n                    pd.Timedelta(2, \"day\"),\n                    pd.Timedelta(120000000),\n                    pd.Timedelta(48, \"sec\"),\n                    pd.Timedelta(30, \"min\"),\n                    pd.Timedelta(12, \"hour\"),\n       
         ],\n            ),\n            pd.Series(\n                [\n                    2.0,\n                    1.388888888888889e-06,\n                    0.0005555555555555556,\n                    0.020833333333333332,\n                    0.5,\n                ],\n            ),\n        ),\n        (\n            pd.Series(\n                [\n                    timedelta(days=4),\n                    timedelta(milliseconds=4000000),\n                    timedelta(hours=2, seconds=49),\n                ],\n            ),\n            pd.Series([4.0, 0.0462962962962963, 0.08390046296296297]),\n        ),\n    ],\n)\ndef test_convert_timedelta_to_floats(td, expected_floats):\n    actual_floats = convert_timedelta_to_floats(td)\n    pd.testing.assert_series_equal(pd.Series(actual_floats), expected_floats)\n\n\n@pytest.mark.parametrize(\n    \"series,expected_trends\",\n    [\n        (\n            # using datetimes\n            pd.Series(\n                data=[0, 5, 10],\n                index=[\n                    datetime(2019, 1, 1),\n                    datetime(2019, 1, 2),\n                    datetime(2019, 1, 3),\n                ],\n            ),\n            5.0,\n        ),\n        (\n            # using pd.Timestamp\n            pd.Series(\n                data=[0, -5, 3],\n                index=pd.date_range(start=\"2019-01-01\", freq=\"1D\", periods=3),\n            ),\n            1.4999999999999998,\n        ),\n        (\n            pd.Series(\n                data=[1, 2, 4, 8, 16],\n                index=pd.date_range(start=\"2019-01-01\", freq=\"1D\", periods=5),\n            ),\n            3.6000000000000005,\n        ),\n        (\n            # using pd.Timedelta with no change in time\n            pd.Series(\n                data=[1, 2, 3],\n                index=[\n                    pd.Timedelta(120000000),\n                    pd.Timedelta(120000000),\n                    pd.Timedelta(120000000),\n                ],\n      
      ),\n            0,\n        ),\n    ],\n)\ndef test_calculate_trend(series, expected_trends):\n    actual_trends = calculate_trend(series)\n    assert np.isclose(actual_trends, expected_trends)\n"
  },
  {
    "path": "featuretools/tests/utils_tests/test_trie.py",
    "content": "from featuretools.utils import Trie\n\n\ndef test_get_node():\n    t = Trie(default=lambda: \"default\")\n\n    t.get_node([1, 2, 3]).value = \"123\"\n    t.get_node([1, 2, 4]).value = \"124\"\n    sub = t.get_node([1, 2])\n    assert sub.get_node([3]).value == \"123\"\n    assert sub.get_node([4]).value == \"124\"\n\n    sub.get_node([4, 5]).value = \"1245\"\n    assert t.get_node([1, 2, 4, 5]).value == \"1245\"\n\n\ndef test_setting_and_getting():\n    t = Trie(default=lambda: \"default\")\n    assert t.get_node([1, 2, 3]).value == \"default\"\n\n    t.get_node([1, 2, 3]).value = \"123\"\n    t.get_node([1, 2, 4]).value = \"124\"\n    assert t.get_node([1, 2, 3]).value == \"123\"\n    assert t.get_node([1, 2, 4]).value == \"124\"\n\n    assert t.get_node([1]).value == \"default\"\n    t.get_node([1]).value = \"1\"\n    assert t.get_node([1]).value == \"1\"\n\n    t.get_node([1, 2, 3]).value = \"updated\"\n    assert t.get_node([1, 2, 3]).value == \"updated\"\n\n\ndef test_iteration():\n    t = Trie(default=lambda: \"default\", path_constructor=tuple)\n\n    t.get_node((1, 2, 3)).value = \"123\"\n    t.get_node((1, 2, 4)).value = \"124\"\n    expected = [\n        ((), \"default\"),\n        ((1,), \"default\"),\n        ((1, 2), \"default\"),\n        ((1, 2, 3), \"123\"),\n        ((1, 2, 4), \"124\"),\n    ]\n\n    for i, value in enumerate(t):\n        assert value == expected[i]\n"
  },
  {
    "path": "featuretools/tests/utils_tests/test_utils_info.py",
    "content": "import os\n\nimport pytest\n\nfrom featuretools import __version__\nfrom featuretools.utils import (\n    get_featuretools_root,\n    get_installed_packages,\n    get_sys_info,\n    show_info,\n)\n\n\n@pytest.fixture\ndef this_dir():\n    return os.path.dirname(os.path.abspath(__file__))\n\n\ndef test_show_info(capsys):\n    show_info()\n    captured = capsys.readouterr()\n    assert \"Featuretools version\" in captured.out\n    assert \"Featuretools installation directory:\" in captured.out\n    assert __version__ in captured.out\n    assert \"SYSTEM INFO\" in captured.out\n\n\ndef test_sys_info():\n    sys_info = get_sys_info()\n    info_keys = [\n        \"python\",\n        \"python-bits\",\n        \"OS\",\n        \"OS-release\",\n        \"machine\",\n        \"processor\",\n        \"byteorder\",\n        \"LC_ALL\",\n        \"LANG\",\n        \"LOCALE\",\n    ]\n    found_keys = [k for k, _ in sys_info]\n    assert set(info_keys).issubset(found_keys)\n\n\ndef test_installed_packages():\n    installed_packages = get_installed_packages()\n    # Per PEP 426, package names are case insensitive\n    # Underscore and hyphen are equivalent\n    installed_set = {\n        name.lower().replace(\"-\", \"_\") for name in installed_packages.keys()\n    }\n    requirements = [\n        \"pandas\",\n        \"numpy\",\n        \"tqdm\",\n        \"cloudpickle\",\n        \"psutil\",\n    ]\n    assert set(requirements).issubset(installed_set)\n\n\ndef test_get_featuretools_root(this_dir):\n    root = os.path.abspath(os.path.join(this_dir, \"..\", \"..\"))\n    assert get_featuretools_root() == root\n"
  },
  {
    "path": "featuretools/utils/__init__.py",
    "content": "# flake8: noqa\nfrom featuretools.utils.api import *\n"
  },
  {
    "path": "featuretools/utils/api.py",
    "content": "# flake8: noqa\nfrom featuretools.utils.entry_point import entry_point\nfrom featuretools.utils.gen_utils import make_tqdm_iterator\nfrom featuretools.utils.time_utils import (\n    calculate_trend,\n    convert_time_units,\n    make_temporal_cutoffs,\n)\nfrom featuretools.utils.trie import Trie\nfrom featuretools.utils.utils_info import (\n    get_featuretools_root,\n    get_installed_packages,\n    get_sys_info,\n    show_info,\n)\n"
  },
  {
    "path": "featuretools/utils/common_tld_utils.py",
    "content": "# put longer TLDs first to avoid catching a small part of a longer TLD and escape periods\nCOMMON_TLDS = [\n    \"management\",\n    \"technology\",\n    \"solutions\",\n    \"delivery\",\n    \"services\",\n    \"software\",\n    \"digital\",\n    \"finance\",\n    \"monster\",\n    \"network\",\n    \"support\",\n    \"systems\",\n    \"website\",\n    \"agency\",\n    \"design\",\n    \"events\",\n    \"global\",\n    \"health\",\n    \"online\",\n    \"stream\",\n    \"studio\",\n    \"travel\",\n    \"apple\",\n    \"click\",\n    \"cloud\",\n    \"email\",\n    \"games\",\n    \"group\",\n    \"media\",\n    \"ninja\",\n    \"press\",\n    \"rocks\",\n    \"space\",\n    \"store\",\n    \"today\",\n    \"tools\",\n    \"video\",\n    \"works\",\n    \"world\",\n    \"aero\",\n    \"arpa\",\n    \"asia\",\n    \"bank\",\n    \"best\",\n    \"blog\",\n    \"buzz\",\n    \"care\",\n    \"casa\",\n    \"chat\",\n    \"club\",\n    \"coop\",\n    \"cyou\",\n    \"desi\",\n    \"farm\",\n    \"goog\",\n    \"guru\",\n    \"host\",\n    \"info\",\n    \"jobs\",\n    \"life\",\n    \"link\",\n    \"live\",\n    \"mobi\",\n    \"name\",\n    \"news\",\n    \"page\",\n    \"plus\",\n    \"shop\",\n    \"site\",\n    \"team\",\n    \"tech\",\n    \"work\",\n    \"zone\",\n    \"app\",\n    \"aws\",\n    \"bid\",\n    \"biz\",\n    \"box\",\n    \"cam\",\n    \"cat\",\n    \"com\",\n    \"dev\",\n    \"edu\",\n    \"eus\",\n    \"fun\",\n    \"gov\",\n    \"icu\",\n    \"int\",\n    \"ltd\",\n    \"mil\",\n    \"net\",\n    \"nyc\",\n    \"one\",\n    \"onl\",\n    \"org\",\n    \"ovh\",\n    \"pro\",\n    \"pub\",\n    \"run\",\n    \"sap\",\n    \"top\",\n    \"vip\",\n    \"win\",\n    \"xxx\",\n    \"xyz\",\n    \"ac\",\n    \"ad\",\n    \"ae\",\n    \"ag\",\n    \"ai\",\n    \"al\",\n    \"am\",\n    \"ar\",\n    \"at\",\n    \"au\",\n    \"az\",\n    \"ba\",\n    \"bd\",\n    \"be\",\n    \"bg\",\n    \"br\",\n    \"by\",\n    \"bz\",\n    \"ca\",\n 
   \"cc\",\n    \"cf\",\n    \"ch\",\n    \"cl\",\n    \"cm\",\n    \"cn\",\n    \"co\",\n    \"cr\",\n    \"cu\",\n    \"cx\",\n    \"cy\",\n    \"cz\",\n    \"de\",\n    \"dk\",\n    \"do\",\n    \"ec\",\n    \"ee\",\n    \"eg\",\n    \"es\",\n    \"eu\",\n    \"fi\",\n    \"fm\",\n    \"fr\",\n    \"ga\",\n    \"ge\",\n    \"gg\",\n    \"gl\",\n    \"gq\",\n    \"gr\",\n    \"gs\",\n    \"gt\",\n    \"hk\",\n    \"hn\",\n    \"hr\",\n    \"hu\",\n    \"id\",\n    \"ie\",\n    \"il\",\n    \"im\",\n    \"in\",\n    \"io\",\n    \"ir\",\n    \"is\",\n    \"it\",\n    \"jo\",\n    \"jp\",\n    \"ke\",\n    \"kh\",\n    \"ki\",\n    \"kr\",\n    \"kw\",\n    \"kz\",\n    \"la\",\n    \"lb\",\n    \"li\",\n    \"lk\",\n    \"lt\",\n    \"lu\",\n    \"lv\",\n    \"ly\",\n    \"ma\",\n    \"md\",\n    \"me\",\n    \"mk\",\n    \"ml\",\n    \"mm\",\n    \"mn\",\n    \"ms\",\n    \"mu\",\n    \"mx\",\n    \"my\",\n    \"nf\",\n    \"ng\",\n    \"nl\",\n    \"no\",\n    \"np\",\n    \"nu\",\n    \"nz\",\n    \"om\",\n    \"pa\",\n    \"pe\",\n    \"ph\",\n    \"pk\",\n    \"pl\",\n    \"pr\",\n    \"ps\",\n    \"pt\",\n    \"pw\",\n    \"py\",\n    \"qa\",\n    \"re\",\n    \"ro\",\n    \"rs\",\n    \"ru\",\n    \"sa\",\n    \"sc\",\n    \"se\",\n    \"sg\",\n    \"sh\",\n    \"si\",\n    \"sk\",\n    \"so\",\n    \"st\",\n    \"su\",\n    \"sv\",\n    \"sx\",\n    \"th\",\n    \"tj\",\n    \"tk\",\n    \"tn\",\n    \"to\",\n    \"tr\",\n    \"tt\",\n    \"tv\",\n    \"tw\",\n    \"ua\",\n    \"ug\",\n    \"uk\",\n    \"us\",\n    \"uy\",\n    \"vc\",\n    \"ve\",\n    \"vn\",\n    \"ws\",\n    \"za\",\n]\n"
  },
  {
    "path": "featuretools/utils/description_utils.py",
    "content": "def convert_to_nth(integer):\n    string_nth = str(integer)\n    end_int = integer % 10\n    if end_int == 1 and integer % 100 != 11:\n        return str(integer) + \"st\"\n    elif end_int == 2 and integer % 100 != 12:\n        return str(string_nth) + \"nd\"\n    elif end_int == 3 and integer % 100 != 13:\n        return str(string_nth) + \"rd\"\n    else:\n        return str(string_nth) + \"th\"\n"
  },
  {
    "path": "featuretools/utils/entry_point.py",
    "content": "import time\nfrom functools import wraps\nfrom inspect import signature\n\nimport pkg_resources\n\n\ndef entry_point(name):\n    def inner_function(func):\n        @wraps(func)\n        def function_wrapper(*args, **kwargs):\n            \"\"\"function_wrapper of greeting\"\"\"\n            # add positional args as named kwargs\n            on_call_kwargs = kwargs.copy()\n            sig = signature(func)\n            for arg, parameter in zip(args, sig.parameters):\n                on_call_kwargs[parameter] = arg\n\n            # collect and initialize all registered entry points\n            entry_points = []\n            for entry_point in pkg_resources.iter_entry_points(name):\n                entry_point = entry_point.load()\n                entry_points.append(entry_point())\n\n            # send arguments before function is called\n            for ep in entry_points:\n                ep.on_call(on_call_kwargs)\n\n            try:\n                # call function\n                start = time.time()\n                return_value = func(*args, **kwargs)\n                runtime = time.time() - start\n            except Exception as e:\n                runtime = time.time() - start\n                # send error\n                for ep in entry_points:\n                    ep.on_error(error=e, runtime=runtime)\n                raise e\n\n            # send return value\n            for ep in entry_points:\n                ep.on_return(return_value=return_value, runtime=runtime)\n\n            return return_value\n\n        return function_wrapper\n\n    return inner_function\n"
  },
  {
    "path": "featuretools/utils/gen_utils.py",
    "content": "import importlib\nimport logging\nimport re\nimport sys\n\nfrom tqdm import tqdm\n\nlogger = logging.getLogger(\"featuretools.utils\")\n\n\ndef make_tqdm_iterator(**kwargs):\n    options = {\"file\": sys.stdout, \"leave\": True}\n    options.update(kwargs)\n    return tqdm(**options)\n\n\ndef get_relationship_column_id(path):\n    _, r = path[0]\n    child_link_name = r._child_column_name\n    for _, r in path[1:]:\n        parent_link_name = child_link_name\n        child_link_name = \"%s.%s\" % (r.parent_name, parent_link_name)\n    return child_link_name\n\n\ndef find_descendents(cls):\n    \"\"\"\n    A generator which yields all descendent classes of the given class\n    (including the given class)\n\n    Args:\n        cls (Class): the class to find descendents of\n    \"\"\"\n    yield cls\n    for sub in cls.__subclasses__():\n        for c in find_descendents(sub):\n            yield c\n\n\ndef import_or_raise(library, error_msg):\n    \"\"\"\n    Attempts to import the requested library.  If the import fails, raises an\n    ImportErorr with the supplied\n\n    Args:\n        library (str): the name of the library\n        error_msg (str): error message to return if the import fails\n    \"\"\"\n    try:\n        return importlib.import_module(library)\n    except ImportError:\n        raise ImportError(error_msg)\n\n\ndef import_or_none(library):\n    \"\"\"\n    Attemps to import the requested library.\n\n    Args:\n        library (str): the name of the library\n    Returns: the library if it is installed, else None\n    \"\"\"\n    try:\n        return importlib.import_module(library)\n    except ImportError:\n        return None\n\n\ndef camel_and_title_to_snake(name):\n    name = re.sub(r\"([^_\\d]+)(\\d+)\", r\"\\1_\\2\", name)\n    name = re.sub(\"(.)([A-Z][a-z]+)\", r\"\\1_\\2\", name)\n    return re.sub(\"([a-z0-9])([A-Z])\", r\"\\1_\\2\", name).lower()\n"
  },
  {
    "path": "featuretools/utils/plot_utils.py",
    "content": "from featuretools.utils.gen_utils import import_or_raise\n\n\ndef check_graphviz():\n    GRAPHVIZ_ERR_MSG = (\n        \"Please install graphviz to plot.\"\n        + \" (See https://featuretools.alteryx.com/en/stable/install.html#installing-graphviz for\"\n        + \" details)\"\n    )\n    graphviz = import_or_raise(\"graphviz\", GRAPHVIZ_ERR_MSG)\n    # Try rendering a dummy graph to see if a working backend is installed\n    try:\n        graphviz.Digraph().pipe(format=\"svg\")\n    except graphviz.backend.ExecutableNotFound:\n        raise RuntimeError(\n            \"To plot entity sets, a graphviz backend is required.\\n\"\n            + \"Install the backend using one of the following commands:\\n\"\n            + \"  Mac OS: brew install graphviz\\n\"\n            + \"  Linux (Ubuntu): $ sudo apt install graphviz\\n\"\n            + \"  Windows (conda): conda install -c conda-forge python-graphviz\\n\"\n            + \"  Windows (pip): pip install graphviz\\n\"\n            + \"  Windows (EXE required if graphviz was installed via pip): https://graphviz.org/download/#windows\"\n            + \"  For more details visit: https://featuretools.alteryx.com/en/stable/install.html#installing-graphviz\",\n        )\n    return graphviz\n\n\ndef get_graphviz_format(graphviz, to_file):\n    if to_file:\n        # Explicitly cast to str in case a Path object was passed in\n        to_file = str(to_file)\n\n        split_path = to_file.split(\".\")\n        if len(split_path) < 2:\n            raise ValueError(\n                \"Please use a file extension like '.pdf'\"\n                + \" so that the format can be inferred\",\n            )\n\n        format_ = split_path[-1]\n        valid_formats = graphviz.FORMATS\n        if format_ not in valid_formats:\n            raise ValueError(\n                \"Unknown format. 
Make sure your format is\"\n                + \" amongst the following: %s\" % valid_formats,\n            )\n    else:\n        format_ = None\n    return format_\n\n\ndef save_graph(graph, to_file, format_):\n    # Graphviz always appends the format to the file name, so we need to\n    # remove it manually to avoid file names like 'file_name.pdf.pdf'\n    offset = len(format_) + 1  # Add 1 for the dot\n    output_path = to_file[:-offset]\n    graph.render(output_path, cleanup=True)\n"
  },
  {
    "path": "featuretools/utils/recommend_primitives.py",
    "content": "import logging\nfrom typing import List\n\nfrom featuretools.computational_backends import calculate_feature_matrix\nfrom featuretools.entityset import EntitySet\nfrom featuretools.primitives.utils import get_transform_primitives\nfrom featuretools.synthesis import dfs, get_valid_primitives\n\nORDERED_PRIMITIVES = [  # non-numeric primitives that require specific ordering or a time index to be set\n    \"cum_count\",\n    \"cumulative_time_since_last_false\",\n    \"cumulative_time_since_last_true\",\n    \"diff\",\n    \"diff_datetime\",\n    \"is_first_occurrence\",\n    \"is_last_occurrence\",\n    \"time_since_previous\",\n]\n\n\nDEPRECATED_PRIMITIVES = [\n    \"multiply_boolean\",  # functionality duplicated by 'and' primitive\n    \"numeric_lag\",  # deprecated and replaced with `lag`\n]\n\nREQUIRED_INPUT_PRIMITIVES = [  # non-numeric primitives that require input\n    \"count_string\",\n    \"distance_to_holiday\",\n    \"is_in_geobox\",\n    \"not_equal_scalar\",\n    \"equal_scalar\",\n    \"time_since\",\n    \"isin\",\n]\n\nOTHER_PRIMITIVES_TO_EXCLUDE = [  # Excluding some primitives that can produce too many features or aren't useful in extracting information\n    \"not\",\n    \"and\",\n    \"or\",\n    \"equal\",\n    \"not_equal\",\n]\n\nDEFAULT_EXCLUDED_PRIMITIVES = (\n    REQUIRED_INPUT_PRIMITIVES\n    + DEPRECATED_PRIMITIVES\n    + ORDERED_PRIMITIVES\n    + OTHER_PRIMITIVES_TO_EXCLUDE\n)\n\n# TODO: Make this list more dynamic\nTIME_SERIES_PRIMITIVES = [\n    \"expanding_count\",\n    \"expanding_max\",\n    \"expanding_mean\",\n    \"expanding_min\",\n    \"expanding_std\",\n    \"expanding_trend\",\n    \"lag\",\n    \"rolling_count\",\n    \"rolling_outlier_count\",\n    \"rolling_max\",\n    \"rolling_mean\",\n    \"rolling_min\",\n    \"rolling_std\",\n    \"rolling_trend\",\n]\n\n\n# TODO: Support multi-table\ndef get_recommended_primitives(\n    entityset: EntitySet,\n    include_time_series_primitives: bool = False,\n    
excluded_primitives: List[str] = DEFAULT_EXCLUDED_PRIMITIVES,\n) -> List[str]:\n    \"\"\"Get a list of recommended primitives given an entity set.\n\n    Description:\n        This function works by first getting a list of valid primitives that could be applied to a single-table EntitySet, withholding any primitives specified in `excluded_primitives`.\n        Secondly, engineered features are created for non-numeric fields and are checked for variation. If a feature takes more than one distinct value, its primitive is added to the recommendation list.\n        Then, numeric fields are checked for skewness. Depending on how skewed a column is, `square_root` or `natural_logarithm` will be recommended.\n        Lastly, if `include_time_series_primitives` is specified as `True`, `Lag` will always be recommended,\n        as well as all Rolling and Expanding primitives if numeric columns are present.\n\n    Args:\n        entityset (EntitySet): EntitySet that only contains one dataframe.\n        include_time_series_primitives (bool): Whether or not time-series primitives should be considered. Defaults to False.\n        excluded_primitives (List[str]): List of transform primitives to exclude from recommendations. Defaults to DEFAULT_EXCLUDED_PRIMITIVES.\n\n    Note:\n        The main objective of this function is to recommend primitives that could potentially provide important features to the modeling process.\n        Non-numeric primitives mainly serve as a way to extract information from origin features that may essentially be meaningless by themselves (e.g., NaturalLanguage, Datetime, LatLong).\n        That is why they are the main focus of this function. Numeric transform primitives are very case-by-case dependent and therefore it is hard to mathematically quantify which should be recommended.\n        Therefore, only transform primitives that address skewed numeric columns are included, as this is a standard and quantifiable transformation step. 
The only exception to this rule is\n        for time series problems. Because there are so few primitives that are only applicable for time series, all of them are included in the recommended primitives list.\n\n    Note:\n        This function currently only works for single-table EntitySets and will only recommend transform primitives.\n    \"\"\"\n    es_dataframe_list = entityset.dataframes\n    if len(es_dataframe_list) == 0:\n        raise IndexError(\"No DataFrame in EntitySet found. Please add a DataFrame.\")\n    if len(es_dataframe_list) > 1:\n        raise IndexError(\n            \"Multi-table EntitySets are currently not supported. Please only use a single table EntitySet.\",\n        )\n\n    target_dataframe_name = es_dataframe_list[0].ww.name\n\n    recommended_primitives = set()\n\n    if not include_time_series_primitives:\n        # Build a new list rather than using `+=`, which would mutate the\n        # caller's list (or the shared DEFAULT_EXCLUDED_PRIMITIVES) in place\n        excluded_primitives = excluded_primitives + TIME_SERIES_PRIMITIVES\n\n    all_trans_primitives = get_transform_primitives()\n    selected_trans_primitives = [\n        p for name, p in all_trans_primitives.items() if name not in excluded_primitives\n    ]\n\n    valid_primitive_names = [\n        prim.name\n        for prim in get_valid_primitives(\n            entityset,\n            target_dataframe_name,\n            1,\n            selected_trans_primitives,\n        )[1]\n    ]\n\n    recommended_primitives.update(\n        _recommend_non_numeric_primitives(\n            entityset,\n            target_dataframe_name,\n            valid_primitive_names,\n        ),\n    )\n\n    recommended_primitives.update(\n        _recommend_skew_numeric_primitives(\n            entityset,\n            target_dataframe_name,\n            valid_primitive_names,\n        ),\n    )\n\n    recommended_primitives.update(\n        set(TIME_SERIES_PRIMITIVES).intersection(\n            valid_primitive_names,\n        ),\n    )\n    return list(recommended_primitives)\n\n\ndef _recommend_non_numeric_primitives(\n    entityset: EntitySet,\n    target_dataframe_name: 
str,\n    valid_primitives: List[str],\n) -> set:\n    \"\"\"Get a set of non-numeric primitives for a given dataset and a list of primitives.\n\n    Description:\n        Given a single table entity set with a `target_dataframe_name` and an applicable list of `valid_primitives`,\n        get a set of primitives which produce non-unique features.\n\n    Args:\n        entityset (EntitySet): EntitySet that only contains one dataframe.\n        target_dataframe_name (str): Name of target dataframe to access in `entityset`.\n        valid_primitives (List[str]): List of primitives to calculate and check output features.\n    \"\"\"\n\n    recommended_non_numeric_primitives: set[str] = set()\n    # Only want to run feature generation on non numeric primitives\n    numeric_columns_to_ignore = list(\n        entityset[target_dataframe_name]\n        .ww.select(include=\"numeric\", return_schema=True)\n        .columns,\n    )\n    features = dfs(\n        entityset=entityset,\n        target_dataframe_name=target_dataframe_name,\n        trans_primitives=valid_primitives,\n        max_depth=1,\n        features_only=True,\n        ignore_columns={target_dataframe_name: numeric_columns_to_ignore},\n    )\n\n    for f in features:\n        if (\n            f.primitive.name is not None\n            and f.primitive.name not in recommended_non_numeric_primitives\n        ):\n            try:\n                matrix = calculate_feature_matrix([f], entityset)\n                for f_name in f.get_feature_names():\n                    if len(matrix[f_name].unique()) > 1:\n                        recommended_non_numeric_primitives.add(f.primitive.name)\n            except (\n                Exception\n            ) as e:  # If error in calculating feature matrix pass on the recommendation\n                logger = logging.getLogger(\"featuretools\")\n                logger.error(\n                    f\"Exception with feature {f.get_name()} with primitive {f.primitive.name}: 
{str(e)}\",\n                )\n\n    return recommended_non_numeric_primitives\n\n\ndef _recommend_skew_numeric_primitives(\n    entityset: EntitySet,\n    target_dataframe_name: str,\n    valid_primitives: List[str],\n) -> set:\n    \"\"\"Get a set of recommended skew numeric primitives given an entity set.\n\n    Description:\n        Given a single table entity set with a `target_dataframe_name` and an applicable list of `valid_primitives`,\n        inspect the columns with the `numeric` semantic tag and get a set of primitives which could be applied to address right skewness.\n\n    Args:\n        entityset (EntitySet): EntitySet that only contains one dataframe.\n        target_dataframe_name (str): Name of target dataframe to access in `entityset`.\n        valid_primitives (List[str]): List of primitives to compare.\n\n    Note:\n        We currently only have primitives to address right skewness.\n    \"\"\"\n    recommended_skew_primitives: set[str] = set()\n    skew_numeric_primitives = {\"square_root\", \"natural_logarithm\"}\n    valid_skew_primitives = skew_numeric_primitives.intersection(valid_primitives)\n    if valid_skew_primitives:\n        numerics_only_df = entityset[target_dataframe_name].ww.select(\"numeric\")\n        for col in numerics_only_df:\n            # Shouldn't recommend log or sqrt if NaNs, zeros, or negative numbers are present\n            contains_nan = numerics_only_df[col].isnull().any()\n            all_above_zero = (numerics_only_df[col] > 0).all()\n            if all_above_zero and not contains_nan:\n                skew = numerics_only_df[col].skew()\n                # We currently don't have anything in featuretools to automatically handle left skewed data as well as skewed data with negative values\n                if skew > 0.5 and skew < 1 and \"square_root\" in valid_skew_primitives:\n                    recommended_skew_primitives.add(\"square_root\")\n                    # TODO: 
Add Box Cox here when available\n                if skew > 1 and \"natural_logarithm\" in valid_skew_primitives:\n                    recommended_skew_primitives.add(\"natural_logarithm\")\n                    # TODO: Add log base 10 transform primitive when available\n    return recommended_skew_primitives\n"
  },
  {
    "path": "featuretools/utils/s3_utils.py",
    "content": "import json\nimport shutil\n\nfrom featuretools.utils.gen_utils import import_or_raise\n\n\ndef use_smartopen_es(file_path, path, transport_params=None, read=True):\n    open = import_or_raise(\"smart_open\", SMART_OPEN_ERR_MSG).open\n    if read:\n        with open(path, \"rb\", transport_params=transport_params) as fin:\n            with open(file_path, \"wb\") as fout:\n                shutil.copyfileobj(fin, fout)\n    else:\n        with open(file_path, \"rb\") as fin:\n            with open(path, \"wb\", transport_params=transport_params) as fout:\n                shutil.copyfileobj(fin, fout)\n\n\ndef use_smartopen_features(path, features_dict=None, transport_params=None, read=True):\n    open = import_or_raise(\"smart_open\", SMART_OPEN_ERR_MSG).open\n    if read:\n        with open(path, \"r\", encoding=\"utf-8\", transport_params=transport_params) as f:\n            features_dict = json.load(f)\n            return features_dict\n    else:\n        with open(path, \"w\", transport_params=transport_params) as f:\n            json.dump(features_dict, f)\n\n\ndef get_transport_params(profile_name):\n    boto3 = import_or_raise(\"boto3\", BOTO3_ERR_MSG)\n    UNSIGNED = import_or_raise(\"botocore\", BOTOCORE_ERR_MSG).UNSIGNED\n    Config = import_or_raise(\"botocore.config\", BOTOCORE_ERR_MSG).Config\n\n    if isinstance(profile_name, str):\n        session = boto3.Session(profile_name=profile_name)\n        transport_params = {\"client\": session.client(\"s3\")}\n    elif profile_name is False or boto3.Session().get_credentials() is None:\n        session = boto3.Session()\n        client = session.client(\"s3\", config=Config(signature_version=UNSIGNED))\n        transport_params = {\"client\": client}\n    else:\n        transport_params = None\n    return transport_params\n\n\nBOTO3_ERR_MSG = (\n    \"The boto3 library is required to read and write from URLs and S3.\\n\"\n    \"Install via pip:\\n\"\n    \"    pip install boto3\\n\"\n    
\"Install via conda:\\n\"\n    \"    conda install -c conda-forge boto3\"\n)\nBOTOCORE_ERR_MSG = (\n    \"The botocore library is required to read and write from URLs and S3.\\n\"\n    \"Install via pip:\\n\"\n    \"    pip install botocore\\n\"\n    \"Install via conda:\\n\"\n    \"    conda install -c conda-forge botocore\"\n)\nSMART_OPEN_ERR_MSG = (\n    \"The smart_open library is required to read and write from URLs and S3.\\n\"\n    \"Install via pip:\\n\"\n    \"    pip install 'smart-open>=5.0.0'\\n\"\n    \"Install via conda:\\n\"\n    \"    conda install -c conda-forge 'smart_open>=5.0.0'\"\n)\n"
  },
  {
    "path": "featuretools/utils/schema_utils.py",
    "content": "import logging\nimport warnings\n\nfrom packaging.version import parse\n\nfrom featuretools.version import ENTITYSET_SCHEMA_VERSION, FEATURES_SCHEMA_VERSION\n\nlogger = logging.getLogger(\"featuretools.utils\")\n\n\ndef check_schema_version(cls, cls_type):\n    \"\"\"\n    If the saved schema version is newer than the current featuretools\n    schema version, this function will output a warning saying so.\n\n    If the saved schema version is a major release or more behind\n    the current featuretools schema version, this function will log\n    a message saying so.\n    \"\"\"\n    if isinstance(cls_type, str):\n        current = None\n        saved = None\n        if cls_type == \"entityset\":\n            current = ENTITYSET_SCHEMA_VERSION\n            saved = cls.get(\"schema_version\")\n        elif cls_type == \"features\":\n            current = FEATURES_SCHEMA_VERSION\n            saved = cls.features_dict[\"schema_version\"]\n\n        if parse(current) < parse(saved):\n            warning_text_upgrade = (\n                \"The schema version of the saved %s\"\n                \"(%s) is greater than the latest supported (%s). \"\n                \"You may need to upgrade featuretools. Attempting to load %s ...\"\n                % (cls_type, saved, current, cls_type)\n            )\n            warnings.warn(warning_text_upgrade)\n\n        if parse(current).major > parse(saved).major:\n            warning_text_outdated = (\n                \"The schema version of the saved %s\"\n                \"(%s) is no longer supported by this version \"\n                \"of featuretools. Attempting to load %s ...\"\n                % (cls_type, saved, cls_type)\n            )\n            logger.warning(warning_text_outdated)\n"
  },
  {
    "path": "featuretools/utils/time_utils.py",
    "content": "from datetime import datetime, timedelta\n\nimport numpy as np\nimport pandas as pd\n\n\ndef make_temporal_cutoffs(\n    instance_ids,\n    cutoffs,\n    window_size=None,\n    num_windows=None,\n    start=None,\n):\n    \"\"\"Makes a set of equally spaced cutoff times prior to a set of input cutoffs and instance ids.\n\n    If window_size and num_windows are provided, then num_windows of size window_size will be created\n    prior to each cutoff time\n\n    If window_size and a start list is provided, then a variable number of windows will be created prior\n    to each cutoff time, with the corresponding start time as the first cutoff.\n\n    If num_windows and a start list is provided, then num_windows of variable size will be created prior\n    to each cutoff time, with the corresponding start time as the first cutoff\n\n    Args:\n        instance_ids (list, np.ndarray, or pd.Series): list of instance ids. This function will make a\n            new datetime series of multiple cutoff times for each value in this array.\n        cutoffs (list, np.ndarray, or pd.Series): list of datetime objects associated with each instance id.\n            Each one of these will be the last time in the new datetime series for each instance id\n        window_size (pd.Timedelta, optional): amount of time between each datetime in each new cutoff series\n        num_windows (int, optional): number of windows in each new cutoff series\n        start (list, optional): list of start times for each instance id\n    \"\"\"\n    if window_size is not None and num_windows is not None and start is not None:\n        raise ValueError(\n            \"Only supply 2 of the 3 optional args, window_size, num_windows and start\",\n        )\n    out = []\n    for i, id_time in enumerate(zip(instance_ids, cutoffs)):\n        _id, time = id_time\n        _window_size = window_size\n        _start = None\n        if start is not None:\n            if window_size is None:\n            
    _window_size = (time - start[i]) / (num_windows - 1)\n            else:\n                _start = start[i]\n        to_add = pd.DataFrame()\n        to_add[\"time\"] = pd.date_range(\n            end=time,\n            periods=num_windows,\n            freq=_window_size,\n            start=_start,\n        )\n        to_add[\"instance_id\"] = [_id] * len(to_add[\"time\"])\n        out.append(to_add)\n    return pd.concat(out).reset_index(drop=True)\n\n\ndef convert_time_units(secs, unit):\n    \"\"\"\n    Converts a time specified in seconds to a time in the given units\n\n    Args:\n        secs (integer): number of seconds. This function will convert the units of this number.\n        unit(str): units to be converted to.\n            acceptable values: years, months, days, hours, minutes, seconds, milliseconds, nanoseconds\n    \"\"\"\n    unit_divs = {\n        \"years\": 31540000,\n        \"months\": 2628000,\n        \"days\": 86400,\n        \"hours\": 3600,\n        \"minutes\": 60,\n        \"seconds\": 1,\n        \"milliseconds\": 0.001,\n        \"nanoseconds\": 0.000000001,\n    }\n    if unit not in unit_divs:\n        raise ValueError(\"Invalid unit given, make sure it is plural\")\n\n    return secs / (unit_divs[unit])\n\n\ndef convert_datetime_to_floats(x):\n    first = int(x.iloc[0].value * 1e-9)\n    x = pd.to_numeric(x).astype(np.float64).values\n    dividend = find_dividend_by_unit(first)\n    x *= 1e-9 / dividend\n    return x\n\n\ndef convert_timedelta_to_floats(x):\n    first = int(x.iloc[0].total_seconds())\n    dividend = find_dividend_by_unit(first)\n    x = pd.TimedeltaIndex(x).total_seconds().astype(np.float64) / dividend\n    return x\n\n\ndef find_dividend_by_unit(time):\n    \"\"\"Finds whether time best corresponds to a value in\n    days, hours, minutes, or seconds.\n    \"\"\"\n    for dividend in [86400, 3600, 60]:\n        div = time / dividend\n        if round(div) == div:\n            return dividend\n    return 
1\n\n\ndef calculate_trend(series):\n    # numpy can't handle `Int64` values, so cast to float\n    if series.dtype == \"Int64\":\n        series = series.astype(\"float64\")\n    df = pd.DataFrame({\"x\": series.index, \"y\": series.values}).dropna()\n    if df.shape[0] <= 2:\n        return np.nan\n    if isinstance(df[\"x\"].iloc[0], (datetime, pd.Timestamp)):\n        x = convert_datetime_to_floats(df[\"x\"])\n    else:\n        x = df[\"x\"].values\n\n    if isinstance(df[\"y\"].iloc[0], (datetime, pd.Timestamp)):\n        y = convert_datetime_to_floats(df[\"y\"])\n    elif isinstance(df[\"y\"].iloc[0], (timedelta, pd.Timedelta)):\n        y = convert_timedelta_to_floats(df[\"y\"])\n    else:\n        y = df[\"y\"].values\n\n    x = x - x.mean()\n    y = y - y.mean()\n\n    # prevent divide by zero error\n    if len(np.unique(x)) == 1:\n        return 0\n\n    # consider scipy.stats.linregress for large n cases\n    coefficients = np.polyfit(x, y, 1)\n    return coefficients[0]\n"
  },
  {
    "path": "featuretools/utils/trie.py",
    "content": "class Trie(object):\n    \"\"\"\n    A trie (prefix tree) where the keys are sequences of hashable objects.\n\n    It behaves similarly to a dictionary, except that the keys can be lists or\n    other sequences.\n\n    Examples:\n        >>> from featuretools.utils import Trie\n        >>> trie = Trie(default=str)\n        >>> # Set a value\n        >>> trie.get_node([1, 2, 3]).value = '123'\n        >>> # Get a value\n        >>> trie.get_node([1, 2, 3]).value\n        '123'\n        >>> # Overwrite a value\n        >>> trie.get_node([1, 2, 3]).value = 'updated'\n        >>> trie.get_node([1, 2, 3]).value\n        'updated'\n        >>> # Getting a key that has not been set returns the default value.\n        >>> trie.get_node([1, 2]).value\n        ''\n    \"\"\"\n\n    def __init__(self, default=lambda: None, path_constructor=list):\n        \"\"\"\n        default: A function returning the value to use for new nodes.\n        path_constructor: A function which constructs a path from a list. 
The\n            path type must support addition (concatenation).\n        \"\"\"\n        self.value = default()\n        self._children = {}\n        self._default = default\n        self._path_constructor = path_constructor\n\n    def children(self):\n        \"\"\"\n        A list of pairs of the edges from this node and the nodes they point\n        to.\n\n        Examples:\n            >>> from featuretools.utils import Trie\n            >>> trie = Trie(default=str)\n            >>> trie.get_node([1, 2]).value = '12'\n            >>> trie.get_node([3]).value = '3'\n            >>> children = trie.children()\n            >>> first_edge, first_child = children[0]\n            >>> first_edge\n            1\n            >>> first_child.value\n            ''\n            >>> second_edge, second_child = children[1]\n            >>> second_edge\n            3\n            >>> second_child.value\n            '3'\n        \"\"\"\n        return list(self._children.items())\n\n    def get_node(self, path):\n        \"\"\"\n        Get the sub-trie at the given path. 
If it does not yet exist initialize\n        it with the default value.\n\n        Examples:\n            >>> from featuretools.utils import Trie\n            >>> t = Trie()\n            >>> t.get_node([1, 2, 3]).value = '123'\n            >>> t.get_node([1, 2, 4]).value = '124'\n            >>> sub = t.get_node([1, 2])\n            >>> sub.get_node([3]).value\n            '123'\n            >>> sub.get_node([4]).value\n            '124'\n        \"\"\"\n        if path:\n            first = path[0]\n            rest = path[1:]\n\n            if first in self._children:\n                sub_trie = self._children[first]\n            else:\n                sub_trie = Trie(\n                    default=self._default,\n                    path_constructor=self._path_constructor,\n                )\n                self._children[first] = sub_trie\n\n            return sub_trie.get_node(rest)\n        else:\n            return self\n\n    def __iter__(self):\n        \"\"\"\n        Iterate over all values in the trie. Yields tuples of (path, value).\n\n        Implemented using depth first search.\n        \"\"\"\n        yield self._path_constructor([]), self.value\n\n        for key, sub_trie in self.children():\n            path_to_children = self._path_constructor([key])\n\n            for sub_path, value in sub_trie:\n                path = path_to_children + sub_path\n                yield path, value\n"
  },
  {
    "path": "featuretools/utils/utils_info.py",
    "content": "import locale\nimport os\nimport platform\nimport struct\nimport sys\n\nimport pkg_resources\n\nimport featuretools\n\ndeps = [\n    \"numpy\",\n    \"pandas\",\n    \"tqdm\",\n    \"cloudpickle\",\n    \"dask\",\n    \"distributed\",\n    \"psutil\",\n    \"pip\",\n    \"setuptools\",\n]\n\n\ndef show_info():\n    print(\"Featuretools version: %s\" % featuretools.__version__)\n    print(\"Featuretools installation directory: %s\" % get_featuretools_root())\n    print_sys_info()\n    print_deps(deps)\n\n\ndef print_sys_info():\n    print(\"\\nSYSTEM INFO\")\n    print(\"-----------\")\n    sys_info = get_sys_info()\n    for k, stat in sys_info:\n        print(\"{k}: {stat}\".format(k=k, stat=stat))\n\n\ndef print_deps(dependencies):\n    print(\"\\nINSTALLED VERSIONS\")\n    print(\"------------------\")\n    installed_packages = get_installed_packages()\n\n    package_dep = []\n    for x in dependencies:\n        # prevents uninstalled deps from being printed\n        if x in installed_packages:\n            package_dep.append((x, installed_packages[x]))\n    for k, stat in package_dep:\n        print(\"{k}: {stat}\".format(k=k, stat=stat))\n\n\n# Modified from here\n# https://github.com/pandas-dev/pandas/blob/d9a037ec4ad0aab0f5bf2ad18a30554c38299e57/pandas/util/_print_versions.py#L11\ndef get_sys_info():\n    \"Returns system information as a dict\"\n\n    blob = []\n\n    try:\n        (sysname, nodename, release, version, machine, processor) = platform.uname()\n        blob.extend(\n            [\n                (\"python\", \".\".join(map(str, sys.version_info))),\n                (\"python-bits\", struct.calcsize(\"P\") * 8),\n                (\"OS\", \"{sysname}\".format(sysname=sysname)),\n                (\"OS-release\", \"{release}\".format(release=release)),\n                (\"machine\", \"{machine}\".format(machine=machine)),\n                (\"processor\", \"{processor}\".format(processor=processor)),\n                (\"byteorder\", 
\"{byteorder}\".format(byteorder=sys.byteorder)),\n                (\"LC_ALL\", \"{lc}\".format(lc=os.environ.get(\"LC_ALL\", \"None\"))),\n                (\"LANG\", \"{lang}\".format(lang=os.environ.get(\"LANG\", \"None\"))),\n                (\"LOCALE\", \".\".join(map(str, locale.getlocale()))),\n            ],\n        )\n    except (KeyError, ValueError):\n        pass\n\n    return blob\n\n\ndef get_installed_packages():\n    installed_packages = {}\n    for d in pkg_resources.working_set:\n        installed_packages[d.project_name] = d.version\n    return installed_packages\n\n\ndef get_featuretools_root():\n    return os.path.dirname(featuretools.__file__)\n"
  },
  {
    "path": "featuretools/utils/wrangle.py",
    "content": "import re\nimport tarfile\nfrom datetime import datetime\n\nimport numpy as np\nimport pandas as pd\nfrom woodwork.logical_types import Datetime, Ordinal\n\nfrom featuretools.entityset.timedelta import Timedelta\n\n\ndef _check_timedelta(td):\n    \"\"\"\n    Convert strings to Timedelta objects\n    Allows for both shortform and longform units, as well as any form of capitalization\n    '2 Minutes'\n    '2 minutes'\n    '2 m'\n    '1 Minute'\n    '1 minute'\n    '1 m'\n    '1 units'\n    '1 Units'\n    '1 u'\n    Shortform is fine if space is dropped\n    '2m'\n    '1u\"\n    If a pd.Timedelta object is passed, units will be converted to seconds due to the underlying representation\n        of pd.Timedelta.\n    If a pd.DateOffset object is passed, it will be converted to a Featuretools Timedelta if it has one\n        temporal parameter. Otherwise, it will remain a pd.DateOffset.\n    \"\"\"\n    if td is None:\n        return td\n    if isinstance(td, Timedelta):\n        return td\n    elif not isinstance(td, (int, float, str, pd.DateOffset, pd.Timedelta)):\n        raise ValueError(\"Unable to parse timedelta: {}\".format(td))\n    if isinstance(td, pd.Timedelta):\n        unit = \"s\"\n        value = td.total_seconds()\n        times = {unit: value}\n        return Timedelta(times, delta_obj=td)\n    elif isinstance(td, pd.DateOffset):\n        # DateOffsets\n        if td.__class__.__name__ != \"DateOffset\":\n            if hasattr(td, \"__dict__\"):\n                # Special offsets (such as BDay) - prior to pandas 1.0.0\n                value = td.__dict__[\"n\"]\n            else:\n                # Special offsets (such as BDay) - after pandas 1.0.0\n                value = td.n\n            unit = td.__class__.__name__\n            times = dict([(unit, value)])\n        else:\n            times = dict()\n            for td_unit, td_value in td.kwds.items():\n                times[td_unit] = td_value\n        return Timedelta(times, 
delta_obj=td)\n    else:\n        pattern = \"([0-9]+) *([a-zA-Z]+)$\"\n        match = re.match(pattern, td)\n        value, unit = match.groups()\n        try:\n            value = int(value)\n        except Exception:\n            try:\n                value = float(value)\n            except Exception:\n                raise ValueError(\n                    \"Unable to parse value {} from \".format(value)\n                    + \"timedelta string: {}\".format(td),\n                )\n        times = {unit: value}\n        return Timedelta(times)\n\n\ndef _check_time_against_column(time, time_column):\n    \"\"\"\n    Check to make sure that time is compatible with time_column,\n    where time could be a timestamp, or a Timedelta, number, or None,\n    and time_column is a Woodwork initialized column. Compatibility means that\n    arithmetic can be performed between time and elements of time_column\n\n    If time is None, then we don't care if arithmetic can be performed\n    (presumably it won't ever be performed)\n    \"\"\"\n    if time is None:\n        return True\n    elif isinstance(time, (int, float)):\n        return time_column.ww.schema.is_numeric\n    elif isinstance(time, (pd.Timestamp, datetime, pd.DateOffset)):\n        return time_column.ww.schema.is_datetime\n    elif isinstance(time, Timedelta):\n        if time_column.ww.schema.is_datetime:\n            return True\n        elif time.unit not in Timedelta._time_units:\n            if (\n                isinstance(time_column.ww.logical_type, Ordinal)\n                or \"numeric\" in time_column.ww.semantic_tags\n                or \"time_index\" in time_column.ww.semantic_tags\n            ):\n                return True\n    return False\n\n\ndef _check_time_type(time):\n    \"\"\"\n    Checks if `time` is an instance of common int, float, or datetime types.\n    Returns \"numeric\" or Datetime based on results\n    \"\"\"\n    time_type = None\n    if isinstance(time, (datetime, 
np.datetime64)):\n        time_type = Datetime\n    elif (\n        isinstance(time, (int, float))\n        or np.issubdtype(time, np.integer)\n        or np.issubdtype(time, np.floating)\n    ):\n        time_type = \"numeric\"\n    return time_type\n\n\ndef _is_s3(string):\n    \"\"\"\n    Checks if the given string is an S3 path.\n    Returns a boolean.\n    \"\"\"\n    return string.startswith(\"s3://\")\n\n\ndef _is_url(string):\n    \"\"\"\n    Checks if the given string is a URL.\n    Returns a boolean.\n    \"\"\"\n    return string.startswith(\"http\")\n\n\ndef _is_local_tar(string):\n    \"\"\"\n    Checks if the given string is a local tarfile path.\n    Returns a boolean.\n    \"\"\"\n    return string.endswith(\".tar\") and tarfile.is_tarfile(string)\n"
  },
  {
    "path": "featuretools/version.py",
    "content": "__version__ = \"1.31.0\"\nENTITYSET_SCHEMA_VERSION = \"9.0.0\"\nFEATURES_SCHEMA_VERSION = \"10.0.0\"\n"
  },
  {
    "path": "pyproject.toml",
    "content": "[project]\nname = \"featuretools\"\nreadme = \"README.md\"\ndescription = \"a framework for automated feature engineering\"\ndynamic = [\"version\"]\nclassifiers = [\n    \"Development Status :: 5 - Production/Stable\",\n    \"Intended Audience :: Science/Research\",\n    \"Intended Audience :: Developers\",\n    \"Topic :: Software Development\",\n    \"Topic :: Scientific/Engineering\",\n    \"Programming Language :: Python\",\n    \"Programming Language :: Python :: 3\",\n    \"Programming Language :: Python :: 3.9\",\n    \"Programming Language :: Python :: 3.10\",\n    \"Programming Language :: Python :: 3.11\",\n    \"Programming Language :: Python :: 3.12\",\n    \"Operating System :: Microsoft :: Windows\",\n    \"Operating System :: POSIX\",\n    \"Operating System :: Unix\",\n    \"Operating System :: MacOS\",\n]\nauthors = [\n    {name=\"Alteryx, Inc.\", email=\"open_source_support@alteryx.com\"}\n]\nmaintainers = [\n    {name=\"Alteryx, Inc.\", email=\"open_source_support@alteryx.com\"}\n]\nkeywords = [\"feature engineering\", \"data science\", \"machine learning\"]\nlicense = {text = \"BSD 3-clause\"}\nrequires-python = \">=3.9,<4\"\ndependencies = [\n    \"cloudpickle >= 1.5.0\",\n    \"holidays >= 0.17\",\n    \"numpy >= 1.25.0, < 2.0.0\",\n    \"packaging >= 20.0\",\n    \"pandas >= 2.0.0\",\n    \"psutil >= 5.7.0\",\n    \"scipy >= 1.10.0\",\n    \"tqdm >= 4.66.3\",\n    \"woodwork >= 0.28.0\",\n]\n\n[project.urls]\n\"Documentation\" = \"https://featuretools.alteryx.com\"\n\"Source Code\"= \"https://github.com/alteryx/featuretools/\"\n\"Changes\" = \"https://featuretools.alteryx.com/en/latest/release_notes.html\"\n\"Issue Tracker\" = \"https://github.com/alteryx/featuretools/issues\"\n\"Twitter\" = \"https://twitter.com/alteryxoss\"\n\"Chat\" = \"https://join.slack.com/t/alteryx-oss/shared_invite/zt-182tyvuxv-NzIn6eiCEf8TBziuKp0bNA\"\n\n[project.optional-dependencies]\ntest = [\n    \"boto3 >= 1.34.32\",\n    \"composeml >= 
0.8.0\",\n    \"graphviz >= 0.8.4\",\n    \"moto[all] >= 5.0.0\",\n    \"pip >= 23.3.0\",\n    \"pyarrow >= 14.0.1\",\n    \"pympler >= 0.8\",\n    \"pytest >= 7.1.2\",\n    \"pytest-cov >= 3.0.0\",\n    \"pytest-xdist >= 2.5.0\",\n    \"smart-open >= 5.0.0\",\n    \"urllib3 >= 1.26.18\",\n    \"pytest-timeout >= 2.1.0\",\n]\ndask = [\n    \"dask[dataframe] >= 2023.2.0\",\n    \"distributed >= 2023.2.0\",\n]\ntsfresh = [\n    \"featuretools-tsfresh-primitives >= 1.0.0\",\n]\nautonormalize = [\n    \"autonormalize >= 2.0.1\",\n]\nsql = [\n    \"featuretools_sql >= 0.0.1\",\n    \"psycopg2-binary >= 2.9.3\",\n]\nsklearn = [\n    \"featuretools-sklearn-transformer >= 1.0.0\",\n]\npremium = [\n    \"premium-primitives >= 0.0.3\",\n]\nnlp = [\n    \"nlp-primitives >= 2.12.0\",\n]\ndocs = [\n    \"ipython == 8.4.0\",\n    \"jupyter == 1.0.0\",\n    \"jupyter-client >= 8.0.2\",\n    \"matplotlib == 3.7.2\",\n    \"Sphinx == 5.1.1\",\n    \"nbsphinx == 0.8.9\",\n    \"nbconvert == 6.5.0\",\n    \"pydata-sphinx-theme == 0.9.0\",\n    \"sphinx-inline-tabs == 2022.1.2b11\",\n    \"sphinx-copybutton == 0.5.0\",\n    \"myst-parser == 0.18.0\",\n    \"autonormalize >= 2.0.1\",\n    \"click >= 7.0.0\",\n    \"featuretools[dask,test]\",\n]\ndev = [\n    \"ruff >= 0.1.6\",\n    \"black[jupyter] >= 23.1.0\",\n    \"pre-commit >= 2.20.0\",\n    \"featuretools[docs,dask,test]\",\n]\ncomplete = [\n    \"featuretools[premium,nlp,dask]\",\n]\n\n[tool.setuptools]\ninclude-package-data = true\nlicense-files = [\n    \"LICENSE\",\n    \"featuretools/primitives/data/free_email_provider_domains_license\"\n]\n\n[tool.setuptools.packages.find]\nnamespaces = true\n\n[tool.setuptools.package-data]\n\"*\" = [\n    \"*.txt\",\n    \"README.md\",\n]\n\"featuretools\" = [\n    \"primitives/data/*.csv\",\n    \"primitives/data/*.txt\",\n]\n\n[tool.setuptools.exclude-package-data]\n\"*\" = [\n    \"* __pycache__\",\n    \"*.py[co]\",\n    \"docs/*\"\n]\n\n[tool.setuptools.dynamic]\nversion = {attr = 
\"featuretools.version.__version__\"}\n\n[tool.pytest.ini_options]\naddopts = \"--doctest-modules --ignore=featuretools/tests/entry_point_tests/add-ons\"\ntestpaths = [\n    \"featuretools/tests/*\"\n]\nfilterwarnings = [\n    \"ignore::DeprecationWarning\",\n    \"ignore::PendingDeprecationWarning\"\n]\n\n[tool.ruff]\nline-length = 88\ntarget-version = \"py311\"\nlint.ignore = [\"E501\"]\nlint.select = [\n    # Pyflakes\n    \"F\",\n    # Pycodestyle\n    \"E\",\n    \"W\",\n    # isort\n    \"I001\"\n]\nsrc = [\"featuretools\"]\n\n[tool.ruff.lint.per-file-ignores]\n\"__init__.py\" = [\"E402\", \"F401\", \"I001\", \"E501\"]\n\n[tool.ruff.lint.isort]\nknown-first-party = [\"featuretools\"]\n\n[tool.coverage.run]\nsource = [\"featuretools\"]\nomit = [\n    \"*/add-ons/**/*\"\n]\n\n[tool.coverage.report]\nexclude_lines =[\n    \"pragma: no cover\",\n    \"def __repr__\",\n    \"raise AssertionError\",\n    \"raise NotImplementedError\",\n    \"if __name__ == .__main__.:\",\n    \"if self._verbose:\",\n    \"if verbose:\",\n    \"if profile:\",\n    \"pytest.skip\"\n]\n[build-system]\nrequires = [\n    \"setuptools >= 61.0.0\",\n    \"wheel\"\n]\nbuild-backend = \"setuptools.build_meta\"\n"
  },
  {
    "path": "release.md",
    "content": "# Release Process\n\n## 0. Pre-Release Checklist\n\nBefore starting the release process, verify the following:\n\n- All work required for this release has been completed and the team is ready to release.\n- [All Github Actions Tests are green on main](https://github.com/alteryx/featuretools/actions?query=branch%3Amain).\n- EvalML Tests are green with Featuretools main\n  - [![Unit Tests - EvalML with Featuretools main branch](https://github.com/alteryx/evalml/actions/workflows/unit_tests_with_featuretools_main_branch.yaml/badge.svg?branch=main)](https://github.com/alteryx/evalml/actions/workflows/unit_tests_with_featuretools_main_branch.yaml)\n- Looking Glass performance tests runs should not show any significant performance regressions when comparing the last commit on `main` with the previous release of Featuretools. See Step 1 below for instructions on manually launching the performance tests runs.\n- The [ReadtheDocs build](https://readthedocs.com/projects/feature-labs-inc-featuretools/) for \"latest\" is marked as passed. To avoid mysterious errors, best practice is to empty your browser cache when reading new versions of the docs!\n- The [public documentation for the \"latest\" branch](https://featuretools.alteryx.com/en/latest/) looks correct, and the [release notes](https://featuretools.alteryx.com/en/latest/release_notes.html) includes the last change which was made on `main`.\n- Get agreement on the version number to use for the release.\n\n#### Version Numbering\n\nFeaturetools uses [semantic versioning](https://semver.org/). Every release has a major, minor and patch version number, and are displayed like so: `<majorVersion>.<minorVersion>.<patchVersion>`.\n\nIn certain instances, it may be necessary to create a backport release. This is when commits from a newer version of a library are ported to an older version of the software and then released. 
This occurs when anything but the latest commit on main is used as the target for release, but can go so far as to add a further patch release, such as 0.11.2, to be released after a 0.12.0 version had already been released. If a backport release is being performed, please see the [Backport Release Guide](docs/backport_release.md) for instructions on how to proceed, as some steps from this guide should be performed differently.\n\nIf you'd like to create a development release, which won't be deployed to pypi and conda and marked as a generally-available production release, please add a \"dev\" prefix to the patch version, i.e. `X.X.devX`. Note this claims the patch number--if the previous release was `0.12.0`, a subsequent dev release would be `0.12.dev1`, and the following release would be `0.12.2`, _not_ `0.12.1`. Development releases deploy to [test.pypi.org](https://test.pypi.org/project/featuretools/) instead of to [pypi.org](https://pypi.org/project/featuretools).\n\n## 1. Evaluate Performance Test Results\n\nBefore releasing Featuretools, the person performing the release should launch a performance test run and evaluate the results to make sure no significant performance regressions will be introduced by the release. This can be done by launching a Looking Glass performance test run, which will then post results to Slack. \n\nTo manually launch a Looking Glass performance test run, follow these steps:\n1. Navigate to the [Looking Glass performance tests](https://github.com/alteryx/featuretools/actions/workflows/looking_glass_performance_tests.yaml) GitHub action\n2. Click on the Run workflow dropdown to set up the run\n3. Make sure that the \"use workflow from\" dropdown is set to `main` to use the workflow version in Featuretools `main`\n4. Enter the hash of the most recent commit to `main` in the \"new commit to evaluate\" field. For example: `cee9607`\n5. 
Enter the version tag of the last release of Featuretools in the \"previous commit to evaluate\" field. For example, if the last release of Featuretools was version 1.20.0, you would enter `v1.20.0` here.\n6. Click the \"Run workflow\" button to launch the jobs\n\nOnce the job has been completed, the results summaries will be posted to Slack automatically. Review the results and make sure the performance has not degraded. If any significant performance issues are noted, discuss with the development team before proceeding.\n\nNote: The procedure above can also be used to launch performance tests runs at any time, even outside of the release process. When launching a test run, the commit fields can take any commit hash, GitHub branch or tag as input to specify the new and previous commits to compare.\n\n## 2. Create Featuretools release on Github\n\n#### Create Release Branch\n\n1. Branch off of featuretools main. For the branch name, please use \"release_vX.Y.Z\" as the naming scheme (e.g. \"release_v0.13.3\"). Doing so will bypass our release notes checkin test which requires all other PRs to add a release note entry.\n\n#### Bump Version Number\n\n1. Bump `__version__` in `featuretools/version.py`, and `featuretools/tests/test_version.py`.\n\n#### Update Release Notes\n\n1. Replace \"Future Release\" in `docs/source/release_notes.rst` with the current date\n\n   ```\n   v0.13.3 Sep 28, 2020\n   ====================\n   ```\n\n2. Remove any unused Release Notes sections for this release (e.g. Fixes, Testing Changes)\n3. Add yourself to the list of contributors to this release and **put the contributors in alphabetical order**\n4. The release PR does not need to be mentioned in the list of changes\n5. Add a commented out \"Future Release\" section with all of the Release Notes sections above the current section\n\n   ```\n   .. 
Future Release\n     ==============\n       * Enhancements\n       * Fixes\n       * Changes\n       * Documentation Changes\n       * Testing Changes\n\n   .. Thanks to the following people for contributing to this release:\n   ```\n\n#### Create Release PR\n\nA [release PR](https://github.com/alteryx/featuretools/pull/856) should have **the version number as the title** and the release notes for that release as the PR body text. The contributors list is not necessary. The special sphinx docs syntax (:pr:\\`547\\`) needs to be changed to GitHub link syntax (#547).\n\nChecklist before merging:\n\n- The title of the PR is the version number.\n- All tests are currently green on checkin and on `main`.\n- The ReadtheDocs build for the release PR branch has passed, and the resulting docs contain the expected release notes.\n- PR has been reviewed and approved.\n- Confirm with the team that `main` will be frozen until step 3 (Github Release) is complete.\n\nAfter merging, verify again that ReadtheDocs \"latest\" is correct.\n\n## 3. Create Github Release\n\nAfter the release pull request has been merged into the `main` branch, it is time to draft the GitHub release. [Example release](https://github.com/alteryx/featuretools/releases/tag/v0.13.3)\n\n- The target should be the `main` branch\n- The tag should be the version number with a v prefix (e.g. v0.13.3)\n- Release title is the same as the tag\n- Release description should be the full Release Notes updates for the release, including the line thanking contributors. Contributors should also have their links changed from the docs syntax (:user:\\`gsheni\\`) to GitHub syntax (@gsheni)\n- This is not a pre-release\n- Publishing the release will automatically upload the package to PyPI\n\n## 4. Release on conda-forge\n\nIn order to release on conda-forge, you can either wait for a bot to create a pull request, or use a GitHub Actions workflow.\n\n### Option a: Use a GitHub Action workflow\n\n1. 
After the package has been uploaded on PyPI, the **Create Feedstock Pull Request** workflow should automatically kick off a job.\n    * If it does not, go [here](https://github.com/alteryx/featuretools/actions/workflows/create_feedstock_pr.yaml)\n    * Click **Run workflow** and input the letter `v` followed by the release version (e.g. `v0.13.3`)\n    * Kick off the GitHub Action, and monitor the Job Summary.\n2. Once the job has been completed, you will see summary output, with a URL.\n    * Visit that URL and create a pull request.\n    * Alternatively, create the pull request by clicking the branch name (e.g. `v0.13.3`):\n      - https://github.com/alteryx/featuretools-feedstock/branches\n3. Verify that the PR has the following:\n    * The `build['number']` is 0 (in __recipe/meta.yaml__).\n    * The `requirements['run']` (in __recipe/meta.yaml__) matches the `[project]['dependencies']` in __featuretools/pyproject.toml__.\n    * The `test['requires']` (in __recipe/meta.yaml__) matches the `[project.optional-dependencies]['test']` in __featuretools/pyproject.toml__\n    > There will be 2 entries for graphviz: `graphviz` and `python-graphviz`.\n    > Make sure `python-graphviz` (in __recipe/meta.yaml__) matches `graphviz` in `[project.optional-dependencies]['test']` in __featuretools/pyproject.toml__.\n4. Satisfy the conditions in the pull request description and **merge it if the CI passes**.\n\n### Option b: Wait for a bot to create a new PR\n\n1. A bot should automatically create a new PR in [conda-forge/featuretools-feedstock](https://github.com/conda-forge/featuretools-feedstock/pulls) - note, the PR may take up to a few hours to be created\n2. Update requirements changes in `recipe/meta.yaml` (bot should have handled version and source links on its own)\n3. 
After tests pass, a maintainer will merge the PR in\n\n# Miscellaneous\n## Add new maintainers to featuretools-feedstock\n\nPer the instructions [here](https://conda-forge.org/docs/maintainer/updating_pkgs.html#updating-the-maintainer-list):\n1. Ask an existing maintainer to create an issue on the [repo](https://github.com/conda-forge/featuretools-feedstock).\n  a. Select *Bot commands* and put the following title (change `username`):\n\n  ```text\n  @conda-forge-admin, please add user @username\n  ```\n\n2. A PR will be auto-created on the repo, and will need to be merged by an existing maintainer.\n3. The new user will need to **check their email for an invite link to click**, which should be https://github.com/conda-forge\n"
  }
]