[
  {
    "path": ".circleci/config.yml",
    "content": "version: 2.1\n\njobs:\n  python3:\n    docker:\n      - image: cimg/python:3.9\n    environment:\n      - OMP_NUM_THREADS: 1\n    steps:\n      - checkout\n      - run: ./build_tools/circle/checkout_merge_commit.sh\n      - run:\n          command: ./build_tools/circle/build_doc.sh\n          no_output_timeout: 30m\n      - store_artifacts:\n          path: doc/_build/html\n          destination: doc\n      - store_artifacts:\n          path: ~/log.txt\n          destination: log.txt\n      - persist_to_workspace:\n          root: doc/_build/html\n          paths: .\n\n  deploy:\n    docker:\n      - image: cimg/python:3.9\n    environment:\n      - USERNAME: \"glemaitre\"\n      - ORGANIZATION: \"imbalanced-learn\"\n      - DOC_REPO: \"imbalanced-learn.github.io\"\n      - EMAIL: \"g.lemaitre58@gmail.com\"\n    steps:\n      - checkout\n      - run: ./build_tools/circle/checkout_merge_commit.sh\n      - attach_workspace:\n          at: doc/_build/html\n      - run: ls -ltrh doc/_build/html\n      - deploy:\n          command: |\n            if [[ \"${CIRCLE_BRANCH}\" =~ ^master$|^[0-9]+\\.[0-9]+\\.X$ ]]; then\n              bash ./build_tools/circle/push_doc.sh doc/_build/html\n            fi\n\nworkflows:\n  version: 2\n  build-doc-and-deploy:\n    jobs:\n      - python3\n      - deploy:\n          requires:\n            - python3\n"
  },
  {
    "path": ".coveragerc",
    "content": "[run]\nbranch = True\n\n[report]\nexclude_lines =\n    if self.debug:\n    pragma: no cover\n    raise NotImplementedError\nignore_errors = True\nomit =\n    */tests/*\n    **/setup.py\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/bug_report.md",
    "content": "---\nname: Bug report\nabout: Create a report to help us reproduce and correct the bug\ntitle: \"[BUG]\"\nlabels: bug\nassignees: ''\n\n---\n\n#### Describe the bug\nA clear and concise description of what the bug is.\n\n#### Steps/Code to Reproduce\n<!--\nExample:\n```python\nfrom sklearn.feature_extraction.text import CountVectorizer\nfrom sklearn.decomposition import LatentDirichletAllocation\n\ndocs = [\"Help I have a bug\" for i in range(1000)]\n\nvectorizer = CountVectorizer(input=docs, analyzer='word')\nlda_features = vectorizer.fit_transform(docs)\n\nlda_model = LatentDirichletAllocation(\n    n_topics=10,\n    learning_method='online',\n    evaluate_every=10,\n    n_jobs=4,\n)\nmodel = lda_model.fit(lda_features)\n```\nIf the code is too long, feel free to put it in a public gist and link\nit in the issue: https://gist.github.com\n-->\n\n```\nSample code to reproduce the problem\n```\n\n#### Expected Results\n<!-- Example: No error is thrown. Please paste or describe the expected results.-->\n\n#### Actual Results\n<!-- Please paste or specifically describe the actual output or traceback. -->\n\n#### Versions\n<!--\nPlease run the following snippet and paste the output below.\nFor scikit-learn >= 0.20:\nimport sklearn; sklearn.show_versions()\nFor scikit-learn < 0.20:\nimport platform; print(platform.platform())\nimport sys; print(\"Python\", sys.version)\nimport numpy; print(\"NumPy\", numpy.__version__)\nimport scipy; print(\"SciPy\", scipy.__version__)\nimport sklearn; print(\"Scikit-Learn\", sklearn.__version__)\nimport imblearn; print(\"Imbalanced-Learn\", imblearn.__version__)\n-->\n\n\n<!-- Thanks for contributing! -->\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/documentation-improvement.md",
    "content": "---\nname: Documentation improvement\nabout: Create a report to help us improve the documentation\ntitle: \"[DOC]\"\nlabels: Documentation, help wanted, good first issue\nassignees: ''\n\n---\n\n#### Describe the issue linked to the documentation\n\nTell us about the confusion introduce in the documentation.\n\n#### Suggest a potential alternative/fix\n\nTell us how we could improve the documentation in this regard.\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/feature_request.md",
    "content": "---\nname: Feature request\nabout: Suggest an new algorithm, enhancement to an existing algorithm, etc.\ntitle: \"[ENH]\"\nlabels: enhancement\nassignees: ''\n\n---\n\n<--\nIf you want to propose a new algorithm, please refer first to the scikit-learn inclusion criterion:\nhttps://scikit-learn.org/stable/faq.html#what-are-the-inclusion-criteria-for-new-algorithms\n-->\n\n#### Is your feature request related to a problem? Please describe\n\n#### Describe the solution you'd like\n\n#### Describe alternatives you've considered\n\n#### Additional context\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/other--blank-template-.md",
    "content": "---\nname: Other (blank template)\nabout: For all other issues to reach the community...\ntitle: ''\nlabels: ''\nassignees: ''\n\n---\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/question.md",
    "content": "---\nname: Question\nabout: If you have a usage question\ntitle: ''\nlabels: ''\nassignees: ''\n\n---\n\n**\nIf your issue is a usage question, submit it here instead:\n- The imbalanced learn gitter: https://gitter.im/scikit-learn-contrib/imbalanced-learn\n**\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/usage-question.md",
    "content": "---\nname: Usage question\nabout: If you have a usage question\ntitle: \"[SO]\"\nlabels: question\nassignees: ''\n\n---\n\n** If your issue is a usage question, submit it here instead:**\n- **The imbalanced learn gitter: https://gitter.im/scikit-learn-contrib/imbalanced-learn**\n- **StackOverflow with the imblearn (or imbalanced-learn) tag:https://stackoverflow.com/questions/tagged/imblearn**\n\nWe are going to automatically close this issue if this is not link to a bug or an enhancement.\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE.md",
    "content": "<!--\nIf your issue is a usage question, submit it here instead:\n- The imbalanced learn gitter: https://gitter.im/scikit-learn-contrib/imbalanced-learn\n-->\n\n<!-- Instructions For Filing a Bug: https://github.com/scikit-learn-contrib/imbalanced-learn/blob/master/CONTRIBUTING.md#filing-bugs -->\n\n#### Description\n<!-- Example: Joblib Error thrown when calling fit on LatentDirichletAllocation with evaluate_every > 0-->\n\n#### Steps/Code to Reproduce\n<!--\nExample:\n```\nfrom sklearn.feature_extraction.text import CountVectorizer\nfrom sklearn.decomposition import LatentDirichletAllocation\n\ndocs = [\"Help I have a bug\" for i in range(1000)]\n\nvectorizer = CountVectorizer(input=docs, analyzer='word')\nlda_features = vectorizer.fit_transform(docs)\n\nlda_model = LatentDirichletAllocation(\n    n_topics=10,\n    learning_method='online',\n    evaluate_every=10,\n    n_jobs=4,\n)\nmodel = lda_model.fit(lda_features)\n```\nIf the code is too long, feel free to put it in a public gist and link\nit in the issue: https://gist.github.com\n-->\n\n#### Expected Results\n<!-- Example: No error is thrown. Please paste or describe the expected results.-->\n\n#### Actual Results\n<!-- Please paste or specifically describe the actual output or traceback. -->\n\n#### Versions\n<!--\nPlease run the following snippet and paste the output below.\nimport platform; print(platform.platform())\nimport sys; print(\"Python\", sys.version)\nimport numpy; print(\"NumPy\", numpy.__version__)\nimport scipy; print(\"SciPy\", scipy.__version__)\nimport sklearn; print(\"Scikit-Learn\", sklearn.__version__)\nimport imblearn; print(\"Imbalanced-Learn\", imblearn.__version__)\n-->\n\n\n<!-- Thanks for contributing! -->\n"
  },
  {
    "path": ".github/PULL_REQUEST_TEMPLATE.md",
    "content": "<!--\nThanks for contributing a pull request! Please ensure you have taken a look at\nthe contribution guidelines: https://github.com/scikit-learn-contrib/imbalanced-learn/blob/master/CONTRIBUTING.md#contributing-pull-requests\n-->\n#### Reference Issue\n<!-- Example: Fixes #1234 -->\n\n\n#### What does this implement/fix? Explain your changes.\n\n\n#### Any other comments?\n\n\n<!--\nPlease be aware that we are a loose team of volunteers so patience is\nnecessary; assistance handling other issues is very welcome. We value\nall user contributions, no matter how minor they are. If we are slow to\nreview, either the pull request needs some benchmarking, tinkering,\nconvincing, etc. or more likely the reviewers are simply busy. In either\ncase, we ask for your understanding during the review process.\n\nThanks for contributing!\n-->\n"
  },
  {
    "path": ".github/check-changelog.yml",
    "content": "name: Check Changelog\n# This check makes sure that the changelog is properly updated\n# when a PR introduces a change in a test file.\n# To bypass this check, label the PR with \"No Changelog Needed\".\non:\n  pull_request:\n    types: [opened, edited, labeled, unlabeled, synchronize]\n\njobs:\n  check:\n    name: A reviewer will let you know if it is required or can be bypassed\n    runs-on: ubuntu-latest\n    if: ${{ contains(github.event.pull_request.labels.*.name, 'No Changelog Needed') == 0 }}\n    steps:\n      - name: Get PR number and milestone\n        run: |\n          echo \"PR_NUMBER=${{ github.event.pull_request.number }}\" >> $GITHUB_ENV\n          echo \"TAGGED_MILESTONE=${{ github.event.pull_request.milestone.title }}\" >> $GITHUB_ENV\n      - uses: actions/checkout@v4\n        with:\n          fetch-depth: '0'\n      - name: Check the changelog entry\n        run: |\n          set -xe\n          changed_files=$(git diff --name-only origin/main)\n          # Changelog should be updated only if tests have been modified\n          if [[ ! 
\"$changed_files\" =~ tests ]]\n          then\n            exit 0\n          fi\n          all_changelogs=$(cat ./doc/whats_new/v*.rst)\n          if [[ \"$all_changelogs\" =~ :pr:\\`$PR_NUMBER\\` ]]\n          then\n            echo \"Changelog has been updated.\"\n            # If the pull request is milestoned check the correspondent changelog\n            if exist -f ./doc/whats_new/v${TAGGED_MILESTONE:0:4}.rst\n            then\n              expected_changelog=$(cat ./doc/whats_new/v${TAGGED_MILESTONE:0:4}.rst)\n              if [[ \"$expected_changelog\" =~ :pr:\\`$PR_NUMBER\\` ]]\n              then\n                echo \"Changelog and milestone correspond.\"\n              else\n                echo \"Changelog and milestone do not correspond.\"\n                echo \"If you see this error make sure that the tagged milestone for the PR\"\n                echo \"and the edited changelog filename properly match.\"\n                exit 1\n              fi\n            fi\n          else\n            echo \"A Changelog entry is missing.\"\n            echo \"\"\n            echo \"Please add an entry to the changelog at 'doc/whats_new/v*.rst'\"\n            echo \"to document your change assuming that the PR will be merged\"\n            echo \"in time for the next release of imbalanced-learn.\"\n            echo \"\"\n            echo \"Look at other entries in that file for inspiration and please\"\n            echo \"reference this pull request using the ':pr:' directive and\"\n            echo \"credit yourself (and other contributors if applicable) with\"\n            echo \"the ':user:' directive.\"\n            echo \"\"\n            echo \"If you see this error and there is already a changelog entry,\"\n            echo \"check that the PR number is correct.\"\n            echo \"\"\n            echo \"If you believe that this PR does not warrant a changelog\"\n            echo \"entry, say so in a comment so that a maintainer will label\"\n        
    echo \"the PR with 'No Changelog Needed' to bypass this check.\"\n            exit 1\n          fi\n"
  },
  {
    "path": ".github/dependabot.yml",
    "content": "version: 2\nupdates:\n  # Maintain dependencies for GitHub Actions as recommended in SPEC8:\n  # https://github.com/scientific-python/specs/pull/325\n  # At the time of writing, release critical workflows such as\n  # pypa/gh-action-pypi-publish should use hash-based versioning for security\n  # reasons. This strategy may be generalized to all other github actions\n  # in the future.\n  - package-ecosystem: \"github-actions\"\n    directory: \"/\"\n    schedule:\n      interval: \"weekly\"\n    groups:\n      actions:\n        patterns:\n          - \"*\"\n    reviewers:\n      - \"glemaitre\"\n"
  },
  {
    "path": ".github/workflows/circleci-artifacts-redirector.yml",
    "content": "name: CircleCI artifacts redirector\n\non: [status]\n\n# Restrict the permissions granted to the use of secrets.GITHUB_TOKEN in this\n# github actions workflow:\n# https://docs.github.com/en/actions/security-guides/automatic-token-authentication\npermissions:\n  statuses: write\n\njobs:\n  circleci_artifacts_redirector_job:\n    runs-on: ubuntu-latest\n    # For testing this action on a fork, remove the \"github.repository ==\"\" condition.\n    if: \"github.repository == 'scikit-learn-contrib/imbalanced-learn' && github.event.context == 'ci/circleci: doc'\"\n    name: Run CircleCI artifacts redirector\n    steps:\n      - name: GitHub Action step\n        uses: scientific-python/circleci-artifacts-redirector-action@v1\n        with:\n          repo-token: ${{ secrets.GITHUB_TOKEN }}\n          api-token: ${{ secrets.CIRCLE_CI }}\n          artifact-path: 0/doc/index.html\n          circleci-jobs: doc\n          job-title: Check the rendered docs here!\n"
  },
  {
    "path": ".github/workflows/linters.yml",
    "content": "name: Run code format checks\n\non:\n  push:\n    branches:\n      - \"main\"\n  pull_request:\n    branches:\n      - '*'\n\njobs:\n  run-pre-commit-checks:\n    name: Run pre-commit checks\n    runs-on: ubuntu-latest\n\n    steps:\n      - uses: actions/checkout@v6\n      - uses: prefix-dev/setup-pixi@v0.9.3\n        with:\n          pixi-version: v0.51.0\n          frozen: true\n\n      - name: Run tests\n        run: pixi run -e linters linters\n"
  },
  {
    "path": ".github/workflows/tests.yml",
    "content": "name: 'tests'\n\non:\n  push:\n    branches:\n      - \"main\"\n  pull_request:\n    branches:\n      - '*'\n\njobs:\n  test:\n    strategy:\n      matrix:\n        os: [windows-latest, ubuntu-latest, macos-latest]\n        environment: [\n            ci-py310-min-dependencies,\n            ci-py310-min-optional-dependencies,\n            ci-py310-min-keras,\n            ci-py310-min-tensorflow,\n            ci-py311-sklearn-1-4,\n            ci-py311-sklearn-1-5,\n            ci-py312-sklearn-1-6,\n            ci-py311-latest-keras,\n            ci-py311-latest-tensorflow,\n            ci-py314-latest-dependencies,\n            ci-py314-latest-optional-dependencies,\n        ]\n        exclude:\n            - os: windows-latest\n              environment: ci-py310-min-keras\n            - os: windows-latest\n              environment: ci-py310-min-tensorflow\n            - os: windows-latest\n              environment: ci-py311-latest-keras\n            - os: windows-latest\n              environment: ci-py311-latest-tensorflow\n    runs-on: ${{ matrix.os }}\n    steps:\n      - uses: actions/checkout@v6\n      - uses: prefix-dev/setup-pixi@v0.9.3\n        with:\n          pixi-version: v0.51.0\n          environments: ${{ matrix.environment }}\n          # we can freeze the environment and manually bump the dependencies to the\n          # latest version time to time.\n          frozen: true\n\n      - name: Run tests\n        run: pixi run -e ${{ matrix.environment }} tests -n 3\n\n      - name: Upload coverage reports to Codecov\n        uses: codecov/codecov-action@v5.5.2\n        with:\n          token: ${{ secrets.CODECOV_TOKEN }}\n          slug: scikit-learn-contrib/imbalanced-learn\n"
  },
  {
    "path": ".gitignore",
    "content": "# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n\n# C extensions\n*.so\n\n# Distribution / packaging\n.Python\nenv/\nbuild/\ndevelop-eggs/\ndist/\ndownloads/\neggs/\n.eggs/\nlib/\nlib64/\nparts/\nsdist/\nvar/\n*.egg-info/\n.installed.cfg\n*.egg\nPipfile\nPipfile.lock\n\n# PyInstaller\n#  Usually these files are written by a python script from a template\n#  before PyInstaller builds the exe, so as to inject date/other infos into it.\n*.manifest\n*.spec\n\n# Installer logs\npip-log.txt\npip-delete-this-directory.txt\n\n# Unit test / coverage reports\nhtmlcov/\n.tox/\n.coverage\n.coverage.*\n.cache\nnosetests.xml\ncoverage.xml\n*,cover\n.pytest_cache/\n\n# Translations\n*.mo\n*.pot\n\n# Django stuff:\n*.log\n\n# Sphinx documentation\ndocs/_build/\n\n# PyBuilder\ntarget/\n\n# vim\n*.swp\n\n# emacs\n*~\n\n# Visual Studio\n*.sln\n*.pyproj\n*.suo\n*.vs\n.vscode/\n\n# PyCharm\n.idea/\n\n# Cython\n*.pyc\n*.pyo\n__pycache__\n*.so\n*.o\n\n*.egg\n*.egg-info\n\nCython/Compiler/*.c\nCython/Plex/*.c\nCython/Runtime/refnanny.c\nCython/Tempita/*.c\nCython/*.c\n\nTools/*.elc\n\n/TEST_TMP/\n/build/\n/wheelhouse*/\n!tests/build/\n/dist/\n.gitrev\n.coverage\n*.orig\n*.rej\n*.dep\n*.swp\n*~\n\n.ipynb_checkpoints\ndocs/build\n\ntags\nTAGS\nMANIFEST\n\n.tox\n\ncythonize.dat\n\n# build documentation\ndoc/_build/\ndoc/auto_examples/\ndoc/generated/\ndoc/references/generated/\ndoc/bibtex/auto\ndoc/min_dependency_table.rst\n\n# MacOS\n.DS_Store\n\n# Pixi folder\n.pixi/\n\n# Generated files\ndoc/min_dependency_substitutions.rst\ndoc/sg_execution_times.rst\n"
  },
  {
    "path": ".pre-commit-config.yaml",
    "content": "repos:\n-   repo: https://github.com/pre-commit/pre-commit-hooks\n    rev: v4.3.0\n    hooks:\n    -   id: check-yaml\n    -   id: end-of-file-fixer\n    -   id: trailing-whitespace\n-   repo: https://github.com/astral-sh/ruff-pre-commit\n    # Ruff version.\n    rev: v0.4.8\n    hooks:\n    -   id: ruff\n        args: [\"--fix\", \"--output-format=full\"]\n-   repo: https://github.com/psf/black\n    rev: 23.3.0\n    hooks:\n    -   id: black\n"
  },
  {
    "path": "AUTHORS.rst",
    "content": "History\n-------\n\nDevelopment lead\n~~~~~~~~~~~~~~~~\n\nThe project started in August 2014 by Fernando Nogueira and focused on SMOTE implementation.\nTogether with Guillaume Lemaitre, Dayvid Victor, and Christos Aridas, additional under-sampling and over-sampling methods have been implemented as well as major changes in the API to be fully compatible with scikit-learn_.\n\nContributors\n------------\n\nRefers to GitHub contributors page_.\n\n.. _scikit-learn: http://scikit-learn.org\n.. _page: https://github.com/scikit-learn-contrib/imbalanced-learn/graphs/contributors\n"
  },
  {
    "path": "CONTRIBUTING.md",
    "content": "Contributing code\n=================\n\nThis guide is adapted from [scikit-learn](https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md).\n\nHow to contribute\n-----------------\n\nThe preferred way to contribute to imbalanced-learn is to fork the\n[main repository](https://github.com/scikit-learn-contrib/imbalanced-learn) on\nGitHub:\n\n1. Fork the [project repository](https://github.com/scikit-learn-contrib/imbalanced-learn):\n   click on the 'Fork' button near the top of the page. This creates\n   a copy of the code under your account on the GitHub server.\n\n2. Clone this copy to your local disk:\n\n        $ git clone git@github.com:YourLogin/imbalanced-learn.git\n        $ cd imblearn\n\n3. Create a branch to hold your changes:\n\n        $ git checkout -b my-feature\n\n   and start making changes. Never work in the ``master`` branch!\n\n4. Work on this copy on your computer using Git to do the version\n   control. When you're done editing, do:\n\n        $ git add modified_files\n        $ git commit\n\n   to record your changes in Git, then push them to GitHub with:\n\n        $ git push -u origin my-feature\n\nFinally, go to the web page of your fork of the imbalanced-learn repo,\nand click 'Pull request' to send your changes to the maintainers for\nreview. This will send an email to the committers.\n\n(If any of the above seems like magic to you, then look up the\n[Git documentation](https://git-scm.com/documentation) on the web.)\n\nContributing Pull Requests\n--------------------------\n\nIt is recommended to check that your contribution complies with the\nfollowing rules before submitting a pull request:\n\n-  Follow the\n   [coding-guidelines](http://scikit-learn.org/dev/developers/contributing.html#coding-guidelines)\n   as for scikit-learn.\n\n-  When applicable, use the validation tools and other code in the\n   `sklearn.utils` submodule.  
A list of utility routines available\n   for developers can be found in the\n   [Utilities for Developers](http://scikit-learn.org/dev/developers/utilities.html#developers-utils)\n   page.\n\n-  If your pull request addresses an issue, please use the title to describe\n   the issue and mention the issue number in the pull request description to\n   ensure a link is created to the original issue.\n\n-  All public methods should have informative docstrings with sample\n   usage presented as doctests when appropriate.\n\n-  Please prefix the title of your pull request with `[MRG]` if the\n   contribution is complete and should be subjected to a detailed review.\n   Incomplete contributions should be prefixed `[WIP]` to indicate a work\n   in progress (and changed to `[MRG]` when it matures). WIPs may be useful\n   to: indicate you are working on something to avoid duplicated work,\n   request broad review of functionality or API, or seek collaborators.\n   WIPs often benefit from the inclusion of a\n   [task list](https://github.com/blog/1375-task-lists-in-gfm-issues-pulls-comments)\n   in the PR description.\n\n-  All other tests pass when everything is rebuilt from scratch. On\n   Unix-like systems, check with (from the toplevel source folder):\n\n        $ make\n\n-  When adding additional functionality, provide at least one\n   example script in the ``examples/`` folder. Have a look at other\n   examples for reference. 
Examples should demonstrate why the new\n   functionality is useful in practice and, if possible, compare it\n   to other methods available in scikit-learn.\n\n-  Documentation and high-coverage tests are necessary for enhancements\n   to be accepted.\n\n-  At least one paragraph of narrative documentation with links to\n   references in the literature (with PDF links when possible) and\n   the example.\n\nYou can also check for common programming errors with the following\ntools:\n\n-  Code with good unittest coverage (at least 80%), check with:\n\n        $ pip install pytest pytest-cov\n        $ pytest --cov=imblearn imblearn\n\n-  No pyflakes warnings, check with:\n\n        $ pip install pyflakes\n        $ pyflakes path/to/module.py\n\n-  No PEP8 warnings, check with:\n\n        $ pip install pycodestyle\n        $ pycodestyle path/to/module.py\n\n-  AutoPEP8 can help you fix some of the easy redundant errors:\n\n        $ pip install autopep8\n        $ autopep8 path/to/pep8.py\n\nFiling bugs\n-----------\nWe use Github issues to track all bugs and feature requests; feel free to\nopen an issue if you have found a bug or wish to see a feature implemented.\n\nIt is recommended to check that your issue complies with the\nfollowing rules before submitting:\n\n-  Verify that your issue is not being currently addressed by other\n   [issues](https://github.com/scikit-learn-contrib/imbalanced-learn/issues)\n   or [pull requests](https://github.com/scikit-learn-contrib/imbalanced-learn/pulls).\n\n-  Please ensure all code snippets and error messages are formatted in\n   appropriate code blocks.\n   See [Creating and highlighting code blocks](https://help.github.com/articles/creating-and-highlighting-code-blocks).\n\n-  Please include your operating system type and version number, as well\n   as your Python, scikit-learn, numpy, and scipy versions. 
This information\n   can be found by running the following code snippet:\n\n   ```python\n   import platform; print(platform.platform())\n   import sys; print(\"Python\", sys.version)\n   import numpy; print(\"NumPy\", numpy.__version__)\n   import scipy; print(\"SciPy\", scipy.__version__)\n   import sklearn; print(\"Scikit-Learn\", sklearn.__version__)\n   import imblearn; print(\"Imbalanced-Learn\", imblearn.__version__)\n   ```\n\n-  Please be specific about what estimators and/or functions are involved\n   and the shape of the data, as appropriate; please include a\n   [reproducible](https://stackoverflow.com/help/mcve) code snippet\n   or link to a [gist](https://gist.github.com). If an exception is raised,\n   please provide the traceback.\n\nDocumentation\n-------------\n\nWe are glad to accept any sort of documentation: function docstrings,\nreStructuredText documents (like this one), tutorials, etc.\nreStructuredText documents live in the source code repository under the\ndoc/ directory.\n\nYou can edit the documentation using any text editor and then generate\nthe HTML output by typing ``make html`` from the doc/ directory.\nAlternatively, ``make`` can be used to quickly generate the\ndocumentation without the example gallery. The resulting HTML files will\nbe placed in _build/html/ and are viewable in a web browser. See the\nREADME file in the doc/ directory for more information.\n\nFor building the documentation, you will need\n[sphinx](http://sphinx-doc.org),\n[matplotlib](https://matplotlib.org), and\n[pillow](https://pillow.readthedocs.io).\n\nWhen you are writing documentation, it is important to keep a good\ncompromise between mathematical and algorithmic details, and give\nintuition to the reader on what the algorithm does. It is best to always\nstart with a small paragraph with a hand-waving explanation of what the\nmethod does to the data and a figure (coming from an example)\nillustrating it.\n"
  },
  {
    "path": "LICENSE",
    "content": "The MIT License (MIT)\n\nCopyright (c) 2014-2020 The imbalanced-learn developers.\nAll rights reserved.\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n"
  },
  {
    "path": "MANIFEST.in",
    "content": "\nrecursive-include doc *\nrecursive-include examples *\ninclude AUTHORS.rst\ninclude CONTRIBUTING.md\ninclude LICENSE\ninclude README.rst\n"
  },
  {
    "path": "README.rst",
    "content": ".. -*- mode: rst -*-\n\n.. _scikit-learn: http://scikit-learn.org/stable/\n\n.. _scikit-learn-contrib: https://github.com/scikit-learn-contrib\n\n|GitHubActions|_ |Codecov|_ |CircleCI|_ |PythonVersion|_ |Pypi|_ |Gitter|_ |Black|_\n\n.. |GitHubActions| image:: https://github.com/scikit-learn-contrib/imbalanced-learn/actions/workflows/tests.yml/badge.svg\n.. _GitHubActions: https://github.com/scikit-learn-contrib/imbalanced-learn/actions/workflows/tests.yml\n\n.. |Codecov| image:: https://codecov.io/gh/scikit-learn-contrib/imbalanced-learn/branch/master/graph/badge.svg\n.. _Codecov: https://codecov.io/gh/scikit-learn-contrib/imbalanced-learn\n\n.. |CircleCI| image:: https://circleci.com/gh/scikit-learn-contrib/imbalanced-learn.svg?style=shield\n.. _CircleCI: https://circleci.com/gh/scikit-learn-contrib/imbalanced-learn/tree/master\n\n.. |PythonVersion| image:: https://img.shields.io/pypi/pyversions/imbalanced-learn.svg\n.. _PythonVersion: https://img.shields.io/pypi/pyversions/imbalanced-learn.svg\n\n.. |Pypi| image:: https://badge.fury.io/py/imbalanced-learn.svg\n.. _Pypi: https://badge.fury.io/py/imbalanced-learn\n\n.. |Gitter| image:: https://badges.gitter.im/scikit-learn-contrib/imbalanced-learn.svg\n.. _Gitter: https://gitter.im/scikit-learn-contrib/imbalanced-learn?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge\n\n.. |Black| image:: https://img.shields.io/badge/code%20style-black-000000.svg\n.. _Black: :target: https://github.com/psf/black\n\n.. |PythonMinVersion| replace:: 3.10\n.. |NumPyMinVersion| replace:: 1.25.2\n.. |SciPyMinVersion| replace:: 1.11.4\n.. |ScikitLearnMinVersion| replace:: 1.4.2\n.. |MatplotlibMinVersion| replace:: 3.7.3\n.. |PandasMinVersion| replace:: 2.0.3\n.. |TensorflowMinVersion| replace:: 2.16.1\n.. |KerasMinVersion| replace:: 3.3.3\n.. |SeabornMinVersion| replace:: 0.12.2\n.. 
|PytestMinVersion| replace:: 7.2.2\n\nimbalanced-learn\n================\n\nimbalanced-learn is a Python package offering a number of re-sampling techniques\ncommonly used in datasets showing strong between-class imbalance.\nIt is compatible with scikit-learn_ and is part of scikit-learn-contrib_\nprojects.\n\nDocumentation\n-------------\n\nInstallation documentation, API documentation, and examples can be found in the\ndocumentation_.\n\n.. _documentation: https://imbalanced-learn.org/stable/\n\nInstallation\n------------\n\nDependencies\n~~~~~~~~~~~~\n\n`imbalanced-learn` requires the following dependencies:\n\n- Python (>= |PythonMinVersion|)\n- NumPy (>= |NumPyMinVersion|)\n- SciPy (>= |SciPyMinVersion|)\n- Scikit-learn (>= |ScikitLearnMinVersion|)\n- Pytest (>= |PytestMinVersion|)\n\nAdditionally, `imbalanced-learn` supports the following optional dependencies:\n\n- Pandas (>= |PandasMinVersion|) for dealing with dataframes\n- Tensorflow (>= |TensorflowMinVersion|) for dealing with TensorFlow models\n- Keras (>= |KerasMinVersion|) for dealing with Keras models\n\nThe examples require the following additional dependencies:\n\n- Matplotlib (>= |MatplotlibMinVersion|)\n- Seaborn (>= |SeabornMinVersion|)\n\nInstallation\n~~~~~~~~~~~~\n\nFrom PyPi or conda-forge repositories\n.....................................\n\nimbalanced-learn is currently available on PyPI and you can\ninstall it via `pip`::\n\n  pip install -U imbalanced-learn\n\nThe package is also released on the Anaconda Cloud platform::\n\n  conda install -c conda-forge imbalanced-learn\n\nFrom source available on GitHub\n...............................\n\nIf you prefer, you can clone it and run the setup.py file. 
Use the following\ncommands to get a copy from Github and install all dependencies::\n\n  git clone https://github.com/scikit-learn-contrib/imbalanced-learn.git\n  cd imbalanced-learn\n  pip install .\n\nBe aware that you can install in developer mode with::\n\n  pip install --no-build-isolation --editable .\n\nIf you wish to make pull-requests on GitHub, we advise you to install\npre-commit::\n\n  pip install pre-commit\n  pre-commit install\n\nTesting\n~~~~~~~\n\nAfter installation, you can use `pytest` to run the test suite::\n\n  make coverage\n\nDevelopment\n-----------\n\nThe development of this scikit-learn-contrib is in line with the one\nof the scikit-learn community. Therefore, you can refer to their\n`Development Guide\n<http://scikit-learn.org/stable/developers>`_.\n\nEndorsement of the Scientific Python Specification\n--------------------------------------------------\n\nWe endorse good practices from the Scientific Python Ecosystem Coordination (SPEC).\nThe full list of recommendations is available `here`_.\n\nSee below the list of recommendations that we endorse for the imbalanced-learn project.\n\n|SPEC 0 — Minimum Supported Dependencies|\n\n.. |SPEC 0 — Minimum Supported Dependencies| image:: https://img.shields.io/badge/SPEC-0-green?labelColor=%23004811&color=%235CA038\n   :target: https://scientific-python.org/specs/spec-0000/\n\n.. _here: https://scientific-python.org/specs/\n\nAbout\n-----\n\nIf you use imbalanced-learn in a scientific publication, we would appreciate\ncitations to the following paper::\n\n  @article{JMLR:v18:16-365,\n  author  = {Guillaume  Lema{{\\^i}}tre and Fernando Nogueira and Christos K. 
Aridas},\n  title   = {Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning},\n  journal = {Journal of Machine Learning Research},\n  year    = {2017},\n  volume  = {18},\n  number  = {17},\n  pages   = {1-5},\n  url     = {http://jmlr.org/papers/v18/16-365}\n  }\n\nMost classification algorithms will only perform optimally when the number of\nsamples of each class is roughly the same. Highly skewed datasets, where the\nminority class is heavily outnumbered by one or more classes, have proven to be a\nchallenge while at the same time becoming more and more common.\n\nOne way of addressing this issue is by re-sampling the dataset so as to offset this\nimbalance, in the hope of arriving at a more robust and fair decision boundary\nthan you would otherwise obtain.\n\nYou can refer to the `imbalanced-learn`_ documentation to find details about\nthe implemented algorithms.\n\n.. _imbalanced-learn: https://imbalanced-learn.org/stable/user_guide.html\n"
  },
  {
    "path": "build_tools/circle/build_doc.sh",
    "content": "#!/usr/bin/env bash\nset -x\nset -e\n\n# deactivate the circleci virtualenv; a pixi environment is used instead\nif [[ `type -t deactivate` ]]; then\n    deactivate\nfi\n\n# Install pixi\ncurl -fsSL https://pixi.sh/install.sh | bash\nexport PATH=/home/circleci/.pixi/bin:$PATH\n\n# pipefail is required so that the exit code of the build is propagated\n# through the pipe to tee\nset -o pipefail && pixi run --frozen -e docs build-docs 2>&1 | tee ~/log.txt\nset +o pipefail\n"
  },
  {
    "path": "build_tools/circle/checkout_merge_commit.sh",
    "content": "#!/bin/bash\n\n# Add `master` branch to the update list.\n# Otherwise CircleCI will give us a cached one.\nFETCH_REFS=\"+master:master\"\n\n# Update PR refs for testing.\nif [[ -n \"${CIRCLE_PR_NUMBER}\" ]]\nthen\n    FETCH_REFS=\"${FETCH_REFS} +refs/pull/${CIRCLE_PR_NUMBER}/head:pr/${CIRCLE_PR_NUMBER}/head\"\n    FETCH_REFS=\"${FETCH_REFS} +refs/pull/${CIRCLE_PR_NUMBER}/merge:pr/${CIRCLE_PR_NUMBER}/merge\"\nfi\n\n# Retrieve the refs.\ngit fetch -u origin ${FETCH_REFS}\n\n# Checkout the PR merge ref.\nif [[ -n \"${CIRCLE_PR_NUMBER}\" ]]\nthen\n    git checkout -qf \"pr/${CIRCLE_PR_NUMBER}/merge\" || (\n        echo Could not fetch merge commit. >&2\n        echo There may be conflicts in merging PR \\#${CIRCLE_PR_NUMBER} with master. >&2;\n        exit 1)\nfi\n\n# Check for merge conflicts.\nif [[ -n \"${CIRCLE_PR_NUMBER}\" ]]\nthen\n    git branch --merged | grep master > /dev/null\n    git branch --merged | grep \"pr/${CIRCLE_PR_NUMBER}/head\" > /dev/null\nfi\n"
  },
  {
    "path": "build_tools/circle/linting.sh",
    "content": "#!/bin/bash\n\n# This script is used in CircleCI to check that PRs do not add obvious\n# flake8 violations. It relies on two things:\n#   - find common ancestor between branch and\n#     scikit-learn/scikit-learn remote\n#   - run flake8 --diff on the diff between the branch and the common\n#     ancestor\n#\n# Additional features:\n#   - the line numbers in Travis match the local branch on the PR\n#     author machine.\n#   - ./build_tools/circle/flake8_diff.sh can be run locally for quick\n#     turn-around\n\nset -e\n# pipefail is necessary to propagate exit codes\nset -o pipefail\n\nPROJECT=scikit-learn-contrib/imbalanced-learn\nPROJECT_URL=https://github.com/$PROJECT.git\n\n# Find the remote with the project name (upstream in most cases)\nREMOTE=$(git remote -v | grep $PROJECT | cut -f1 | head -1 || echo '')\n\n# Add a temporary remote if needed. For example this is necessary when\n# Travis is configured to run in a fork. In this case 'origin' is the\n# fork and not the reference repo we want to diff against.\nif [[ -z \"$REMOTE\" ]]; then\n    TMP_REMOTE=tmp_reference_upstream\n    REMOTE=$TMP_REMOTE\n    git remote add $REMOTE $PROJECT_URL\nfi\n\necho \"Remotes:\"\necho '--------------------------------------------------------------------------------'\ngit remote --verbose\n\n# Travis does the git clone with a limited depth (50 at the time of\n# writing). 
This may not be enough to find the common ancestor with\n# $REMOTE/master so we unshallow the git checkout\nif [[ -a .git/shallow ]]; then\n    echo -e '\\nTrying to unshallow the repo:'\n    echo '--------------------------------------------------------------------------------'\n    git fetch --unshallow\nfi\n\nif [[ \"$TRAVIS\" == \"true\" ]]; then\n    if [[ \"$TRAVIS_PULL_REQUEST\" == \"false\" ]]\n    then\n        # In main repo, using TRAVIS_COMMIT_RANGE to test the commits\n        # that were pushed into a branch\n        if [[ \"$PROJECT\" == \"$TRAVIS_REPO_SLUG\" ]]; then\n            if [[ -z \"$TRAVIS_COMMIT_RANGE\" ]]; then\n                echo \"New branch, no commit range from Travis so passing this test by convention\"\n                exit 0\n            fi\n            COMMIT_RANGE=$TRAVIS_COMMIT_RANGE\n        fi\n    else\n        # We want to fetch the code as it is in the PR branch and not\n        # the result of the merge into master. This way line numbers\n        # reported by Travis will match with the local code.\n        LOCAL_BRANCH_REF=travis_pr_$TRAVIS_PULL_REQUEST\n        # In Travis the PR target is always origin\n        git fetch origin pull/$TRAVIS_PULL_REQUEST/head:refs/$LOCAL_BRANCH_REF\n    fi\nfi\n\n# If not using the commit range from Travis we need to find the common\n# ancestor between $LOCAL_BRANCH_REF and $REMOTE/master\nif [[ -z \"$COMMIT_RANGE\" ]]; then\n    if [[ -z \"$LOCAL_BRANCH_REF\" ]]; then\n        LOCAL_BRANCH_REF=$(git rev-parse --abbrev-ref HEAD)\n    fi\n    echo -e \"\\nLast 2 commits in $LOCAL_BRANCH_REF:\"\n    echo '--------------------------------------------------------------------------------'\n    git --no-pager log -2 $LOCAL_BRANCH_REF\n\n    REMOTE_MASTER_REF=\"$REMOTE/master\"\n    # Make sure that $REMOTE_MASTER_REF is a valid reference\n    echo -e \"\\nFetching $REMOTE_MASTER_REF\"\n    echo '--------------------------------------------------------------------------------'\n    git fetch 
$REMOTE master:refs/remotes/$REMOTE_MASTER_REF\n    LOCAL_BRANCH_SHORT_HASH=$(git rev-parse --short $LOCAL_BRANCH_REF)\n    REMOTE_MASTER_SHORT_HASH=$(git rev-parse --short $REMOTE_MASTER_REF)\n\n    COMMIT=$(git merge-base $LOCAL_BRANCH_REF $REMOTE_MASTER_REF) || \\\n        echo \"No common ancestor found for $(git show $LOCAL_BRANCH_REF -q) and $(git show $REMOTE_MASTER_REF -q)\"\n\n    if [ -z \"$COMMIT\" ]; then\n        exit 1\n    fi\n\n    COMMIT_SHORT_HASH=$(git rev-parse --short $COMMIT)\n\n    echo -e \"\\nCommon ancestor between $LOCAL_BRANCH_REF ($LOCAL_BRANCH_SHORT_HASH)\"\\\n         \"and $REMOTE_MASTER_REF ($REMOTE_MASTER_SHORT_HASH) is $COMMIT_SHORT_HASH:\"\n    echo '--------------------------------------------------------------------------------'\n    git --no-pager show --no-patch $COMMIT_SHORT_HASH\n\n    COMMIT_RANGE=\"$COMMIT_SHORT_HASH..$LOCAL_BRANCH_SHORT_HASH\"\n\n    if [[ -n \"$TMP_REMOTE\" ]]; then\n        git remote remove $TMP_REMOTE\n    fi\n\nelse\n    echo \"Got the commit range from Travis: $COMMIT_RANGE\"\nfi\n\necho -e '\\nRunning flake8 on the diff in the range' \"$COMMIT_RANGE\" \\\n     \"($(git rev-list $COMMIT_RANGE | wc -l) commit(s)):\"\necho '--------------------------------------------------------------------------------'\n\n# We ignore files from sklearn/externals. Unfortunately there is no\n# way to do it with flake8 directly (the --exclude does not seem to\n# work with --diff). 
We could use the exclude magic in the git pathspec\n# ':!sklearn/externals' but it is only available on git 1.9 and Travis\n# uses git 1.8.\n# We need the following command to exit with 0 hence the echo in case\n# there is no match\nMODIFIED_FILES=\"$(git diff --name-only $COMMIT_RANGE | grep -v 'sklearn/externals' | \\\n                     grep -v 'doc/sphinxext' || echo \"no_match\")\"\n\ncheck_files() {\n    files=\"$1\"\n    shift\n    options=\"$*\"\n    if [ -n \"$files\" ]; then\n        # Conservative approach: diff without context (--unified=0) so that code\n        # that was not changed does not create failures\n        git diff --unified=0 $COMMIT_RANGE -- $files | flake8 --diff --max-line-length=88 --show-source $options\n    fi\n}\n\nif [[ \"$MODIFIED_FILES\" == \"no_match\" ]]; then\n    echo \"No file outside sklearn/externals and doc/sphinxext has been modified\"\nelse\n\n    check_files \"$(echo \"$MODIFIED_FILES\" | grep -v ^examples)\"\n    check_files \"$(echo \"$MODIFIED_FILES\" | grep ^examples)\" \\\n        --config ./setup.cfg\nfi\necho -e \"No problem detected by flake8\\n\"\n\n# For docstrings and warnings of deprecated attributes to be rendered\n# properly, the property decorator must come before the deprecated decorator\n# (else they are treated as functions)\n\n# do not error when grep -B1 \"@property\" finds nothing\nset +e\nbad_deprecation_property_order=`git grep -A 10 \"@property\"  -- \"*.py\" | awk '/@property/,/def /' | grep -B1 \"@deprecated\"`\n\nif [ ! -z \"$bad_deprecation_property_order\" ]\nthen\n    echo \"property decorator should come before deprecated decorator\"\n    echo \"found the following occurrences:\"\n    echo \"$bad_deprecation_property_order\"\n    exit 1\nfi\n"
  },
  {
    "path": "build_tools/circle/push_doc.sh",
    "content": "#!/bin/bash\n# This script is meant to be called in the \"deploy\" job defined in\n# .circleci/config.yml. See https://circleci.com/docs/ for more details.\n# The behavior of the script is controlled by environment variables defined\n# in .circleci/config.yml.\n\nGENERATED_DOC_DIR=$1\n\nif [[ -z \"$GENERATED_DOC_DIR\" ]]; then\n    echo \"Need to pass directory of the generated doc as argument\"\n    echo \"Usage: $0 <generated_doc_dir>\"\n    exit 1\nfi\n\n# Absolute path needed because we use cd further down in this script\nGENERATED_DOC_DIR=$(readlink -f \"$GENERATED_DOC_DIR\")\n\nif [ \"$CIRCLE_BRANCH\" = \"master\" ]\nthen\n    dir=dev\nelse\n    # Strip off .X\n    dir=\"${CIRCLE_BRANCH::-2}\"\nfi\n\nMSG=\"Pushing the docs to $dir/ for branch: $CIRCLE_BRANCH, commit $CIRCLE_SHA1\"\n\ncd $HOME\nif [ ! -d $DOC_REPO ];\nthen git clone --depth 1 --no-checkout -b master \"git@github.com:\"$ORGANIZATION\"/\"$DOC_REPO\".git\";\nfi\ncd $DOC_REPO\ngit config core.sparseCheckout true\necho $dir > .git/info/sparse-checkout\ngit checkout master\ngit reset --hard origin/master\ngit rm -rf $dir/ && rm -rf $dir/\ncp -R $GENERATED_DOC_DIR $dir\ntouch $dir/.nojekyll\ngit config --global user.email $EMAIL\ngit config --global user.name $USERNAME\ngit config --global push.default matching\ngit add -f $dir/\ngit commit -m \"$MSG\" $dir\ngit push origin master\n\necho $MSG\n"
  },
  {
    "path": "conftest.py",
    "content": "# This file is here so that when running from the root folder\n# ./imblearn is added to sys.path by pytest.\n# See https://docs.pytest.org/en/latest/pythonpath.html for more details.\n# For example, this allows to build extensions in place and run pytest\n# doc/modules/clustering.rst and use imblearn from the local folder\n# rather than the one from site-packages.\n\nimport os\n\nimport numpy as np\nimport pytest\nfrom sklearn.utils.fixes import parse_version\n\n# use legacy numpy print options to avoid failures due to NumPy 2.+ scalar\n# representation (the comparison is inclusive so that NumPy 2.0.0 is covered)\nif parse_version(np.__version__) >= parse_version(\"2.0.0\"):\n    np.set_printoptions(legacy=\"1.25\")\n\n\ndef pytest_runtest_setup(item):\n    fname = item.fspath.strpath\n    if (\n        fname.endswith(os.path.join(\"keras\", \"_generator.py\"))\n        or fname.endswith(os.path.join(\"tensorflow\", \"_generator.py\"))\n        or fname.endswith(\"miscellaneous.rst\")\n    ):\n        try:\n            import tensorflow  # noqa\n        except ImportError:\n            pytest.skip(\"The tensorflow package is not installed.\")\n"
  },
  {
    "path": "doc/Makefile",
    "content": "# Makefile for Sphinx documentation\n#\n\n# You can set these variables from the command line.\nSPHINXOPTS    = -v\nSPHINXBUILD   = sphinx-build\nPAPER         =\nBUILDDIR      = _build\n\n# User-friendly check for sphinx-build\nifeq ($(shell which $(SPHINXBUILD) >/dev/null 2>&1; echo $$?), 1)\n$(error The '$(SPHINXBUILD)' command was not found. Make sure you have Sphinx installed, then set the SPHINXBUILD environment variable to point to the full path of the '$(SPHINXBUILD)' executable. Alternatively you can add the directory with the executable to your PATH. If you don't have Sphinx installed, grab it from http://sphinx-doc.org/)\nendif\n\n# Internal variables.\nPAPEROPT_a4     = -D latex_paper_size=a4\nPAPEROPT_letter = -D latex_paper_size=letter\nALLSPHINXOPTS   = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) .\n# the i18n builder cannot share the environment and doctrees with the others\nI18NSPHINXOPTS  = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) .\n\n.PHONY: help clean html dirhtml singlehtml pickle json htmlhelp qthelp devhelp epub latex latexpdf text man changes linkcheck doctest gettext\n\nhelp:\n\t@echo \"Please use \\`make <target>' where <target> is one of\"\n\t@echo \"  html       to make standalone HTML files\"\n\t@echo \"  dirhtml    to make HTML files named index.html in directories\"\n\t@echo \"  singlehtml to make a single large HTML file\"\n\t@echo \"  pickle     to make pickle files\"\n\t@echo \"  json       to make JSON files\"\n\t@echo \"  htmlhelp   to make HTML files and a HTML help project\"\n\t@echo \"  qthelp     to make HTML files and a qthelp project\"\n\t@echo \"  devhelp    to make HTML files and a Devhelp project\"\n\t@echo \"  epub       to make an epub\"\n\t@echo \"  latex      to make LaTeX files, you can set PAPER=a4 or PAPER=letter\"\n\t@echo \"  latexpdf   to make LaTeX files and run them through pdflatex\"\n\t@echo \"  latexpdfja to make LaTeX files and run them through platex/dvipdfmx\"\n\t@echo \"  text 
      to make text files\"\n\t@echo \"  man        to make manual pages\"\n\t@echo \"  texinfo    to make Texinfo files\"\n\t@echo \"  info       to make Texinfo files and run them through makeinfo\"\n\t@echo \"  gettext    to make PO message catalogs\"\n\t@echo \"  changes    to make an overview of all changed/added/deprecated items\"\n\t@echo \"  xml        to make Docutils-native XML files\"\n\t@echo \"  pseudoxml  to make pseudoxml-XML files for display purposes\"\n\t@echo \"  linkcheck  to check all external links for integrity\"\n\t@echo \"  doctest    to run all doctests embedded in the documentation (if enabled)\"\n\nclean:\n\t-rm -rf $(BUILDDIR)/*\n\t-rm -rf auto_examples/\n\t-rm -rf generated/*\n\t-rm -rf modules/generated/*\n\nhtml:\n\t# These two lines make the build a bit more lengthy, and the\n\t# the embedding of images more robust\n\trm -rf $(BUILDDIR)/html/_images\n\t#rm -rf _build/doctrees/\n\t$(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html\n\ttouch $(BUILDDIR)/html/.nojekyll\n\t@echo\n\t@echo \"Build finished. The HTML pages are in $(BUILDDIR)/html.\"\n\ndirhtml:\n\t$(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml\n\t@echo\n\t@echo \"Build finished. The HTML pages are in $(BUILDDIR)/dirhtml.\"\n\nsinglehtml:\n\t$(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml\n\t@echo\n\t@echo \"Build finished. 
The HTML page is in $(BUILDDIR)/singlehtml.\"\n\npickle:\n\t$(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle\n\t@echo\n\t@echo \"Build finished; now you can process the pickle files.\"\n\njson:\n\t$(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json\n\t@echo\n\t@echo \"Build finished; now you can process the JSON files.\"\n\nhtmlhelp:\n\t$(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp\n\t@echo\n\t@echo \"Build finished; now you can run HTML Help Workshop with the\" \\\n\t      \".hhp project file in $(BUILDDIR)/htmlhelp.\"\n\nqthelp:\n\t$(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp\n\t@echo\n\t@echo \"Build finished; now you can run \"qcollectiongenerator\" with the\" \\\n\t      \".qhcp project file in $(BUILDDIR)/qthelp, like this:\"\n\t@echo \"# qcollectiongenerator $(BUILDDIR)/qthelp/imbalanced-learn.qhcp\"\n\t@echo \"To view the help file:\"\n\t@echo \"# assistant -collectionFile $(BUILDDIR)/qthelp/imbalanced-learn.qhc\"\n\ndevhelp:\n\t$(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp\n\t@echo\n\t@echo \"Build finished.\"\n\t@echo \"To view the help file:\"\n\t@echo \"# mkdir -p $$HOME/.local/share/devhelp/imbalanced-learn\"\n\t@echo \"# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/imbalanced-learn\"\n\t@echo \"# devhelp\"\n\nepub:\n\t$(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub\n\t@echo\n\t@echo \"Build finished. 
The epub file is in $(BUILDDIR)/epub.\"\n\nlatex:\n\t$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex\n\t@echo\n\t@echo \"Build finished; the LaTeX files are in $(BUILDDIR)/latex.\"\n\t@echo \"Run \\`make' in that directory to run these through (pdf)latex\" \\\n\t      \"(use \\`make latexpdf' here to do that automatically).\"\n\nlatexpdf:\n\t$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex\n\t@echo \"Running LaTeX files through pdflatex...\"\n\t$(MAKE) -C $(BUILDDIR)/latex all-pdf\n\t@echo \"pdflatex finished; the PDF files are in $(BUILDDIR)/latex.\"\n\nlatexpdfja:\n\t$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex\n\t@echo \"Running LaTeX files through platex and dvipdfmx...\"\n\t$(MAKE) -C $(BUILDDIR)/latex all-pdf-ja\n\t@echo \"pdflatex finished; the PDF files are in $(BUILDDIR)/latex.\"\n\ntext:\n\t$(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text\n\t@echo\n\t@echo \"Build finished. The text files are in $(BUILDDIR)/text.\"\n\nman:\n\t$(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man\n\t@echo\n\t@echo \"Build finished. The manual pages are in $(BUILDDIR)/man.\"\n\ntexinfo:\n\t$(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo\n\t@echo\n\t@echo \"Build finished. The Texinfo files are in $(BUILDDIR)/texinfo.\"\n\t@echo \"Run \\`make' in that directory to run these through makeinfo\" \\\n\t      \"(use \\`make info' here to do that automatically).\"\n\ninfo:\n\t$(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo\n\t@echo \"Running Texinfo files through makeinfo...\"\n\tmake -C $(BUILDDIR)/texinfo info\n\t@echo \"makeinfo finished; the Info files are in $(BUILDDIR)/texinfo.\"\n\ngettext:\n\t$(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale\n\t@echo\n\t@echo \"Build finished. 
The message catalogs are in $(BUILDDIR)/locale.\"\n\nchanges:\n\t$(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes\n\t@echo\n\t@echo \"The overview file is in $(BUILDDIR)/changes.\"\n\nlinkcheck:\n\t$(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck\n\t@echo\n\t@echo \"Link check complete; look for any errors in the above output \" \\\n\t      \"or in $(BUILDDIR)/linkcheck/output.txt.\"\n\ndoctest:\n\t$(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest\n\t@echo \"Testing of doctests in the sources finished, look at the \" \\\n\t      \"results in $(BUILDDIR)/doctest/output.txt.\"\n\nxml:\n\t$(SPHINXBUILD) -b xml $(ALLSPHINXOPTS) $(BUILDDIR)/xml\n\t@echo\n\t@echo \"Build finished. The XML files are in $(BUILDDIR)/xml.\"\n\npseudoxml:\n\t$(SPHINXBUILD) -b pseudoxml $(ALLSPHINXOPTS) $(BUILDDIR)/pseudoxml\n\t@echo\n\t@echo \"Build finished. The pseudo-XML files are in $(BUILDDIR)/pseudoxml.\"\n"
  },
  {
    "path": "doc/_static/css/imbalanced-learn.css",
    "content": "@import url(\"theme.css\");\n\n.highlight a {\n  text-decoration: underline;\n}\n\n.deprecated p {\n  padding: 10px 7px 10px 10px;\n  color: #b94a48;\n  background-color: #f3e5e5;\n  border: 1px solid #eed3d7;\n}\n\n.deprecated p span.versionmodified {\n  font-weight: bold;\n}\n\n.wy-nav-content {\n  max-width: 1200px !important;\n}\n\n/* Override some aspects of the pydata-sphinx-theme */\n\n/* Main index page overview cards */\n\n.intro-card {\n  padding: 30px 10px 20px 10px;\n}\n\n.intro-card .sd-card-img-top {\n  margin: 10px;\n  height: 52px;\n  background: none !important;\n}\n\n.intro-card .sd-card-title {\n  color: var(--pst-color-primary);\n  font-size: var(--pst-font-size-h5);\n  padding: 1rem 0rem 0.5rem 0rem;\n}\n\n.intro-card .sd-card-footer {\n  border: none !important;\n}\n\n.intro-card .sd-card-footer p.sd-card-text {\n  max-width: 220px;\n  margin-left: auto;\n  margin-right: auto;\n}\n\n.intro-card .sd-btn-secondary {\n  background-color: #6c757d !important;\n  border-color: #6c757d !important;\n}\n\n.intro-card .sd-btn-secondary:hover {\n  background-color: #5a6268 !important;\n  border-color: #545b62 !important;\n}\n\n.card, .card img {\n  background-color: var(--pst-color-background);\n}\n"
  },
  {
    "path": "doc/_static/js/copybutton.js",
    "content": "$(document).ready(function() {\n    /* Add a [>>>] button on the top-right corner of code samples to hide\n     * the >>> and ... prompts and the output and thus make the code\n     * copyable. */\n    var div = $('.highlight-python .highlight,' +\n                '.highlight-python3 .highlight,' +\n                '.highlight-pycon .highlight,' +\n\t\t'.highlight-default .highlight')\n    var pre = div.find('pre');\n\n    // get the styles from the current theme\n    pre.parent().parent().css('position', 'relative');\n    var hide_text = 'Hide the prompts and output';\n    var show_text = 'Show the prompts and output';\n    var border_width = pre.css('border-top-width');\n    var border_style = pre.css('border-top-style');\n    var border_color = pre.css('border-top-color');\n    var button_styles = {\n        'cursor':'pointer', 'position': 'absolute', 'top': '0', 'right': '0',\n        'border-color': border_color, 'border-style': border_style,\n        'border-width': border_width, 'color': border_color, 'text-size': '75%',\n        'font-family': 'monospace', 'padding-left': '0.2em', 'padding-right': '0.2em',\n        'border-radius': '0 3px 0 0'\n    }\n\n    // create and add the button to all the code blocks that contain >>>\n    div.each(function(index) {\n        var jthis = $(this);\n        if (jthis.find('.gp').length > 0) {\n            var button = $('<span class=\"copybutton\">&gt;&gt;&gt;</span>');\n            button.css(button_styles)\n            button.attr('title', hide_text);\n            button.data('hidden', 'false');\n            jthis.prepend(button);\n        }\n        // tracebacks (.gt) contain bare text elements that need to be\n        // wrapped in a span to work with .nextUntil() (see later)\n        jthis.find('pre:has(.gt)').contents().filter(function() {\n            return ((this.nodeType == 3) && (this.data.trim().length > 0));\n        }).wrap('<span>');\n    });\n\n    // define the behavior of the button 
when it's clicked\n    $('.copybutton').click(function(e){\n        e.preventDefault();\n        var button = $(this);\n        if (button.data('hidden') === 'false') {\n            // hide the code output\n            button.parent().find('.go, .gp, .gt').hide();\n            button.next('pre').find('.gt').nextUntil('.gp, .go').css('visibility', 'hidden');\n            button.css('text-decoration', 'line-through');\n            button.attr('title', show_text);\n            button.data('hidden', 'true');\n        } else {\n            // show the code output\n            button.parent().find('.go, .gp, .gt').show();\n            button.next('pre').find('.gt').nextUntil('.gp, .go').css('visibility', 'visible');\n            button.css('text-decoration', 'none');\n            button.attr('title', hide_text);\n            button.data('hidden', 'false');\n        }\n    });\n});\n"
  },
  {
    "path": "doc/_templates/class.rst",
    "content": "{{objname}}\n{{ underline }}==============\n\n.. currentmodule:: {{ module }}\n\n.. autoclass:: {{ objname }}\n\n   {% block methods %}\n\n   {% if methods %}\n   .. rubric:: Methods\n\n   .. autosummary::\n   {% for item in methods %}\n      {% if '__init__' not in item %}\n        ~{{ name }}.{{ item }}\n      {% endif %}\n   {%- endfor %}\n   {% endif %}\n   {% endblock %}\n\n.. include:: {{module}}.{{objname}}.examples\n\n.. raw:: html\n\n    <div style='clear:both'></div>\n"
  },
  {
    "path": "doc/_templates/function.rst",
    "content": "{{objname}}\n{{ underline }}====================\n\n.. currentmodule:: {{ module }}\n\n.. autofunction:: {{ objname }}\n\n.. include:: {{module}}.{{objname}}.examples\n\n.. raw:: html\n\n    <div style='clear:both'></div>\n"
  },
  {
    "path": "doc/_templates/numpydoc_docstring.rst",
    "content": "{{index}}\n{{summary}}\n{{extended_summary}}\n{{parameters}}\n{{returns}}\n{{yields}}\n{{other_parameters}}\n{{attributes}}\n{{raises}}\n{{warns}}\n{{warnings}}\n{{see_also}}\n{{notes}}\n{{references}}\n{{examples}}\n{{methods}}\n"
  },
  {
    "path": "doc/_templates/sidebar-search-bs.html",
    "content": "<div class=\"navbar-brand-box\">\n  <a class=\"navbar-brand-box text-wrap\" href=\"{{ pathto('index') }}\">\n    {% if logo %}\n    <img\n      src=\"{{ pathto('_static/' + logo, 1) }}\"\n      class=\"logo\"\n      style=\"width: 60%\"\n      alt=\"logo\"\n    />\n    {% endif %} {% if docstitle %}\n    <h4 class=\"site-logo\" id=\"site-title\">{{ docstitle }}</h4>\n    {% endif %}\n  </a>\n</div>\n"
  },
  {
    "path": "doc/about.rst",
    "content": "About us\n========\n\n.. include:: ../AUTHORS.rst\n\n.. _citing-imbalanced-learn:\n\nCiting imbalanced-learn\n-----------------------\n\nIf you use imbalanced-learn in a scientific publication, we would appreciate\ncitations to the following paper::\n\n  @article{JMLR:v18:16-365,\n  author  = {Guillaume  Lema{{\\^i}}tre and Fernando Nogueira and Christos K. Aridas},\n  title   = {Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning},\n  journal = {Journal of Machine Learning Research},\n  year    = {2017},\n  volume  = {18},\n  number  = {17},\n  pages   = {1-5},\n  url     = {http://jmlr.org/papers/v18/16-365.html}\n  }\n"
  },
  {
    "path": "doc/bibtex/refs.bib",
    "content": "@inproceedings{mani2003knn,\n  title={kNN approach to unbalanced data distributions: a case study involving information extraction},\n  author={Mani, Inderjeet and Zhang, I},\n  booktitle={Proceedings of workshop on learning from imbalanced datasets},\n  volume={126},\n  year={2003}\n}\n\n\n@article{batista2004study,\n  title={A study of the behavior of several methods for balancing machine learning training data},\n  author={Batista, Gustavo EAPA and Prati, Ronaldo C and Monard, Maria Carolina},\n  journal={ACM SIGKDD explorations newsletter},\n  volume={6},\n  number={1},\n  pages={20--29},\n  year={2004},\n  publisher={ACM}\n}\n\n@inproceedings{batista2003balancing,\n  title={Balancing Training Data for Automated Annotation of Keywords: a Case Study.},\n  author={Batista, Gustavo EAPA and Bazzan, Ana LC and Monard, Maria Carolina},\n  booktitle={WOB},\n  pages={10--18},\n  year={2003}\n}\n\n@article{chen2004using,\n  title={Using random forest to learn imbalanced data},\n  author={Chen, Chao and Liaw, Andy and Breiman, Leo and others},\n  journal={University of California, Berkeley},\n  volume={110},\n  number={1-12},\n  pages={24},\n  year={2004}\n}\n\n@article{liu2008exploratory,\n  title={Exploratory undersampling for class-imbalance learning},\n  author={Liu, Xu-Ying and Wu, Jianxin and Zhou, Zhi-Hua},\n  journal={IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics)},\n  volume={39},\n  number={2},\n  pages={539--550},\n  year={2008},\n  publisher={IEEE}\n}\n\n@article{seiffert2009rusboost,\n  title={RUSBoost: A hybrid approach to alleviating class imbalance},\n  author={Seiffert, Chris and Khoshgoftaar, Taghi M and Van Hulse, Jason and Napolitano, Amri},\n  journal={IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans},\n  volume={40},\n  number={1},\n  pages={185--197},\n  year={2009},\n  publisher={IEEE}\n}\n\n@inproceedings{kubat1997addressing,\n  title={Addressing the curse of imbalanced 
training sets: one-sided selection},\n  author={Kubat, Miroslav and Matwin, Stan and others},\n  booktitle={ICML},\n  volume={97},\n  pages={179--186},\n  year={1997},\n  organization={Nashville, USA}\n}\n\n@article{barandela2003strategies,\n  title={Strategies for learning in class imbalance problems},\n  author={Barandela, Ricardo and S{\\'a}nchez, Jos{\\'e} Salvador and Garc{\\'\\i}a, V and Rangel, Edgar},\n  journal={Pattern Recognition},\n  volume={36},\n  number={3},\n  pages={849--851},\n  year={2003},\n  publisher={Elsevier Science Publishing Company, Inc.}\n}\n\n@article{garcia2012effectiveness,\n  title={On the effectiveness of preprocessing methods when dealing with different levels of class imbalance},\n  author={Garc{\\'\\i}a, Vicente and S{\\'a}nchez, Jos{\\'e} Salvador and Mollineda, Ram{\\'o}n Alberto},\n  journal={Knowledge-Based Systems},\n  volume={25},\n  number={1},\n  pages={13--21},\n  year={2012},\n  publisher={Elsevier}\n}\n\n@inproceedings{he2008adasyn,\n  title={ADASYN: Adaptive synthetic sampling approach for imbalanced learning},\n  author={He, Haibo and Bai, Yang and Garcia, Edwardo A and Li, Shutao},\n  booktitle={2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence)},\n  pages={1322--1328},\n  year={2008},\n  organization={IEEE}\n}\n\n@article{chawla2002smote,\n  title={SMOTE: synthetic minority over-sampling technique},\n  author={Chawla, Nitesh V and Bowyer, Kevin W and Hall, Lawrence O and Kegelmeyer, W Philip},\n  journal={Journal of artificial intelligence research},\n  volume={16},\n  pages={321--357},\n  year={2002}\n}\n\n@inproceedings{han2005borderline,\n  title={Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning},\n  author={Han, Hui and Wang, Wen-Yuan and Mao, Bing-Huan},\n  booktitle={International conference on intelligent computing},\n  pages={878--887},\n  year={2005},\n  organization={Springer}\n}\n\n@inproceedings{nguyen2009borderline,\n  
title={Borderline over-sampling for imbalanced data classification},\n  author={Nguyen, Hien M and Cooper, Eric W and Kamei, Katsuari},\n  booktitle={Proceedings: Fifth International Workshop on Computational Intelligence \\& Applications},\n  volume={2009},\n  number={1},\n  pages={24--29},\n  year={2009},\n  organization={IEEE SMC Hiroshima Chapter}\n}\n\n@article{last2017oversampling,\n  title={Oversampling for Imbalanced Learning Based on K-Means and SMOTE},\n  author={Last, Felix and Douzas, Georgios and Bacao, Fernando},\n  journal={arXiv preprint arXiv:1711.00837},\n  year={2017}\n}\n\n@article{tomek1976two,\n  title={Two modifications of CNN},\n  author={Tomek, Ivan},\n  journal={IEEE Trans. Systems, Man and Cybernetics},\n  volume={6},\n  pages={769--772},\n  year={1976}\n}\n\n@article{wilson1972asymptotic,\n  title={Asymptotic properties of nearest neighbor rules using edited data},\n  author={Wilson, Dennis L},\n  journal={IEEE Transactions on Systems, Man, and Cybernetics},\n  number={3},\n  pages={408--421},\n  year={1972},\n  publisher={IEEE}\n}\n\n@article{tomek1976experiment,\n  title={An experiment with the edited nearest-neighbor rule},\n  author={Tomek, Ivan},\n  journal={IEEE Transactions on systems, Man, and Cybernetics},\n  volume={6},\n  number={6},\n  pages={448--452},\n  year={1976}\n}\n\n@article{hart1968condensed,\n  title={The condensed nearest neighbor rule (Corresp.)},\n  author={Hart, Peter},\n  journal={IEEE transactions on information theory},\n  volume={14},\n  number={3},\n  pages={515--516},\n  year={1968},\n  publisher={Citeseer}\n}\n\n@inproceedings{laurikkala2001improving,\n  title={Improving identification of difficult small classes by balancing class distribution},\n  author={Laurikkala, Jorma},\n  booktitle={Conference on Artificial Intelligence in Medicine in Europe},\n  pages={63--66},\n  year={2001},\n  organization={Springer}\n}\n\n@article{smith2014instance,\n  title={An instance level analysis of data complexity},\n  
author={Smith, Michael R and Martinez, Tony and Giraud-Carrier, Christophe},\n  journal={Machine learning},\n  volume={95},\n  number={2},\n  pages={225--256},\n  year={2014},\n  publisher={Springer}\n}\n\n@article{torelli2014rose,\n  author = {Menardi, Giovanna and Torelli, Nicola},\n  title={Training and assessing classification rules with imbalanced data},\n  journal={Data Mining and Knowledge Discovery},\n  volume={28},\n  pages={92-122},\n  year={2014},\n  publisher={Springer},\n  issue = {1},\n  issn = {1573-756X},\n  url = {https://doi.org/10.1007/s10618-012-0295-5},\n  doi = {10.1007/s10618-012-0295-5}\n}\n\n@article{esuli2009ordinal,\n  author = {A. Esuli and S. Baccianella and F. Sebastiani},\n  title = {Evaluation Measures for Ordinal Regression},\n  journal = {Intelligent Systems Design and Applications, International Conference on},\n  year = {2009},\n  volume = {1},\n  issn = {},\n  pages = {283-287},\n  keywords = {ordinal regression;ordinal classification;evaluation measures;class imbalance;product reviews},\n  doi = {10.1109/ISDA.2009.230},\n  url = {https://doi.ieeecomputersociety.org/10.1109/ISDA.2009.230},\n  publisher = {IEEE Computer Society},\n  address = {Los Alamitos, CA, USA},\n  month = {dec}\n}\n\n@article{stanfill1986toward,\n  title={Toward memory-based reasoning},\n  author={Stanfill, Craig and Waltz, David},\n  journal={Communications of the ACM},\n  volume={29},\n  number={12},\n  pages={1213--1228},\n  year={1986},\n  publisher={ACM New York, NY, USA}\n}\n\n@article{wilson1997improved,\n  title={Improved heterogeneous distance functions},\n  author={Wilson, D Randall and Martinez, Tony R},\n  journal={Journal of artificial intelligence research},\n  volume={6},\n  pages={1--34},\n  year={1997}\n}\n\n@inproceedings{wang2009diversity,\n  title={Diversity analysis on imbalanced data sets by using ensemble models},\n  author={Wang, Shuo and Yao, Xin},\n  booktitle={2009 IEEE symposium on computational intelligence and data mining},\n  
pages={324--331},\n  year={2009},\n  organization={IEEE}\n}\n\n@article{hido2009roughly,\n  title={Roughly balanced bagging for imbalanced data},\n  author={Hido, Shohei and Kashima, Hisashi and Takahashi, Yutaka},\n  journal={Statistical Analysis and Data Mining: The ASA Data Science Journal},\n  volume={2},\n  number={5-6},\n  pages={412--426},\n  year={2009},\n  publisher={Wiley Online Library}\n}\n\n@article{maclin1997empirical,\n  title={An empirical evaluation of bagging and boosting},\n  author={Maclin, Richard and Opitz, David},\n  journal={AAAI/IAAI},\n  volume={1997},\n  pages={546--551},\n  year={1997}\n}\n"
  },
  {
    "path": "doc/combine.rst",
    "content": ".. _combine:\n\n=======================================\nCombination of over- and under-sampling\n=======================================\n\n.. currentmodule:: imblearn.over_sampling\n\nWe previously presented :class:`SMOTE` and showed that this method can generate\nnoisy samples by interpolating new points between marginal outliers and\ninliers. This issue can be solved by cleaning the space resulting\nfrom over-sampling.\n\n.. currentmodule:: imblearn.combine\n\nIn this regard, Tomek's link and edited nearest-neighbours are the two cleaning\nmethods that have been added to the pipeline after applying SMOTE over-sampling\nto obtain a cleaner space. The two ready-to use classes imbalanced-learn\nimplements for combining over- and undersampling methods are: (i)\n:class:`SMOTETomek` :cite:`batista2004study` and (ii) :class:`SMOTEENN`\n:cite:`batista2003balancing`.\n\nThose two classes can be used like any other sampler with parameters identical\nto their former samplers::\n\n  >>> from collections import Counter\n  >>> from sklearn.datasets import make_classification\n  >>> X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,\n  ...                            n_redundant=0, n_repeated=0, n_classes=3,\n  ...                            n_clusters_per_class=1,\n  ...                            weights=[0.01, 0.05, 0.94],\n  ...                            
class_sep=0.8, random_state=0)\n  >>> print(sorted(Counter(y).items()))\n  [(0, 64), (1, 262), (2, 4674)]\n  >>> from imblearn.combine import SMOTEENN\n  >>> smote_enn = SMOTEENN(random_state=0)\n  >>> X_resampled, y_resampled = smote_enn.fit_resample(X, y)\n  >>> print(sorted(Counter(y_resampled).items()))\n  [(0, 4060), (1, 4381), (2, 3502)]\n  >>> from imblearn.combine import SMOTETomek\n  >>> smote_tomek = SMOTETomek(random_state=0)\n  >>> X_resampled, y_resampled = smote_tomek.fit_resample(X, y)\n  >>> print(sorted(Counter(y_resampled).items()))\n  [(0, 4499), (1, 4566), (2, 4413)]\n\nWe can also see in the example below that :class:`SMOTEENN` tends to clean more\nnoisy samples than :class:`SMOTETomek`.\n\n.. image:: ./auto_examples/combine/images/sphx_glr_plot_comparison_combine_001.png\n   :target: ./auto_examples/combine/plot_comparison_combine.html\n   :scale: 60\n   :align: center\n\n.. topic:: Examples\n\n  * :ref:`sphx_glr_auto_examples_combine_plot_comparison_combine.py`\n"
  },
  {
    "path": "doc/common_pitfalls.rst",
    "content": ".. _common_pitfalls:\n\n=========================================\nCommon pitfalls and recommended practices\n=========================================\n\nThis section is a complement to the documentation given\n`[here] <https://scikit-learn.org/dev/common_pitfalls.html>`_ in scikit-learn.\nIndeed, we will highlight the issue of misusing resampling, leading to a\n**data leakage**. Due to this leakage, the performance of a model reported\nwill be over-optimistic.\n\nData leakage\n============\n\nAs mentioned in the scikit-learn documentation, data leakage occurs when\ninformation that would not be available at prediction time is used when\nbuilding the model.\n\nIn the resampling setting, there is a common pitfall that corresponds to\nresample the **entire** dataset before splitting it into a train and a test\npartitions. Note that it would be equivalent to resample the train and test\npartitions as well.\n\nSuch of a processing leads to two issues:\n\n* the model will not be tested on a dataset with class distribution similar\n  to the real use-case. Indeed, by resampling the entire dataset, both the\n  training and testing set will be potentially balanced while the model should\n  be tested on the natural imbalanced dataset to evaluate the potential bias\n  of the model;\n* the resampling procedure might use information about samples in the dataset\n  to either generate or select some of the samples. Therefore, we might use\n  information of samples which will be later used as testing samples which\n  is the typical data leakage issue.\n\nWe will demonstrate the wrong and right ways to do some sampling and emphasize\nthe tools that one should use, avoiding to fall in the trap.\n\nWe will use the adult census dataset. For the sake of simplicity, we will only\nuse the numerical features. 
Also, we will make the dataset more imbalanced to\nincrease the effect of the wrongdoings::\n\n  >>> from sklearn.datasets import fetch_openml\n  >>> from imblearn.datasets import make_imbalance\n  >>> X, y = fetch_openml(\n  ...     data_id=1119, as_frame=True, return_X_y=True\n  ... )\n  >>> X = X.select_dtypes(include=\"number\")\n  >>> X, y = make_imbalance(\n  ...     X, y, sampling_strategy={\">50K\": 300}, random_state=1\n  ... )\n\nLet's first check the balancing ratio on this dataset::\n\n  >>> from collections import Counter\n  >>> {key: value / len(y) for key, value in Counter(y).items()}\n  {'<=50K': 0.988..., '>50K': 0.011...}\n\nTo later highlight some of the issue, we will keep aside a left-out set that we\nwill not use for the evaluation of the model::\n\n  >>> from sklearn.model_selection import train_test_split\n  >>> X, X_left_out, y, y_left_out = train_test_split(\n  ...     X, y, stratify=y, random_state=0\n  ... )\n\nWe will use a :class:`sklearn.ensemble.HistGradientBoostingClassifier` as a\nbaseline classifier. First, we will train and check the performance of this\nclassifier, without any preprocessing to alleviate the bias toward the majority\nclass. We evaluate the generalization performance of the classifier via\ncross-validation::\n\n  >>> from sklearn.ensemble import HistGradientBoostingClassifier\n  >>> from sklearn.model_selection import cross_validate\n  >>> model = HistGradientBoostingClassifier(random_state=0)\n  >>> cv_results = cross_validate(\n  ...     model, X, y, scoring=\"balanced_accuracy\",\n  ...     return_train_score=True, return_estimator=True,\n  ...     n_jobs=-1\n  ... )\n  >>> print(\n  ...     f\"Balanced accuracy mean +/- std. dev.: \"\n  ...     f\"{cv_results['test_score'].mean():.3f} +/- \"\n  ...     f\"{cv_results['test_score'].std():.3f}\"\n  ... )\n  Balanced accuracy mean +/- std. 
dev.: 0.609 +/- 0.024\n\nWe see that the classifier does not give good performance in terms of balanced\naccuracy mainly due to the class imbalance issue.\n\nIn the cross-validation, we stored the different classifiers of all folds. We\nwill show that evaluating these classifiers on the left-out data will give\nclose statistical performance::\n\n  >>> import numpy as np\n  >>> from sklearn.metrics import balanced_accuracy_score\n  >>> scores = []\n  >>> for fold_id, cv_model in enumerate(cv_results[\"estimator\"]):\n  ...     scores.append(\n  ...         balanced_accuracy_score(\n  ...             y_left_out, cv_model.predict(X_left_out)\n  ...         )\n  ...     )\n  >>> print(\n  ...     f\"Balanced accuracy mean +/- std. dev.: \"\n  ...     f\"{np.mean(scores):.3f} +/- {np.std(scores):.3f}\"\n  ... )\n  Balanced accuracy mean +/- std. dev.: 0.628 +/- 0.009\n\nLet's now show the **wrong** pattern to apply when it comes to resampling to\nalleviate the class imbalance issue. We will use a sampler to balance the\n**entire** dataset and check the statistical performance of our classifier via\ncross-validation::\n\n  >>> from imblearn.under_sampling import RandomUnderSampler\n  >>> sampler = RandomUnderSampler(random_state=0)\n  >>> X_resampled, y_resampled = sampler.fit_resample(X, y)\n  >>> model = HistGradientBoostingClassifier(random_state=0)\n  >>> cv_results = cross_validate(\n  ...     model, X_resampled, y_resampled, scoring=\"balanced_accuracy\",\n  ...     return_train_score=True, return_estimator=True,\n  ...     n_jobs=-1\n  ... )\n  >>> print(\n  ...     f\"Balanced accuracy mean +/- std. dev.: \"\n  ...     f\"{cv_results['test_score'].mean():.3f} +/- \"\n  ...     f\"{cv_results['test_score'].std():.3f}\"\n  ... )\n  Balanced accuracy mean +/- std. 
dev.: 0.724 +/- 0.042\n\nThe cross-validation performance looks good, but evaluating the classifiers\non the left-out data shows a different picture::\n\n  >>> scores = []\n  >>> for fold_id, cv_model in enumerate(cv_results[\"estimator\"]):\n  ...     scores.append(\n  ...         balanced_accuracy_score(\n  ...             y_left_out, cv_model.predict(X_left_out)\n  ...         )\n  ...     )\n  >>> print(\n  ...     f\"Balanced accuracy mean +/- std. dev.: \"\n  ...     f\"{np.mean(scores):.3f} +/- {np.std(scores):.3f}\"\n  ... )\n  Balanced accuracy mean +/- std. dev.: 0.698 +/- 0.014\n\nWe see that the performance is now worse than the cross-validated performance.\nIndeed, the data leakage gave us over-optimistic results for the reasons\nstated earlier in this section.\n\nWe will now illustrate the correct pattern to use. As in scikit-learn,\nusing a :class:`~imblearn.pipeline.Pipeline` avoids any data leakage\nbecause the resampling is delegated to imbalanced-learn and does not\nrequire any manual steps::\n\n  >>> from imblearn.pipeline import make_pipeline\n  >>> model = make_pipeline(\n  ...     RandomUnderSampler(random_state=0),\n  ...     HistGradientBoostingClassifier(random_state=0)\n  ... )\n  >>> cv_results = cross_validate(\n  ...     model, X, y, scoring=\"balanced_accuracy\",\n  ...     return_train_score=True, return_estimator=True,\n  ...     n_jobs=-1\n  ... )\n  >>> print(\n  ...     f\"Balanced accuracy mean +/- std. dev.: \"\n  ...     f\"{cv_results['test_score'].mean():.3f} +/- \"\n  ...     f\"{cv_results['test_score'].std():.3f}\"\n  ... )\n  Balanced accuracy mean +/- std. dev.: 0.732 +/- 0.019\n\nWe observe that we get good statistical performance as well. However, now we\ncan check the performance of the model from each cross-validation fold to\nensure that we have similar performance::\n\n  >>> scores = []\n  >>> for fold_id, cv_model in enumerate(cv_results[\"estimator\"]):\n  ...     scores.append(\n  ...       
  balanced_accuracy_score(\n  ...             y_left_out, cv_model.predict(X_left_out)\n  ...         )\n  ...     )\n  >>> print(\n  ...     f\"Balanced accuracy mean +/- std. dev.: \"\n  ...     f\"{np.mean(scores):.3f} +/- {np.std(scores):.3f}\"\n  ... )\n  Balanced accuracy mean +/- std. dev.: 0.727 +/- 0.008\n\nWe see that the statistical performance is very close to the one estimated by\nthe cross-validation study, without any sign of over-optimistic results.\n"
  },
  {
    "path": "doc/conf.py",
    "content": "#\n# imbalanced-learn documentation build configuration file, created by\n# sphinx-quickstart on Mon Jan 18 14:44:12 2016.\n#\n# This file is execfile()d with the current directory set to its\n# containing dir.\n#\n# Note that not all possible configuration values are present in this\n# autogenerated file.\n#\n# All configuration values have a default; values that are commented out\n# serve to show the default.\n\nimport os\nimport sys\nfrom datetime import datetime\nfrom io import StringIO\nfrom pathlib import Path\n\n# If extensions (or modules to document with autodoc) are in another directory,\n# add these directories to sys.path here. If the directory is relative to the\n# documentation root, use os.path.abspath to make it absolute, like shown here.\nsys.path.insert(0, os.path.abspath(\"sphinxext\"))\nfrom github_link import make_linkcode_resolve  # noqa\n\n# -- General configuration ------------------------------------------------\n\n# If your documentation needs a minimal Sphinx version, state it here.\n# needs_sphinx = '1.0'\n\n# Add any Sphinx extension module names here, as strings. They can be\n# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom\n# ones.\nextensions = [\n    \"sphinx.ext.autodoc\",\n    \"sphinx.ext.autosummary\",\n    \"sphinx.ext.doctest\",\n    \"sphinx.ext.intersphinx\",\n    \"sphinx.ext.linkcode\",\n    \"sphinxcontrib.bibtex\",\n    \"numpydoc\",\n    \"sphinx_issues\",\n    \"sphinx_gallery.gen_gallery\",\n    \"sphinx_copybutton\",\n    \"sphinx_design\",\n]\n\n# Specify how to identify the prompt when copying code snippets\ncopybutton_prompt_text = r\">>> |\\.\\.\\. 
\"\ncopybutton_prompt_is_regexp = True\n\n# Add any paths that contain templates here, relative to this directory.\ntemplates_path = [\"_templates\"]\n\n# The suffix of source filenames.\nsource_suffix = \".rst\"\n\n# The master toctree document.\nmaster_doc = \"index\"\n\n# General information about the project.\nproject = \"imbalanced-learn\"\ncopyright = f\"2014-{datetime.now().year}, The imbalanced-learn developers\"\n\n# The version info for the project you're documenting, acts as replacement for\n# |version| and |release|, also used in various other places throughout the\n# built documents.\n#\n# The short X.Y version.\nfrom imblearn import __version__  # noqa\n\nversion = __version__\n# The full version, including alpha/beta/rc tags.\nrelease = __version__\n\n# List of patterns, relative to source directory, that match files and\n# directories to ignore when looking for source files.\nexclude_patterns = [\"_build\", \"_templates\"]\n\n# The reST default role (used for this markup: `text`) to use for all\n# documents.\ndefault_role = \"literal\"\n\n# If true, '()' will be appended to :func: etc. cross-reference text.\nadd_function_parentheses = False\n\n# The name of the Pygments (syntax highlighting) style to use.\npygments_style = \"sphinx\"\n\n# -- Options for HTML output ----------------------------------------------\n\n# The theme to use for HTML and HTML Help pages.  
See the documentation for\n# a list of builtin themes.\nhtml_theme = \"pydata_sphinx_theme\"\nhtml_title = f\"Version {version}\"\nhtml_favicon = \"_static/img/favicon.ico\"\nhtml_logo = \"_static/img/logo_wide.png\"\nhtml_style = \"css/imbalanced-learn.css\"\nhtml_css_files = [\n    \"css/imbalanced-learn.css\",\n]\nhtml_sidebars = {\n    \"changelog\": [],\n}\n\nhtml_theme_options = {\n    \"external_links\": [],\n    \"github_url\": \"https://github.com/scikit-learn-contrib/imbalanced-learn\",\n    \"use_edit_page_button\": True,\n    \"show_toc_level\": 1,\n    # \"navbar_align\": \"right\",  # For testing that the navbar items align properly\n    \"logo\": {\n        \"image_dark\": (\n            \"https://imbalanced-learn.org/stable/_static/img/logo_wide_dark.png\"\n        )\n    },\n}\n\nhtml_context = {\n    \"github_user\": \"scikit-learn-contrib\",\n    \"github_repo\": \"imbalanced-learn\",\n    \"github_version\": \"master\",\n    \"doc_path\": \"doc\",\n}\n\n# Add any paths that contain custom static files (such as style sheets) here,\n# relative to this directory. 
They are copied after the builtin static files,\n# so a file named \"default.css\" will overwrite the builtin \"default.css\".\nhtml_static_path = [\"_static\"]\n\n# Output file base name for HTML help builder.\nhtmlhelp_basename = \"imbalanced-learndoc\"\n\n# -- Options for autodoc ------------------------------------------------------\n\nautodoc_default_options = {\n    \"members\": True,\n    \"inherited-members\": True,\n}\n\n# generate autosummary even if no references\nautosummary_generate = True\n\n# -- Options for numpydoc -----------------------------------------------------\n\n# this is needed for some reason...\n# see https://github.com/numpy/numpydoc/issues/69\nnumpydoc_show_class_members = False\n\n# -- Options for sphinxcontrib-bibtex -----------------------------------------\n\n# bibtex file\nbibtex_bibfiles = [\"bibtex/refs.bib\"]\n\n# -- Options for intersphinx --------------------------------------------------\n\n# intersphinx configuration\nintersphinx_mapping = {\n    \"python\": (f\"https://docs.python.org/{sys.version_info.major}\", None),\n    \"numpy\": (\"https://numpy.org/doc/stable\", None),\n    \"scipy\": (\"https://docs.scipy.org/doc/scipy/reference\", None),\n    \"matplotlib\": (\"https://matplotlib.org/\", None),\n    \"pandas\": (\"https://pandas.pydata.org/pandas-docs/stable/\", None),\n    \"joblib\": (\"https://joblib.readthedocs.io/en/latest/\", None),\n    \"seaborn\": (\"https://seaborn.pydata.org/\", None),\n}\n\n# -- Options for sphinx-gallery -----------------------------------------------\n\n# Generate the plot for the gallery\nplot_gallery = True\n\n# sphinx-gallery configuration\nsphinx_gallery_conf = {\n    \"doc_module\": \"imblearn\",\n    \"backreferences_dir\": os.path.join(\"references/generated\"),\n    \"show_memory\": True,\n    \"reference_url\": {\"imblearn\": None},\n}\n\n# -- Options for github link for what's new -----------------------------------\n\n# Config for sphinx_issues\nissues_uri = 
\"https://github.com/scikit-learn-contrib/imbalanced-learn/issues/{issue}\"\nissues_github_path = \"scikit-learn-contrib/imbalanced-learn\"\nissues_user_uri = \"https://github.com/{user}\"\n\n# The following is used by sphinx.ext.linkcode to provide links to github\nlinkcode_resolve = make_linkcode_resolve(\n    \"imblearn\",\n    (\n        \"https://github.com/scikit-learn-contrib/\"\n        \"imbalanced-learn/blob/{revision}/\"\n        \"{package}/{path}#L{lineno}\"\n    ),\n)\n\n# -- Options for LaTeX output ---------------------------------------------\n\nlatex_elements = {\n    # The paper size ('letterpaper' or 'a4paper').\n    # 'papersize': 'letterpaper',\n    # The font size ('10pt', '11pt' or '12pt').\n    # 'pointsize': '10pt',\n    # Additional stuff for the LaTeX preamble.\n    # 'preamble': '',\n}\n\n# Grouping the document tree into LaTeX files. List of tuples\n# (source start file, target name, title,\n#  author, documentclass [howto, manual, or own class]).\nlatex_documents = [\n    (\n        \"index\",\n        \"imbalanced-learn.tex\",\n        \"imbalanced-learn Documentation\",\n        \"The imbalanced-learn developers\",\n        \"manual\",\n    ),\n]\n\n# -- Options for manual page output ---------------------------------------\n\n# If false, no module index is generated.\n# latex_domain_indices = True\n\n\n# One entry per manual page. List of tuples\n# (source start file, name, description, authors, manual section).\nman_pages = [\n    (\n        \"index\",\n        \"imbalanced-learn\",\n        \"imbalanced-learn Documentation\",\n        [\"The imbalanced-learn developers\"],\n        1,\n    )\n]\n\n# If true, show URL addresses after external links.\n# man_show_urls = False\n\n# -- Options for Texinfo output -------------------------------------------\n\n# Grouping the document tree into Texinfo files. 
List of tuples\n# (source start file, target name, title, author,\n#  dir menu entry, description, category)\ntexinfo_documents = [\n    (\n        \"index\",\n        \"imbalanced-learn\",\n        \"imbalanced-learn Documentation\",\n        \"The imbalanced-learn developers\",\n        \"imbalanced-learn\",\n        \"Toolbox for imbalanced dataset in machine learning.\",\n        \"Miscellaneous\",\n    ),\n]\n\n# -- Dependencies generation ----------------------------------------------\n\n\ndef generate_min_dependency_table(app):\n    \"\"\"Generate min dependency table for docs.\"\"\"\n    from sklearn._min_dependencies import dependent_packages\n\n    # get length of header\n    package_header_len = max(len(package) for package in dependent_packages) + 4\n    version_header_len = len(\"Minimum Version\") + 4\n    tags_header_len = max(len(tags) for _, tags in dependent_packages.values()) + 4\n\n    output = StringIO()\n    output.write(\n        \" \".join(\n            [\"=\" * package_header_len, \"=\" * version_header_len, \"=\" * tags_header_len]\n        )\n    )\n    output.write(\"\\n\")\n    dependency_title = \"Dependency\"\n    version_title = \"Minimum Version\"\n    tags_title = \"Purpose\"\n\n    output.write(\n        f\"{dependency_title:<{package_header_len}} \"\n        f\"{version_title:<{version_header_len}} \"\n        f\"{tags_title}\\n\"\n    )\n\n    output.write(\n        \" \".join(\n            [\"=\" * package_header_len, \"=\" * version_header_len, \"=\" * tags_header_len]\n        )\n    )\n    output.write(\"\\n\")\n\n    for package, (version, tags) in dependent_packages.items():\n        output.write(\n            f\"{package:<{package_header_len}} {version:<{version_header_len}} {tags}\\n\"\n        )\n\n    output.write(\n        \" \".join(\n            [\"=\" * package_header_len, \"=\" * version_header_len, \"=\" * tags_header_len]\n        )\n    )\n    output.write(\"\\n\")\n    output = output.getvalue()\n\n    with 
(Path(\".\") / \"min_dependency_table.rst\").open(\"w\") as f:\n        f.write(output)\n\n\ndef generate_min_dependency_substitutions(app):\n    \"\"\"Generate min dependency substitutions for docs.\"\"\"\n    from sklearn._min_dependencies import dependent_packages\n\n    output = StringIO()\n\n    for package, (version, _) in dependent_packages.items():\n        package = package.capitalize()\n        output.write(f\".. |{package}MinVersion| replace:: {version}\")\n        output.write(\"\\n\")\n\n    output = output.getvalue()\n\n    with (Path(\".\") / \"min_dependency_substitutions.rst\").open(\"w\") as f:\n        f.write(output)\n\n\n# -- Additional temporary hacks -----------------------------------------------\n\n\ndef setup(app):\n    app.connect(\"builder-inited\", generate_min_dependency_table)\n    app.connect(\"builder-inited\", generate_min_dependency_substitutions)\n"
  },
  {
    "path": "doc/datasets/index.rst",
    "content": ".. _datasets:\n\n=========================\nDataset loading utilities\n=========================\n\n.. currentmodule:: imblearn.datasets\n\nThe :mod:`imblearn.datasets` package is complementing the\n:mod:`sklearn.datasets` package. The package provides both: (i) a set of\nimbalanced datasets to perform systematic benchmark and (ii) a utility to\ncreate an imbalanced dataset from an original balanced dataset.\n\n.. _zenodo:\n\nImbalanced datasets for benchmark\n=================================\n\n:func:`fetch_datasets` allows to fetch 27 datasets which are imbalanced and\nbinarized. The following data sets are available:\n\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |ID|Name          | Repository & Target           | Ratio | #S      | #F  |\n    +==+==============+===============================+=======+=========+=====+\n    |1 |ecoli         | UCI, target: imU              | 8.6:1 | 336     | 7   |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |2 |optical_digits| UCI, target: 8                | 9.1:1 | 5,620   | 64  |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |3 |satimage      | UCI, target: 4                | 9.3:1 | 6,435   | 36  |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |4 |pen_digits    | UCI, target: 5                | 9.4:1 | 10,992  | 16  |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |5 |abalone       | UCI, target: 7                | 9.7:1 | 4,177   | 10  |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |6 |sick_euthyroid| UCI, target: sick euthyroid   | 9.8:1 | 3,163   | 42  |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |7 |spectrometer  | UCI, target: >=44             | 11:1  | 531     | 93  |\n    
+--+--------------+-------------------------------+-------+---------+-----+\n    |8 |car_eval_34   | UCI, target: good, v good     | 12:1  | 1,728   | 21  |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |9 |isolet        | UCI, target: A, B             | 12:1  | 7,797   | 617 |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |10|us_crime      | UCI, target: >0.65            | 12:1  | 1,994   | 100 |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |11|yeast_ml8     | LIBSVM, target: 8             | 13:1  | 2,417   | 103 |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |12|scene         | LIBSVM, target: >one label    | 13:1  | 2,407   | 294 |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |13|libras_move   | UCI, target: 1                | 14:1  | 360     | 90  |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |14|thyroid_sick  | UCI, target: sick             | 15:1  | 3,772   | 52  |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |15|coil_2000     | KDD, CoIL, target: minority   | 16:1  | 9,822   | 85  |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |16|arrhythmia    | UCI, target: 06               | 17:1  | 452     | 278 |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |17|solar_flare_m0| UCI, target: M->0             | 19:1  | 1,389   | 32  |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |18|oil           | UCI, target: minority         | 22:1  | 937     | 49  |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |19|car_eval_4    | UCI, target: vgood            | 26:1  | 1,728   | 21  |\n    
+--+--------------+-------------------------------+-------+---------+-----+\n    |20|wine_quality  | UCI, wine, target: <=4        | 26:1  | 4,898   | 11  |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |21|letter_img    | UCI, target: Z                | 26:1  | 20,000  | 16  |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |22|yeast_me2     | UCI, target: ME2              | 28:1  | 1,484   | 8   |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |23|webpage       | LIBSVM, w7a, target: minority | 33:1  | 34,780  | 300 |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |24|ozone_level   | UCI, ozone, data              | 34:1  | 2,536   | 72  |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |25|mammography   | UCI, target: minority         | 42:1  | 11,183  | 6   |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |26|protein_homo  | KDD CUP 2004, minority        | 11:1  | 145,751 | 74  |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |27|abalone_19    | UCI, target: 19               | 130:1 | 4,177   | 10  |\n    +--+--------------+-------------------------------+-------+---------+-----+\n\n\nA specific data set can be selected as::\n\n  >>> from collections import Counter\n  >>> from imblearn.datasets import fetch_datasets\n  >>> ecoli = fetch_datasets()['ecoli']\n  >>> ecoli.data.shape\n  (336, 7)\n  >>> print(sorted(Counter(ecoli.target).items()))\n  [(-1, 301), (1, 35)]\n\n.. _make_imbalanced:\n\nImbalanced generator\n====================\n\n:func:`make_imbalance` turns an original dataset into an imbalanced\ndataset. This behaviour is driven by the parameter ``sampling_strategy`` which\nbehave similarly to other resampling algorithm. 
``sampling_strategy`` can be\ngiven as a dictionary where the key corresponds to the class and the value is\nthe number of samples in the class::\n\n  >>> from sklearn.datasets import load_iris\n  >>> from imblearn.datasets import make_imbalance\n  >>> iris = load_iris()\n  >>> sampling_strategy = {0: 20, 1: 30, 2: 40}\n  >>> X_imb, y_imb = make_imbalance(iris.data, iris.target,\n  ...                               sampling_strategy=sampling_strategy)\n  >>> sorted(Counter(y_imb).items())\n  [(0, 20), (1, 30), (2, 40)]\n\nNote that all samples of a class are passed-through if the class is not mentioned\nin the dictionary::\n\n  >>> sampling_strategy = {0: 10}\n  >>> X_imb, y_imb = make_imbalance(iris.data, iris.target,\n  ...                               sampling_strategy=sampling_strategy)\n  >>> sorted(Counter(y_imb).items())\n  [(0, 10), (1, 50), (2, 50)]\n\nInstead of a dictionary, a function can be defined and directly pass to\n``sampling_strategy``::\n\n  >>> def ratio_multiplier(y):\n  ...     multiplier = {0: 0.5, 1: 0.7, 2: 0.95}\n  ...     target_stats = Counter(y)\n  ...     for key, value in target_stats.items():\n  ...         target_stats[key] = int(value * multiplier[key])\n  ...     return target_stats\n  >>> X_imb, y_imb = make_imbalance(iris.data, iris.target,\n  ...                               sampling_strategy=ratio_multiplier)\n  >>> sorted(Counter(y_imb).items())\n  [(0, 25), (1, 35), (2, 47)]\n\nIt would also work with pandas dataframe::\n\n  >>> from sklearn.datasets import fetch_openml\n  >>> df, y = fetch_openml(\n  ...     'iris', version=1, return_X_y=True, as_frame=True)\n  >>> df_resampled, y_resampled = make_imbalance(\n  ...     df, y, sampling_strategy={'Iris-setosa': 10, 'Iris-versicolor': 20},\n  ...     
random_state=42)\n  >>> df_resampled.head()\n          sepallength  sepalwidth  petallength  petalwidth\n    13          4.3         3.0          1.1         0.1\n    39          5.1         3.4          1.5         0.2\n    30          4.8         3.1          1.6         0.2\n    45          4.8         3.0          1.4         0.3\n    17          5.1         3.5          1.4         0.3\n  >>> Counter(y_resampled)\n  Counter({'Iris-virginica': 50, 'Iris-versicolor': 20, 'Iris-setosa': 10})\n\nSee :ref:`sphx_glr_auto_examples_datasets_plot_make_imbalance.py` and\n:ref:`sphx_glr_auto_examples_api_plot_sampling_strategy_usage.py`.\n"
  },
  {
    "path": "doc/developers_utils.rst",
    "content": ".. _developers-utils:\n\n===================\nDeveloper guideline\n===================\n\nDeveloper utilities\n-------------------\n\nImbalanced-learn contains a number of utilities to help with development. These are\nlocated in :mod:`imblearn.utils`, and include tools in a number of categories.\nAll the following functions and classes are in the module :mod:`imblearn.utils`.\n\n.. warning ::\n\n   These utilities are meant to be used internally within the imbalanced-learn\n   package. They are not guaranteed to be stable between versions of\n   imbalanced-learn. Backports, in particular, will be removed as the\n   imbalanced-learn dependencies evolve.\n\n\nValidation Tools\n~~~~~~~~~~~~~~~~\n\n.. currentmodule:: imblearn.utils\n\nThese are tools used to check and validate input. When you write a function\nwhich accepts arrays, matrices, or sparse matrices as arguments, the following\nshould be used when applicable.\n\n- :func:`check_neighbors_object`: Check the objects is consistent to be a NN.\n- :func:`check_target_type`: Check the target types to be conform to the current\n  samplers.\n- :func:`check_sampling_strategy`: Checks that sampling target is consistent with\n  the type and return a dictionary containing each targeted class with its\n  corresponding number of pixel.\n\n\nDeprecation\n~~~~~~~~~~~\n\n.. currentmodule:: imblearn.utils.deprecation\n\n.. warning ::\n   Apart from :func:`deprecate_parameter` the rest of this section is taken from\n   scikit-learn. 
Please refer to their original documentation.\n\nIf any publicly accessible method, function, attribute or parameter\nis renamed, we still support the old one for two releases and issue\na deprecation warning when it is called/passed/accessed.\nE.g., if the function ``zero_one`` is renamed to ``zero_one_loss``,\nwe add the decorator ``deprecated`` (from ``sklearn.utils``)\nto ``zero_one`` and call ``zero_one_loss`` from that function::\n\n    from ..utils import deprecated\n\n    def zero_one_loss(y_true, y_pred, normalize=True):\n        # actual implementation\n        pass\n\n    @deprecated(\"Function 'zero_one' was renamed to 'zero_one_loss' \"\n                \"in version 0.13 and will be removed in release 0.15. \"\n                \"Default behavior is changed from 'normalize=False' to \"\n                \"'normalize=True'\")\n    def zero_one(y_true, y_pred, normalize=False):\n        return zero_one_loss(y_true, y_pred, normalize)\n\nIf an attribute is to be deprecated,\nuse the decorator ``deprecated`` on a property.\nE.g., renaming an attribute ``labels_`` to ``classes_`` can be done as::\n\n    @property\n    @deprecated(\"Attribute labels_ was deprecated in version 0.13 and \"\n                \"will be removed in 0.15. Use 'classes_' instead\")\n    def labels_(self):\n        return self.classes_\n\nIf a parameter has to be deprecated, use ``FutureWarning`` appropriately.\nIn the following example, ``k`` is deprecated and renamed to ``n_clusters``::\n\n    import warnings\n\n    def example_function(n_clusters=8, k=None):\n        if k is not None:\n            warnings.warn(\"'k' was renamed to n_clusters in version 0.13 and \"\n                          \"will be removed in 0.15.\", FutureWarning)\n            n_clusters = k\n\nAs in these examples, the warning message should always give both the\nversion in which the deprecation happened and the version in which the\nold behavior will be removed. 
If the deprecation happened in version\n0.x-dev, the message should say the deprecation occurred in version 0.x and\nthe removal will be in 0.(x+2). For example, if the deprecation happened\nin version 0.18-dev, the message should say it happened in version 0.18\nand the old behavior will be removed in version 0.20.\n\nIn addition, a deprecation note should be added in the docstring, recalling the\nsame information as the deprecation warning as explained above. Use the\n``.. deprecated::`` directive::\n\n  .. deprecated:: 0.13\n     ``k`` was renamed to ``n_clusters`` in version 0.13 and will be removed\n     in 0.15.\n\nOn top of all the functionality provided by scikit-learn, imbalanced-learn\nprovides :func:`deprecate_parameter`, which is used to deprecate a sampler's\nparameter (attribute) in favor of another one.\n\nMaking a release\n----------------\nThis section documents the different steps that are necessary to make a new\nimbalanced-learn release.\n\nMajor release\n~~~~~~~~~~~~~\n\n* Update the release note `whats_new/v0.<version number>.rst` by giving a date\n  and removing the status \"Under development\" from the title.\n* Run `bumpversion release`. It will remove the `dev0` tag.\n* Commit the change `git commit -am \"bumpversion 0.<version number>.0\"`\n  (e.g., `git commit -am \"bumpversion 0.5.0\"`).\n* Create a branch for this version\n  (e.g., `git checkout -b 0.<version number>.X`).\n* Push the new branch into the upstream remote imbalanced-learn repository.\n* Change the `symlink` in the\n  `imbalanced-learn website repository <https://github.com/imbalanced-learn/imbalanced-learn.github.io>`_\n  such that stable points to the latest release version,\n  i.e., `0.<version number>`. To do this, clone the repository,\n  run `unlink stable`, followed by `ln -s 0.<version number> stable`. 
To check\n  that this was performed correctly, ensure that stable has the new version\n  number using `ls -l`.\n* Return to your imbalanced-learn repository, in the branch\n  `0.<version number>.X`.\n* Create the source distribution and wheel: `python setup.py sdist` and\n  `python setup.py bdist_wheel`.\n* Upload these files to PyPI using `twine upload dist/*`.\n* Switch to the `master` branch and run `bumpversion minor`, commit and push on\n  upstream. We are officially at `0.<version number + 1>.0.dev0`.\n* Create a GitHub release by clicking on \"Draft a new release\" here.\n  \"Tag version\" should be the latest version number (e.g., `0.<version>.0`),\n  \"Target\" should be the branch for that release\n  (e.g., `0.<version number>.X`) and \"Release title\" should be\n  \"Version <version number>\". Add the notes from the release notes there.\n* Add a new `v0.<version number + 1>.rst` file in `doc/whats_new/` and\n  `.. include::` this new file in `doc/whats_new.rst`. Mark the version as the\n  version under development.\n* Finally, go to the `conda-forge feedstock <https://github.com/conda-forge/imbalanced-learn-feedstock>`_,\n  where a new PR will be created when the feedstock synchronizes with the\n  PyPI repository. Merge this PR such that we have the binary for `conda`\n  available.\n\nBug fix release\n~~~~~~~~~~~~~~~\n\n* Find the hash(es) of the bug fix commit(s) you wish to backport using\n  `git log`.\n* Check out the branch for the latest release, e.g.,\n  `git checkout 0.<version number>.X`.\n* Append the bug fix commit(s) to the branch using `git cherry-pick <hash>`.\n  Alternatively, you can use interactive rebasing from the `master` branch.\n* Bump the version number with `bumpversion patch`. This will bump the patch\n  version, for example from `0.X.0` to `0.X.* dev0`.\n* Mark the current version as a release version (as opposed to `dev` version)\n  with `bumpversion release --allow-dirty`. 
It will bump the version, for\n  example from `0.X.* dev0` to `0.X.1`.\n* Commit the changes with `git commit -am 'bumpversion <new version>'`.\n* Push the changes to the release branch in upstream, e.g.\n  `git push <upstream remote> <release branch>`.\n* Use the same process as in a major release to upload on PyPI and conda-forge.\n"
  },
  {
    "path": "doc/ensemble.rst",
    "content": ".. _ensemble:\n\n====================\nEnsemble of samplers\n====================\n\n.. currentmodule:: imblearn.ensemble\n\n.. _ensemble_meta_estimators:\n\nClassifier including inner balancing samplers\n=============================================\n\n.. _bagging:\n\nBagging classifier\n------------------\n\nIn ensemble classifiers, bagging methods build several estimators on different\nrandomly selected subset of data. In scikit-learn, this classifier is named\n:class:`~sklearn.ensemble.BaggingClassifier`. However, this classifier does not\nallow each subset of data to be balanced. Therefore, when training on an imbalanced\ndata set, this classifier will favor the majority classes::\n\n  >>> from sklearn.datasets import make_classification\n  >>> X, y = make_classification(n_samples=10000, n_features=2, n_informative=2,\n  ...                            n_redundant=0, n_repeated=0, n_classes=3,\n  ...                            n_clusters_per_class=1,\n  ...                            weights=[0.01, 0.05, 0.94], class_sep=0.8,\n  ...                            random_state=0)\n  >>> from sklearn.model_selection import train_test_split\n  >>> from sklearn.metrics import balanced_accuracy_score\n  >>> from sklearn.ensemble import BaggingClassifier\n  >>> from sklearn.tree import DecisionTreeClassifier\n  >>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)\n  >>> bc = BaggingClassifier(DecisionTreeClassifier(), random_state=0)\n  >>> bc.fit(X_train, y_train) #doctest:\n  BaggingClassifier(...)\n  >>> y_pred = bc.predict(X_test)\n  >>> balanced_accuracy_score(y_test, y_pred)\n  0.77...\n\nIn :class:`BalancedBaggingClassifier`, each bootstrap sample will be further\nresampled to achieve the `sampling_strategy` desired. Therefore,\n:class:`BalancedBaggingClassifier` takes the same parameters as the\nscikit-learn :class:`~sklearn.ensemble.BaggingClassifier`. 
In addition, the\nsampling is controlled by the parameter `sampler` or the two parameters\n`sampling_strategy` and `replacement`, if one wants to use the\n:class:`~imblearn.under_sampling.RandomUnderSampler`::\n\n  >>> from imblearn.ensemble import BalancedBaggingClassifier\n  >>> bbc = BalancedBaggingClassifier(DecisionTreeClassifier(),\n  ...                                 sampling_strategy='auto',\n  ...                                 replacement=False,\n  ...                                 random_state=0)\n  >>> bbc.fit(X_train, y_train)\n  BalancedBaggingClassifier(...)\n  >>> y_pred = bbc.predict(X_test)\n  >>> balanced_accuracy_score(y_test, y_pred)\n  0.8...\n\nChanging the `sampler` will give rise to different known implementations\n:cite:`maclin1997empirical`, :cite:`hido2009roughly`,\n:cite:`wang2009diversity`. You can refer to the following example which shows these\ndifferent methods in practice:\n:ref:`sphx_glr_auto_examples_ensemble_plot_bagging_classifier.py`\n\n.. _forest:\n\nForest of randomized trees\n--------------------------\n\n:class:`BalancedRandomForestClassifier` is another ensemble method in which\neach tree of the forest will be provided a balanced bootstrap sample\n:cite:`chen2004using`. This class provides all functionality of the\n:class:`~sklearn.ensemble.RandomForestClassifier`::\n\n  >>> from imblearn.ensemble import BalancedRandomForestClassifier\n  >>> brf = BalancedRandomForestClassifier(\n  ...     n_estimators=100, random_state=0, sampling_strategy=\"all\", replacement=True,\n  ...     bootstrap=False,\n  ... )\n  >>> brf.fit(X_train, y_train)\n  BalancedRandomForestClassifier(...)\n  >>> y_pred = brf.predict(X_test)\n  >>> balanced_accuracy_score(y_test, y_pred)\n  0.8...\n\n.. 
_boosting:\n\nBoosting\n--------\n\nSeveral methods taking advantage of boosting have been designed.\n\n:class:`RUSBoostClassifier` randomly under-samples the dataset before performing\na boosting iteration :cite:`seiffert2009rusboost`::\n\n  >>> from imblearn.ensemble import RUSBoostClassifier\n  >>> rusboost = RUSBoostClassifier(n_estimators=200, random_state=0)\n  >>> rusboost.fit(X_train, y_train)\n  RUSBoostClassifier(...)\n  >>> y_pred = rusboost.predict(X_test)\n  >>> balanced_accuracy_score(y_test, y_pred)\n  0...\n\nA specific method which uses :class:`~sklearn.ensemble.AdaBoostClassifier` as\nlearners in the bagging classifier is called \"EasyEnsemble\". The\n:class:`EasyEnsembleClassifier` allows bagging AdaBoost learners which are\ntrained on balanced bootstrap samples :cite:`liu2008exploratory`. Similarly to\nthe :class:`BalancedBaggingClassifier` API, one can construct the ensemble as::\n\n  >>> from imblearn.ensemble import EasyEnsembleClassifier\n  >>> eec = EasyEnsembleClassifier(random_state=0)\n  >>> eec.fit(X_train, y_train)\n  EasyEnsembleClassifier(...)\n  >>> y_pred = eec.predict(X_test)\n  >>> balanced_accuracy_score(y_test, y_pred)\n  0.6...\n\n.. topic:: Examples\n\n  * :ref:`sphx_glr_auto_examples_ensemble_plot_comparison_ensemble_classifier.py`\n"
  },
  {
    "path": "doc/index.rst",
    "content": ".. project-template documentation master file, created by\n   sphinx-quickstart on Mon Jan 18 14:44:12 2016.\n   You can adapt this file completely to your liking, but it should at least\n   contain the root `toctree` directive.\n\n:notoc:\n\n##############################\nimbalanced-learn documentation\n##############################\n\n**Date**: |today| **Version**: |version|\n\n**Useful links**:\n`Binary Installers <https://pypi.org/project/imbalanced-learn>`__ |\n`Source Repository <https://github.com/scikit-learn-contrib/imbalanced-learn>`__ |\n`Issues & Ideas <https://github.com/scikit-learn-contrib/imbalanced-learn/issues>`__ |\n`Q&A Support <https://gitter.im/scikit-learn-contrib/imbalanced-learn>`__\n\nImbalanced-learn (imported as :mod:`imblearn`) is an open source, MIT-licensed\nlibrary relying on scikit-learn (imported as :mod:`sklearn`) and provides tools\nwhen dealing with classification with imbalanced classes.\n\n.. grid:: 1 2 2 2\n    :gutter: 4\n    :padding: 2 2 0 0\n    :class-container: sd-text-center\n\n    .. grid-item-card:: Getting started\n        :img-top: _static/index_getting_started.svg\n        :class-card: intro-card\n        :shadow: md\n\n        Check out the getting started guides to install `imbalanced-learn`.\n        Some extra information to get started with a new contribution is also provided.\n\n        +++\n\n        .. button-ref:: getting_started\n            :ref-type: ref\n            :click-parent:\n            :color: secondary\n            :expand:\n\n            To the installation guideline\n\n    .. grid-item-card::  User guide\n        :img-top: _static/index_user_guide.svg\n        :class-card: intro-card\n        :shadow: md\n\n        The user guide provides in-depth information on the key concepts of\n        `imbalanced-learn` with useful background information and explanation.\n\n        +++\n\n        .. 
button-ref:: user_guide\n            :ref-type: ref\n            :click-parent:\n            :color: secondary\n            :expand:\n\n            To the user guide\n\n    .. grid-item-card::  API reference\n        :img-top: _static/index_api.svg\n        :class-card: intro-card\n        :shadow: md\n\n        The reference guide contains a detailed description of\n        the `imbalanced-learn` API. Use it to learn more about method parameters.\n\n        +++\n\n        .. button-ref:: api\n            :ref-type: ref\n            :click-parent:\n            :color: secondary\n            :expand:\n\n            To the reference guide\n\n    .. grid-item-card::  Examples\n        :img-top: _static/index_examples.svg\n        :class-card: intro-card\n        :shadow: md\n\n        The gallery of examples is a good place to see `imbalanced-learn` in action.\n        Select an example and dive in.\n\n        +++\n\n        .. button-ref:: general_examples\n            :ref-type: ref\n            :click-parent:\n            :color: secondary\n            :expand:\n\n            To the gallery of examples\n\n\n.. toctree::\n    :maxdepth: 3\n    :hidden:\n    :titlesonly:\n\n    install\n    user_guide\n    references/index\n    auto_examples/index\n    whats_new\n    about\n"
  },
  {
    "path": "doc/install.rst",
    "content": ".. _getting_started:\n\n###############\nGetting Started\n###############\n\nPrerequisites\n=============\n\n.. |PythonMinVersion| replace:: 3.10\n.. |NumPyMinVersion| replace:: 1.25.2\n.. |SciPyMinVersion| replace:: 1.11.4\n.. |ScikitLearnMinVersion| replace:: 1.4.2\n.. |MatplotlibMinVersion| replace:: 3.7.3\n.. |PandasMinVersion| replace:: 2.0.3\n.. |TensorflowMinVersion| replace:: 2.16.1\n.. |KerasMinVersion| replace:: 3.3.3\n.. |SeabornMinVersion| replace:: 0.12.2\n.. |PytestMinVersion| replace:: 7.2.2\n\n`imbalanced-learn` requires the following dependencies:\n\n- Python (>= |PythonMinVersion|)\n- NumPy (>= |NumPyMinVersion|)\n- SciPy (>= |SciPyMinVersion|)\n- Scikit-learn (>= |ScikitLearnMinVersion|)\n- Pytest (>= |PytestMinVersion|)\n\nAdditionally, `imbalanced-learn` requires the following optional dependencies:\n\n- Pandas (>= |PandasMinVersion|) for dealing with dataframes\n- Tensorflow (>= |TensorflowMinVersion|) for dealing with TensorFlow models\n- Keras (>= |KerasMinVersion|) for dealing with Keras models\n\nThe examples will requires the following additional dependencies:\n\n- Matplotlib (>= |MatplotlibMinVersion|)\n- Seaborn (>= |SeabornMinVersion|)\n\nInstall\n=======\n\nFrom PyPi or conda-forge repositories\n-------------------------------------\n\nimbalanced-learn is currently available on the PyPi's repositories and you can\ninstall it via `pip`::\n\n  pip install imbalanced-learn\n\nThe package is released also on the conda-forge repositories and you can install\nit with `conda` (or `mamba`)::\n\n  conda install -c conda-forge imbalanced-learn\n\nIntel optimizations via scikit-learn-intelex\n--------------------------------------------\n\nImbalanced-learn relies entirely on scikit-learn algorithms. 
Intel provides an\noptimized version of scikit-learn for Intel hardware, called scikit-learn-intelex.\nInstalling scikit-learn-intelex and patching scikit-learn will activate the\nIntel optimizations.\n\nYou can refer to the following\n`blog post <https://medium.com/intel-analytics-software/why-pay-more-for-machine-learning-893683bd78e4>`_\nfor some benchmarks.\n\nRefer to the following documentation for instructions:\n\n- `Installation guide <https://intel.github.io/scikit-learn-intelex/installation.html>`_.\n- `Patching guide <https://intel.github.io/scikit-learn-intelex/what-is-patching.html>`_.\n\nFrom source available on GitHub\n-------------------------------\n\nIf you prefer, you can clone the repository and install the package from source.\nUse the following commands to get a copy from GitHub and install all dependencies::\n\n  git clone https://github.com/scikit-learn-contrib/imbalanced-learn.git\n  cd imbalanced-learn\n  pip install .\n\nBe aware that you can install in developer mode with::\n\n  pip install --no-build-isolation --editable .\n\nIf you wish to make pull-requests on GitHub, we advise you to install\npre-commit::\n\n  pip install pre-commit\n  pre-commit install\n\nTest and coverage\n=================\n\nTo test the code before installing::\n\n  $ make test\n\nTo check the test coverage of your version::\n\n  $ make coverage\n\nYou can also use `pytest`::\n\n  $ pytest imblearn -v\n\nContribute\n==========\n\nYou can contribute to this code through pull requests on GitHub_. Please make\nsure that your code comes with unit tests to ensure full coverage and\ncontinuous integration in the API.\n\n.. _GitHub: https://github.com/scikit-learn-contrib/imbalanced-learn/pulls\n"
  },
  {
    "path": "doc/introduction.rst",
    "content": ".. _introduction:\n\n============\nIntroduction\n============\n\n.. _api_imblearn:\n\nAPI's of imbalanced-learn samplers\n----------------------------------\n\nThe available samplers follow the\n`scikit-learn API <https://scikit-learn.org/stable/getting_started.html#fitting-and-predicting-estimator-basics>`_\nusing the base estimator\nand incorporating a sampling functionality via the ``sample`` method:\n\n:Estimator:\n\n    The base object, implements a ``fit`` method to learn from data::\n\n      estimator = obj.fit(data, targets)\n\n:Resampler:\n\n    To resample a data sets, each sampler implements a ``fit_resample`` method::\n\n      data_resampled, targets_resampled = obj.fit_resample(data, targets)\n\nImbalanced-learn samplers accept the same inputs as scikit-learn estimators:\n\n* `data`, 2-dimensional array-like structures, such as:\n   * Python's list of lists :class:`list`,\n   * Numpy arrays :class:`numpy.ndarray`,\n   * Panda dataframes :class:`pandas.DataFrame`,\n   * Scipy sparse matrices :class:`scipy.sparse.csr_matrix` or :class:`scipy.sparse.csc_matrix`;\n\n* `targets`, 1-dimensional array-like structures, such as:\n   * Numpy arrays :class:`numpy.ndarray`,\n   * Pandas series :class:`pandas.Series`.\n\nThe output will be of the following type:\n\n* `data_resampled`, 2-dimensional aray-like structures, such as:\n   * Numpy arrays :class:`numpy.ndarray`,\n   * Pandas dataframes :class:`pandas.DataFrame`,\n   * Scipy sparse matrices :class:`scipy.sparse.csr_matrix` or :class:`scipy.sparse.csc_matrix`;\n\n* `targets_resampled`, 1-dimensional array-like structures, such as:\n   * Numpy arrays :class:`numpy.ndarray`,\n   * Pandas series :class:`pandas.Series`.\n\n.. topic:: Pandas in/out\n\n   Unlike scikit-learn, imbalanced-learn provides support for pandas in/out.\n   Therefore providing a dataframe, will output as well a dataframe.\n\n.. 
topic:: Sparse input\n\n   For sparse input, the data is **converted to the Compressed Sparse Rows\n   representation** (see ``scipy.sparse.csr_matrix``) before being fed to the\n   sampler. To avoid unnecessary memory copies, it is recommended to choose the\n   CSR representation upstream.\n\n.. _problem_statement:\n\nProblem statement regarding imbalanced data sets\n------------------------------------------------\n\nThe learning and prediction phases of machine learning algorithms\ncan be impacted by the issue of **imbalanced datasets**. This imbalance\nrefers to the difference in the number of samples across different classes.\nWe demonstrate the effect of training a `Logistic Regression classifier\n<https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html>`_\nwith varying levels of class balancing by adjusting the class weights.\n\n.. image:: ./auto_examples/over-sampling/images/sphx_glr_plot_comparison_over_sampling_001.png\n   :target: ./auto_examples/over-sampling/plot_comparison_over_sampling.html\n   :scale: 60\n   :align: center\n\nAs expected, the decision function of the Logistic Regression classifier varies significantly\ndepending on how imbalanced the data is. With a greater imbalance ratio, the decision function\ntends to favour the class with the larger number of samples, usually referred to as the\n**majority class**.\n"
  },
  {
    "path": "doc/make.bat",
    "content": "@ECHO OFF\r\n\r\nREM Command file for Sphinx documentation\r\n\r\nif \"%SPHINXBUILD%\" == \"\" (\r\n\tset SPHINXBUILD=sphinx-build\r\n)\r\nset BUILDDIR=_build\r\nset ALLSPHINXOPTS=-d %BUILDDIR%/doctrees %SPHINXOPTS% .\r\nset I18NSPHINXOPTS=%SPHINXOPTS% .\r\nif NOT \"%PAPER%\" == \"\" (\r\n\tset ALLSPHINXOPTS=-D latex_paper_size=%PAPER% %ALLSPHINXOPTS%\r\n\tset I18NSPHINXOPTS=-D latex_paper_size=%PAPER% %I18NSPHINXOPTS%\r\n)\r\n\r\nif \"%1\" == \"\" goto help\r\n\r\nif \"%1\" == \"help\" (\r\n\t:help\r\n\techo.Please use `make ^<target^>` where ^<target^> is one of\r\n\techo.  html       to make standalone HTML files\r\n\techo.  dirhtml    to make HTML files named index.html in directories\r\n\techo.  singlehtml to make a single large HTML file\r\n\techo.  pickle     to make pickle files\r\n\techo.  json       to make JSON files\r\n\techo.  htmlhelp   to make HTML files and a HTML help project\r\n\techo.  qthelp     to make HTML files and a qthelp project\r\n\techo.  devhelp    to make HTML files and a Devhelp project\r\n\techo.  epub       to make an epub\r\n\techo.  latex      to make LaTeX files, you can set PAPER=a4 or PAPER=letter\r\n\techo.  text       to make text files\r\n\techo.  man        to make manual pages\r\n\techo.  texinfo    to make Texinfo files\r\n\techo.  gettext    to make PO message catalogs\r\n\techo.  changes    to make an overview over all changed/added/deprecated items\r\n\techo.  xml        to make Docutils-native XML files\r\n\techo.  pseudoxml  to make pseudoxml-XML files for display purposes\r\n\techo.  linkcheck  to check all external links for integrity\r\n\techo.  doctest    to run all doctests embedded in the documentation if enabled\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"clean\" (\r\n\tfor /d %%i in (%BUILDDIR%\\*) do rmdir /q /s %%i\r\n\tdel /q /s %BUILDDIR%\\*\r\n\tgoto end\r\n)\r\n\r\n\r\n%SPHINXBUILD% 2> nul\r\nif errorlevel 9009 (\r\n\techo.\r\n\techo.The 'sphinx-build' command was not found. 
Make sure you have Sphinx\r\n\techo.installed, then set the SPHINXBUILD environment variable to point\r\n\techo.to the full path of the 'sphinx-build' executable. Alternatively you\r\n\techo.may add the Sphinx directory to PATH.\r\n\techo.\r\n\techo.If you don't have Sphinx installed, grab it from\r\n\techo.http://sphinx-doc.org/\r\n\texit /b 1\r\n)\r\n\r\nif \"%1\" == \"html\" (\r\n\t%SPHINXBUILD% -b html %ALLSPHINXOPTS% %BUILDDIR%/html\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished. The HTML pages are in %BUILDDIR%/html.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"dirhtml\" (\r\n\t%SPHINXBUILD% -b dirhtml %ALLSPHINXOPTS% %BUILDDIR%/dirhtml\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished. The HTML pages are in %BUILDDIR%/dirhtml.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"singlehtml\" (\r\n\t%SPHINXBUILD% -b singlehtml %ALLSPHINXOPTS% %BUILDDIR%/singlehtml\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished. The HTML pages are in %BUILDDIR%/singlehtml.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"pickle\" (\r\n\t%SPHINXBUILD% -b pickle %ALLSPHINXOPTS% %BUILDDIR%/pickle\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished; now you can process the pickle files.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"json\" (\r\n\t%SPHINXBUILD% -b json %ALLSPHINXOPTS% %BUILDDIR%/json\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished; now you can process the JSON files.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"htmlhelp\" (\r\n\t%SPHINXBUILD% -b htmlhelp %ALLSPHINXOPTS% %BUILDDIR%/htmlhelp\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished; now you can run HTML Help Workshop with the ^\r\n.hhp project file in %BUILDDIR%/htmlhelp.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"qthelp\" (\r\n\t%SPHINXBUILD% -b qthelp %ALLSPHINXOPTS% %BUILDDIR%/qthelp\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished; now you can run \"qcollectiongenerator\" with the ^\r\n.qhcp project file in 
%BUILDDIR%/qthelp, like this:\r\n\techo.^> qcollectiongenerator %BUILDDIR%\\qthelp\\imbalanced-learn.qhcp\r\n\techo.To view the help file:\r\n\techo.^> assistant -collectionFile %BUILDDIR%\\qthelp\\imbalanced-learn.ghc\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"devhelp\" (\r\n\t%SPHINXBUILD% -b devhelp %ALLSPHINXOPTS% %BUILDDIR%/devhelp\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"epub\" (\r\n\t%SPHINXBUILD% -b epub %ALLSPHINXOPTS% %BUILDDIR%/epub\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished. The epub file is in %BUILDDIR%/epub.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"latex\" (\r\n\t%SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished; the LaTeX files are in %BUILDDIR%/latex.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"latexpdf\" (\r\n\t%SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex\r\n\tcd %BUILDDIR%/latex\r\n\tmake all-pdf\r\n\tcd %BUILDDIR%/..\r\n\techo.\r\n\techo.Build finished; the PDF files are in %BUILDDIR%/latex.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"latexpdfja\" (\r\n\t%SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex\r\n\tcd %BUILDDIR%/latex\r\n\tmake all-pdf-ja\r\n\tcd %BUILDDIR%/..\r\n\techo.\r\n\techo.Build finished; the PDF files are in %BUILDDIR%/latex.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"text\" (\r\n\t%SPHINXBUILD% -b text %ALLSPHINXOPTS% %BUILDDIR%/text\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished. The text files are in %BUILDDIR%/text.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"man\" (\r\n\t%SPHINXBUILD% -b man %ALLSPHINXOPTS% %BUILDDIR%/man\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished. The manual pages are in %BUILDDIR%/man.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"texinfo\" (\r\n\t%SPHINXBUILD% -b texinfo %ALLSPHINXOPTS% %BUILDDIR%/texinfo\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished. 
The Texinfo files are in %BUILDDIR%/texinfo.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"gettext\" (\r\n\t%SPHINXBUILD% -b gettext %I18NSPHINXOPTS% %BUILDDIR%/locale\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished. The message catalogs are in %BUILDDIR%/locale.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"changes\" (\r\n\t%SPHINXBUILD% -b changes %ALLSPHINXOPTS% %BUILDDIR%/changes\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.The overview file is in %BUILDDIR%/changes.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"linkcheck\" (\r\n\t%SPHINXBUILD% -b linkcheck %ALLSPHINXOPTS% %BUILDDIR%/linkcheck\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Link check complete; look for any errors in the above output ^\r\nor in %BUILDDIR%/linkcheck/output.txt.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"doctest\" (\r\n\t%SPHINXBUILD% -b doctest %ALLSPHINXOPTS% %BUILDDIR%/doctest\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Testing of doctests in the sources finished, look at the ^\r\nresults in %BUILDDIR%/doctest/output.txt.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"xml\" (\r\n\t%SPHINXBUILD% -b xml %ALLSPHINXOPTS% %BUILDDIR%/xml\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished. The XML files are in %BUILDDIR%/xml.\r\n\tgoto end\r\n)\r\n\r\nif \"%1\" == \"pseudoxml\" (\r\n\t%SPHINXBUILD% -b pseudoxml %ALLSPHINXOPTS% %BUILDDIR%/pseudoxml\r\n\tif errorlevel 1 exit /b 1\r\n\techo.\r\n\techo.Build finished. The pseudo-XML files are in %BUILDDIR%/pseudoxml.\r\n\tgoto end\r\n)\r\n\r\n:end\r\n"
  },
  {
    "path": "doc/metrics.rst",
    "content": ".. _metrics:\n\n=======\nMetrics\n=======\n\n.. currentmodule:: imblearn.metrics\n\nClassification metrics\n----------------------\n\nCurrently, scikit-learn only offers the\n``sklearn.metrics.balanced_accuracy_score`` (in 0.20) as metric to deal with\nimbalanced datasets. The module :mod:`imblearn.metrics` offers a couple of\nother metrics which are used in the literature to evaluate the quality of\nclassifiers.\n\n.. _sensitivity_specificity:\n\nSensitivity and specificity metrics\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nSensitivity and specificity are metrics which are well known in medical\nimaging. Sensitivity (also called true positive rate or recall) is the\nproportion of the positive samples which is well classified while specificity\n(also called true negative rate) is the proportion of the negative samples\nwhich are well classified. Therefore, depending of the field of application,\neither the sensitivity/specificity or the precision/recall pair of metrics are\nused.\n\nCurrently, only the `precision and recall metrics\n<http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html>`_\nare implemented in scikit-learn. :func:`sensitivity_specificity_support`,\n:func:`sensitivity_score`, and :func:`specificity_score` add the possibility to\nuse those metrics.\n\n.. _imbalanced_metrics:\n\nAdditional metrics specific to imbalanced datasets\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nThe :func:`geometric_mean_score`\n:cite:`barandela2003strategies,kubat1997addressing` is the root of the product\nof class-wise sensitivity. This measure tries to maximize the accuracy on each\nof the classes while keeping these accuracies balanced.\n\nThe :func:`make_index_balanced_accuracy` :cite:`garcia2012effectiveness` can\nwrap any metric and give more importance to a specific class using the\nparameter ``alpha``.\n\n.. 
_macro_averaged_mean_absolute_error:\n\nMacro-Averaged Mean Absolute Error (MA-MAE)\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nOrdinal classification is used when there is a rank among classes, for example\nlevels of functionality or movie ratings.\n\nThe :func:`macro_averaged_mean_absolute_error` :cite:`esuli2009ordinal` is used\nfor imbalanced ordinal classification. The mean absolute error is computed for\neach class and averaged over classes, giving an equal weight to each class.\n\n.. _classification_report:\n\nSummary of important metrics\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nThe :func:`classification_report_imbalanced` will compute a set of metrics per\nclass and summarize them in a table. The parameter `output_dict` controls\nwhether the report is returned as a string or as a Python dictionary. This\ndictionary can be reused to create a pandas DataFrame, for instance.\n\nThe bottom row (i.e. \"avg/total\") contains the average of each column weighted\nby the support (i.e. column \"sup\").\n\nNote that the weighted average of the class recalls is also known as the\nclassification accuracy.\n\n.. _pairwise_metrics:\n\nPairwise metrics\n----------------\n\nThe :mod:`imblearn.metrics.pairwise` submodule implements pairwise distances\nthat are not available in scikit-learn but are used in some of the methods in\nimbalanced-learn.\n\n.. _vdm:\n\nValue Difference Metric\n~~~~~~~~~~~~~~~~~~~~~~~\n\nThe class :class:`~imblearn.metrics.pairwise.ValueDifferenceMetric`\nimplements the Value Difference Metric proposed in\n:cite:`stanfill1986toward`. This measure is used to compute the proximity\nof two samples composed of only categorical values.\n\nGiven a single feature, categories with similar correlation with the target\nvector will be considered closer. Let's give an example to illustrate this\nbehaviour as given in :cite:`wilson1997improved`. 
`X` will be represented by a\nsingle feature which will be some color and the target will indicate\nwhether or not a sample is an apple::\n\n    >>> import numpy as np\n    >>> X = np.array([\"green\"] * 10 + [\"red\"] * 10 + [\"blue\"] * 10).reshape(-1, 1)\n    >>> y = [\"apple\"] * 8 + [\"not apple\"] * 5 + [\"apple\"] * 7 + [\"not apple\"] * 9 + [\"apple\"]\n\nIn this dataset, the categories \"red\" and \"green\" are more correlated to the\ntarget `y` and should be closer to each other than to the category \"blue\".\nLet's verify this behaviour. Be aware that we need to encode `X` to work with\nnumerical values::\n\n    >>> from sklearn.preprocessing import OrdinalEncoder\n    >>> encoder = OrdinalEncoder(dtype=np.int32)\n    >>> X_encoded = encoder.fit_transform(X)\n\nNow, we can compute the distance between three different samples representing\nthe different categories::\n\n    >>> from imblearn.metrics.pairwise import ValueDifferenceMetric\n    >>> vdm = ValueDifferenceMetric().fit(X_encoded, y)\n    >>> X_test = np.array([\"green\", \"red\", \"blue\"]).reshape(-1, 1)\n    >>> X_test_encoded = encoder.transform(X_test)\n    >>> vdm.pairwise(X_test_encoded)\n    array([[0.  ,  0.04,  1.96],\n           [0.04,  0.  ,  1.44],\n           [1.96,  1.44,  0.  ]])\n\nWe see that the minimum distance happens when the categories \"red\" and \"green\"\nare compared. Whenever comparing with \"blue\", the distance is much larger.\n\n**Mathematical formulation**\n\nThe distance between feature values of two samples is defined as:\n\n.. 
math::\n    \\delta(x, y) = \\sum_{c=1}^{C} |p(c|x_{f}) - p(c|y_{f})|^{k} \\ ,\n\nwhere :math:`x` and :math:`y` are two samples and :math:`f` a given\nfeature, :math:`C` is the number of classes, :math:`p(c|x_{f})` is the\nconditional probability that the output class is :math:`c` given that\nthe feature :math:`f` has the value :math:`x_{f}`, and :math:`k` is an\nexponent usually set to 1 or 2.\n\nThe distance for the feature vectors :math:`X` and :math:`Y` is\nsubsequently defined as:\n\n.. math::\n    \\Delta(X, Y) = \\sum_{f=1}^{F} \\delta(X_{f}, Y_{f})^{r} \\ ,\n\nwhere :math:`F` is the number of features and :math:`r` is an exponent usually\nset to 1 or 2.\n"
  },
  {
    "path": "doc/miscellaneous.rst",
    "content": ".. _miscellaneous:\n\n======================\nMiscellaneous samplers\n======================\n\n.. currentmodule:: imblearn\n\n.. _function_sampler:\n\nCustom samplers\n---------------\n\nA fully customized sampler, :class:`FunctionSampler`, is available in\nimbalanced-learn such that you can fast prototype your own sampler by defining\na single function. Additional parameters can be added using the attribute\n``kw_args`` which accepts a dictionary. The following example illustrates how\nto retain the 10 first elements of the array ``X`` and ``y``::\n\n  >>> import numpy as np\n  >>> from imblearn import FunctionSampler\n  >>> from sklearn.datasets import make_classification\n  >>> X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,\n  ...                            n_redundant=0, n_repeated=0, n_classes=3,\n  ...                            n_clusters_per_class=1,\n  ...                            weights=[0.01, 0.05, 0.94],\n  ...                            class_sep=0.8, random_state=0)\n  >>> def func(X, y):\n  ...   return X[:10], y[:10]\n  >>> sampler = FunctionSampler(func=func)\n  >>> X_res, y_res = sampler.fit_resample(X, y)\n  >>> np.all(X_res == X[:10])\n  True\n  >>> np.all(y_res == y[:10])\n  True\n\nIn addition, the parameter ``validate`` controls input checking. For instance,\nturning ``validate=False`` allows to pass any type of target ``y`` and do some\nsampling for regression targets::\n\n  >>> from sklearn.datasets import make_regression\n  >>> X_reg, y_reg = make_regression(n_samples=100, random_state=42)\n  >>> rng = np.random.RandomState(42)\n  >>> def dummy_sampler(X, y):\n  ...     indices = rng.choice(np.arange(X.shape[0]), size=10)\n  ...     
return X[indices], y[indices]\n  >>> sampler = FunctionSampler(func=dummy_sampler, validate=False)\n  >>> X_res, y_res = sampler.fit_resample(X_reg, y_reg)\n  >>> y_res\n  array([  41.49112498, -142.78526195,   85.55095317,  141.43321419,\n           75.46571114,  -67.49177372,  159.72700509, -169.80498923,\n          211.95889757,  211.95889757])\n\nWe illustrate the use of such a sampler to implement an outlier rejection\nestimator which can be easily used within a\n:class:`~imblearn.pipeline.Pipeline`:\n:ref:`sphx_glr_auto_examples_applications_plot_outlier_rejections.py`\n\n.. _generators:\n\nCustom generators\n-----------------\n\nImbalanced-learn provides specific generators for TensorFlow and Keras which\nwill generate balanced mini-batches.\n\n.. _tensorflow_generator:\n\nTensorFlow generator\n~~~~~~~~~~~~~~~~~~~~\n\nThe :func:`~imblearn.tensorflow.balanced_batch_generator` generates\nbalanced mini-batches using an imbalanced-learn sampler which returns indices.\n\nLet's first generate some data::\n\n  >>> n_features, n_classes = 10, 2\n  >>> X, y = make_classification(\n  ...     n_samples=10_000, n_features=n_features, n_informative=2,\n  ...     n_redundant=0, n_repeated=0, n_classes=n_classes,\n  ...     n_clusters_per_class=1, weights=[0.1, 0.9],\n  ...     class_sep=0.8, random_state=0\n  ... )\n  >>> X = X.astype(np.float32)\n\nThen, we can create the generator that will yield balanced mini-batches::\n\n  >>> from imblearn.under_sampling import RandomUnderSampler\n  >>> from imblearn.tensorflow import balanced_batch_generator\n  >>> training_generator, steps_per_epoch = balanced_batch_generator(\n  ...     X,\n  ...     y,\n  ...     sample_weight=None,\n  ...     sampler=RandomUnderSampler(),\n  ...     batch_size=32,\n  ...     random_state=42,\n  ... )\n\nThe ``training_generator`` and ``steps_per_epoch`` are used during the training\nof a TensorFlow model. We will illustrate how to use this generator. 
First, we can\ndefine a logistic regression model which will be optimized by a gradient\ndescent::\n\n  >>> import tensorflow as tf\n  >>> # initialize the weights and intercept\n  >>> normal_initializer = tf.random_normal_initializer(mean=0, stddev=0.01)\n  >>> coef = tf.Variable(normal_initializer(\n  ...     shape=[n_features, n_classes]), dtype=\"float32\"\n  ... )\n  >>> intercept = tf.Variable(\n  ...     normal_initializer(shape=[n_classes]), dtype=\"float32\"\n  ... )\n  >>> # define the model\n  >>> def logistic_regression(X):\n  ...     return tf.nn.softmax(tf.matmul(X, coef) + intercept)\n  >>> # define the loss function\n  >>> def cross_entropy(y_true, y_pred):\n  ...     y_true = tf.one_hot(y_true, depth=n_classes)\n  ...     y_pred = tf.clip_by_value(y_pred, 1e-9, 1.)\n  ...     return tf.reduce_mean(-tf.reduce_sum(y_true * tf.math.log(y_pred)))\n  >>> # define our metric\n  >>> def balanced_accuracy(y_true, y_pred):\n  ...     cm = tf.math.confusion_matrix(tf.cast(y_true, tf.int64), tf.argmax(y_pred, 1))\n  ...     per_class = np.diag(cm) / tf.math.reduce_sum(cm, axis=1)\n  ...     return np.mean(per_class)\n  >>> # define the optimizer\n  >>> optimizer = tf.optimizers.SGD(learning_rate=0.01)\n  >>> # define the optimization step\n  >>> def run_optimization(X, y):\n  ...     with tf.GradientTape() as g:\n  ...         y_pred = logistic_regression(X)\n  ...         loss = cross_entropy(y, y_pred)\n  ...     gradients = g.gradient(loss, [coef, intercept])\n  ...     optimizer.apply_gradients(zip(gradients, [coef, intercept]))\n\nOnce initialized, the model is trained by iterating on balanced mini-batches of\ndata and minimizing the loss previously defined::\n\n  >>> epochs = 10\n  >>> for e in range(epochs):\n  ...     y_pred = logistic_regression(X)\n  ...     loss = cross_entropy(y, y_pred)\n  ...     bal_acc = balanced_accuracy(y, y_pred)\n  ...     print(f\"epoch: {e}, loss: {loss:.3f}, accuracy: {bal_acc}\")\n  ...     
for i in range(steps_per_epoch):\n  ...         X_batch, y_batch = next(training_generator)\n  ...         run_optimization(X_batch, y_batch)\n  epoch: 0, ...\n\n.. _keras_generator:\n\nKeras generator\n~~~~~~~~~~~~~~~\n\nKeras provides a higher-level API in which a model can be defined and trained by\ncalling the ``fit`` method. To illustrate, we will\ndefine a logistic regression model::\n\n  >>> from tensorflow import keras\n  >>> y = keras.utils.to_categorical(y, 3)\n  >>> model = keras.Sequential()\n  >>> model.add(\n  ...     keras.layers.Dense(\n  ...         y.shape[1], input_dim=X.shape[1], activation='softmax'\n  ...     )\n  ... )\n  >>> model.compile(\n  ...     optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy']\n  ... )\n\n:func:`~imblearn.keras.balanced_batch_generator` creates a generator of balanced\nmini-batches and returns it together with the number of mini-batches generated\nper epoch::\n\n  >>> from imblearn.keras import balanced_batch_generator\n  >>> training_generator, steps_per_epoch = balanced_batch_generator(\n  ...     X, y, sampler=RandomUnderSampler(), batch_size=10, random_state=42\n  ... )\n\nThen, ``fit`` can be called passing the generator and the number of steps per\nepoch::\n\n  >>> callback_history = model.fit(\n  ...     training_generator,\n  ...     steps_per_epoch=steps_per_epoch,\n  ...     epochs=10,\n  ...     verbose=1,\n  ... )\n  Epoch 1/10 ...\n\nThe second possibility is to use\n:class:`~imblearn.keras.BalancedBatchGenerator`. Only an instance of this class\nwill be passed to ``fit``::\n\n  >>> from imblearn.keras import BalancedBatchGenerator\n  >>> training_generator = BalancedBatchGenerator(\n  ...     X, y, sampler=RandomUnderSampler(), batch_size=10, random_state=42\n  ... )\n  >>> callback_history = model.fit(\n  ...     training_generator,\n  ...     steps_per_epoch=steps_per_epoch,\n  ...     epochs=10,\n  ...     verbose=1,\n  ... )\n  Epoch 1/10 ...\n\n.. 
topic:: References\n\n  * :ref:`sphx_glr_auto_examples_applications_porto_seguro_keras_under_sampling.py`\n"
  },
  {
    "path": "doc/model_selection.rst",
    "content": ".. _cross_validation:\n\n================\nCross validation\n================\n\n.. currentmodule:: imblearn.model_selection\n\n\n.. _instance_hardness_threshold_cv:\n\nThe term instance hardness is used in literature to express the difficulty to correctly\nclassify an instance. An instance for which the predicted probability of the true class\nis low, has large instance hardness. The way these hard-to-classify instances are\ndistributed over train and test sets in cross validation, has significant effect on the\ntest set performance metrics. The :class:`~imblearn.model_selection.InstanceHardnessCV`\nsplitter distributes samples with large instance hardness equally over the folds,\nresulting in more robust cross validation.\n\nWe will discuss instance hardness in this document and explain how to use the\n:class:`~imblearn.model_selection.InstanceHardnessCV` splitter.\n\nInstance hardness and average precision\n=======================================\n\nInstance hardness is defined as 1 minus the probability of the most probable class:\n\n.. math::\n\n   H(x) = 1 - P(\\hat{y}|x)\n\nIn this equation :math:`H(x)` is the instance hardness for a sample with features\n:math:`x` and :math:`P(\\hat{y}|x)` the probability of predicted label :math:`\\hat{y}`\ngiven the features. If the model predicts label 0 and gives a `predict_proba` output\nof [0.9, 0.1], the probability of the most probable class (0) is 0.9 and the\ninstance hardness is `1-0.9=0.1`.\n\nSamples with large instance hardness have significant effect on the area under\nprecision-recall curve, or average precision. Especially samples with label 0\nwith large instance hardness (so the model predicts label 1) reduce the average\nprecision a lot as these points affect the precision-recall curve in the left\nwhere the area is largest; the precision is lowered in the range of low recall\nand high thresholds. When doing cross validation, e.g. 
in case of hyperparameter\ntuning or recursive feature elimination, the random gathering of these points in\nsome folds introduces variance in CV results, which deteriorates the robustness of the\ncross validation task. The :class:`~imblearn.model_selection.InstanceHardnessCV`\nsplitter aims to distribute the samples with large instance hardness over the\nfolds in order to reduce undesired variance. Note that one should use this\nsplitter to make model *selection* tasks robust, like hyperparameter tuning and\nfeature selection, but not for model *performance estimation*, for which you also\nwant to know the variance of performance to be expected in production.\n\n\nCreate imbalanced dataset with samples with large instance hardness\n===================================================================\n\nLet's start by creating a dataset to work with. We create a dataset with 5% class\nimbalance using scikit-learn's :func:`~sklearn.datasets.make_blobs` function.\n\n  >>> import numpy as np\n  >>> from matplotlib import pyplot as plt\n  >>> from sklearn.datasets import make_blobs\n  >>> from imblearn.datasets import make_imbalance\n  >>> random_state = 10\n  >>> X, y = make_blobs(n_samples=[950, 50], centers=((-3, 0), (3, 0)),\n  ...                   random_state=random_state)\n  >>> plt.scatter(X[:, 0], X[:, 1], c=y)\n  >>> plt.show()\n\n.. image:: ./auto_examples/model_selection/images/sphx_glr_plot_instance_hardness_cv_001.png\n   :target: ./auto_examples/model_selection/plot_instance_hardness_cv.html\n   :align: center\n\nNow we add some samples with large instance hardness:\n\n  >>> X_hard, y_hard = make_blobs(n_samples=10, centers=((3, 0), (-3, 0)),\n  ...                             cluster_std=1,\n  ...                             random_state=random_state)\n  >>> X = np.vstack((X, X_hard))\n  >>> y = np.hstack((y, y_hard))\n  >>> plt.scatter(X[:, 0], X[:, 1], c=y)\n  >>> plt.show()\n\n.. 
image:: ./auto_examples/model_selection/images/sphx_glr_plot_instance_hardness_cv_002.png\n   :target: ./auto_examples/model_selection/plot_instance_hardness_cv.html\n   :align: center\n\nAssess cross validation performance variance using `InstanceHardnessCV` splitter\n================================================================================\n\nThen we take a :class:`~sklearn.linear_model.LogisticRegression` and assess the\ncross validation performance using a :class:`~sklearn.model_selection.StratifiedKFold`\ncv splitter and the :func:`~sklearn.model_selection.cross_validate` function.\n\n  >>> from sklearn.linear_model import LogisticRegression\n  >>> from sklearn.model_selection import StratifiedKFold, cross_validate\n  >>> clf = LogisticRegression(random_state=random_state)\n  >>> skf_cv = StratifiedKFold(n_splits=5, shuffle=True,\n  ...                           random_state=random_state)\n  >>> skf_result = cross_validate(clf, X, y, cv=skf_cv, scoring=\"average_precision\")\n\nNow, we do the same using an :class:`~imblearn.model_selection.InstanceHardnessCV`\nsplitter. We provide our classifier to the splitter to calculate instance hardness\nand distribute samples with large instance hardness equally over the folds.\n\n  >>> from imblearn.model_selection import InstanceHardnessCV\n  >>> ih_cv = InstanceHardnessCV(estimator=clf, n_splits=5,\n  ...                               random_state=random_state)\n  >>> ih_result = cross_validate(clf, X, y, cv=ih_cv, scoring=\"average_precision\")\n\nWhen we plot the test scores for both cv splitters, we see that the variance using the\n:class:`~imblearn.model_selection.InstanceHardnessCV` splitter is lower than for the\n:class:`~sklearn.model_selection.StratifiedKFold` splitter.\n\n  >>> plt.boxplot([skf_result['test_score'], ih_result['test_score']],\n  ...               tick_labels=[\"StratifiedKFold\", \"InstanceHardnessCV\"],\n  ...               vert=False)\n  >>> plt.xlabel('Average precision')\n  >>> plt.tight_layout()\n\n.. 
image:: ./auto_examples/model_selection/images/sphx_glr_plot_instance_hardness_cv_003.png\n   :target: ./auto_examples/model_selection/plot_instance_hardness_cv.html\n   :align: center\n\nBe aware that the most important part of cross-validation splitters is to simulate the\nconditions that one will encounter in production. Therefore, if difficult samples are\nlikely to occur in production, one should use a cross-validation splitter that\nemulates this situation. In our case, the\n:class:`~sklearn.model_selection.StratifiedKFold` splitter did not distribute\nthe difficult samples evenly over the folds, which was likely a problem for our use case.\n"
  },
  {
    "path": "doc/over_sampling.rst",
    "content": ".. _over-sampling:\n\n=============\nOver-sampling\n=============\n\n.. currentmodule:: imblearn.over_sampling\n\nA practical guide\n=================\n\nYou can refer to\n:ref:`sphx_glr_auto_examples_over-sampling_plot_comparison_over_sampling.py`.\n\n.. _random_over_sampler:\n\nNaive random over-sampling\n--------------------------\n\nOne way to fight this issue is to generate new samples in the classes which are\nunder-represented. The most naive strategy is to generate new samples by\nrandomly sampling with replacement the current available samples. The\n:class:`RandomOverSampler` offers such scheme::\n\n   >>> from sklearn.datasets import make_classification\n   >>> X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,\n   ...                            n_redundant=0, n_repeated=0, n_classes=3,\n   ...                            n_clusters_per_class=1,\n   ...                            weights=[0.01, 0.05, 0.94],\n   ...                            class_sep=0.8, random_state=0)\n   >>> from imblearn.over_sampling import RandomOverSampler\n   >>> ros = RandomOverSampler(random_state=0)\n   >>> X_resampled, y_resampled = ros.fit_resample(X, y)\n   >>> from collections import Counter\n   >>> print(sorted(Counter(y_resampled).items()))\n   [(0, 4674), (1, 4674), (2, 4674)]\n\nThe augmented data set should be used instead of the original data set to train\na classifier::\n\n  >>> from sklearn.linear_model import LogisticRegression\n  >>> clf = LogisticRegression()\n  >>> clf.fit(X_resampled, y_resampled)\n  LogisticRegression(...)\n\nIn the figure below, we compare the decision functions of a classifier trained\nusing the over-sampled data set and the original data set.\n\n.. 
image:: ./auto_examples/over-sampling/images/sphx_glr_plot_comparison_over_sampling_002.png\n   :target: ./auto_examples/over-sampling/plot_comparison_over_sampling.html\n   :scale: 60\n   :align: center\n\nAs a result, the majority class does not take over the other classes during the\ntraining process. Consequently, all classes are represented by the decision\nfunction.\n\nIn addition, :class:`RandomOverSampler` can resample heterogeneous data\n(e.g. containing some strings)::\n\n  >>> import numpy as np\n  >>> X_hetero = np.array([['xxx', 1, 1.0], ['yyy', 2, 2.0], ['zzz', 3, 3.0]],\n  ...                     dtype=object)\n  >>> y_hetero = np.array([0, 0, 1])\n  >>> X_resampled, y_resampled = ros.fit_resample(X_hetero, y_hetero)\n  >>> print(X_resampled)\n  [['xxx' 1 1.0]\n   ['yyy' 2 2.0]\n   ['zzz' 3 3.0]\n   ['zzz' 3 3.0]]\n  >>> print(y_resampled)\n  [0 0 1 1]\n\nIt also works with a pandas DataFrame::\n\n  >>> from sklearn.datasets import fetch_openml\n  >>> df_adult, y_adult = fetch_openml(\n  ...     'adult', version=2, as_frame=True, return_X_y=True)\n  >>> df_adult.head()  # doctest: +SKIP\n  >>> df_resampled, y_resampled = ros.fit_resample(df_adult, y_adult)\n  >>> df_resampled.head()  # doctest: +SKIP\n\nIf repeating samples is an issue, the parameter `shrinkage` can be used to create a\nsmoothed bootstrap. However, the original data needs to be numerical. The\n`shrinkage` parameter controls the dispersion of the newly generated samples. The\nexample below illustrates that the new samples no longer overlap\nwhen using a smoothed bootstrap. This way of generating a smoothed bootstrap is\nalso known as Random Over-Sampling Examples\n(ROSE) :cite:`torelli2014rose`.\n\n.. image:: ./auto_examples/over-sampling/images/sphx_glr_plot_comparison_over_sampling_003.png\n   :target: ./auto_examples/over-sampling/plot_comparison_over_sampling.html\n   :scale: 60\n   :align: center\n\n.. 
_smote_adasyn:\n\nFrom random over-sampling to SMOTE and ADASYN\n---------------------------------------------\n\nApart from the random sampling with replacement, there are two popular methods\nto over-sample minority classes: (i) the Synthetic Minority Oversampling\nTechnique (SMOTE) :cite:`chawla2002smote` and (ii) the Adaptive Synthetic\n(ADASYN) :cite:`he2008adasyn` sampling method. These algorithms can be used in\nthe same manner::\n\n  >>> from imblearn.over_sampling import SMOTE, ADASYN\n  >>> X_resampled, y_resampled = SMOTE().fit_resample(X, y)\n  >>> print(sorted(Counter(y_resampled).items()))\n  [(0, 4674), (1, 4674), (2, 4674)]\n  >>> clf_smote = LogisticRegression().fit(X_resampled, y_resampled)\n  >>> X_resampled, y_resampled = ADASYN().fit_resample(X, y)\n  >>> print(sorted(Counter(y_resampled).items()))\n  [(0, 4673), (1, 4662), (2, 4674)]\n  >>> clf_adasyn = LogisticRegression().fit(X_resampled, y_resampled)\n\nThe figure below illustrates the major differences between the\nover-sampling methods.\n\n.. image:: ./auto_examples/over-sampling/images/sphx_glr_plot_comparison_over_sampling_004.png\n   :target: ./auto_examples/over-sampling/plot_comparison_over_sampling.html\n   :scale: 60\n   :align: center\n\nIll-posed examples\n------------------\n\nWhile the :class:`RandomOverSampler` is over-sampling by duplicating some of\nthe original samples of the minority class, :class:`SMOTE` and :class:`ADASYN`\ngenerate new samples by interpolation. However, the samples used to\ninterpolate/generate new synthetic samples differ. In fact, :class:`ADASYN`\nfocuses on generating samples next to the original samples which are wrongly\nclassified using a k-Nearest Neighbors classifier while the basic\nimplementation of :class:`SMOTE` will not make any distinction between easy and\nhard samples to be classified using the nearest neighbors rule. Therefore, the\ndecision function found during training will be different among the algorithms.\n\n.. 
image:: ./auto_examples/over-sampling/images/sphx_glr_plot_comparison_over_sampling_005.png\n   :target: ./auto_examples/over-sampling/plot_comparison_over_sampling.html\n   :align: center\n\nThe sampling particularities of these two algorithms can lead to some peculiar\nbehavior as shown below.\n\n.. image:: ./auto_examples/over-sampling/images/sphx_glr_plot_comparison_over_sampling_006.png\n   :target: ./auto_examples/over-sampling/plot_comparison_over_sampling.html\n   :scale: 60\n   :align: center\n\nSMOTE variants\n--------------\n\nSMOTE might connect inliers and outliers while ADASYN might focus solely on\noutliers which, in both cases, might lead to a sub-optimal decision\nfunction. In this regard, SMOTE offers three additional options to generate\nsamples. Those methods focus on samples near the border of the optimal\ndecision function and will generate samples in the opposite direction of the\nnearest neighbors class. Those variants are presented in the figure below.\n\n.. image:: ./auto_examples/over-sampling/images/sphx_glr_plot_comparison_over_sampling_007.png\n   :target: ./auto_examples/over-sampling/plot_comparison_over_sampling.html\n   :scale: 60\n   :align: center\n\n\nThe :class:`BorderlineSMOTE` :cite:`han2005borderline`,\n:class:`SVMSMOTE` :cite:`nguyen2009borderline`, and\n:class:`KMeansSMOTE` :cite:`last2017oversampling` are variants of the\nSMOTE algorithm::\n\n  >>> from imblearn.over_sampling import BorderlineSMOTE\n  >>> X_resampled, y_resampled = BorderlineSMOTE().fit_resample(X, y)\n  >>> print(sorted(Counter(y_resampled).items()))\n  [(0, 4674), (1, 4674), (2, 4674)]\n\nWhen dealing with mixed data types such as continuous and categorical features,\nnone of the presented methods (apart from the class :class:`RandomOverSampler`)\ncan deal with the categorical features. 
The :class:`SMOTENC`\n:cite:`chawla2002smote` is an extension of the :class:`SMOTE` algorithm for\nwhich categorical data are treated differently::\n\n  >>> # create a synthetic data set with continuous and categorical features\n  >>> rng = np.random.RandomState(42)\n  >>> n_samples = 50\n  >>> X = np.empty((n_samples, 3), dtype=object)\n  >>> X[:, 0] = rng.choice(['A', 'B', 'C'], size=n_samples).astype(object)\n  >>> X[:, 1] = rng.randn(n_samples)\n  >>> X[:, 2] = rng.randint(3, size=n_samples)\n  >>> y = np.array([0] * 20 + [1] * 30)\n  >>> print(sorted(Counter(y).items()))\n  [(0, 20), (1, 30)]\n\nIn this data set, the first and last features are considered as categorical\nfeatures. One needs to provide this information to :class:`SMOTENC` via the\nparameter ``categorical_features`` either by passing the indices, the feature\nnames when `X` is a pandas DataFrame, a boolean mask marking these features,\nor relying on `dtype` inference if the columns are using the\n:class:`pandas.CategoricalDtype`::\n\n  >>> from imblearn.over_sampling import SMOTENC\n  >>> smote_nc = SMOTENC(categorical_features=[0, 2], random_state=0)\n  >>> X_resampled, y_resampled = smote_nc.fit_resample(X, y)\n  >>> print(sorted(Counter(y_resampled).items()))\n  [(0, 30), (1, 30)]\n  >>> print(X_resampled[-5:])\n  [['A' 0.19... 2]\n   ['B' -0.36... 2]\n   ['B' 0.87... 2]\n   ['B' 0.37... 2]\n   ['B' 0.33... 2]]\n\nTherefore, it can be seen that the values generated in the first and last\ncolumns belong to the categories originally present, without any\ninterpolation.\n\nHowever, :class:`SMOTENC` only works when the data is a mix of numerical and\ncategorical features. If the data are made of only categorical features, one can use\nthe :class:`SMOTEN` variant :cite:`chawla2002smote`. The algorithm changes in\ntwo ways:\n\n* the nearest neighbors search does not rely on the Euclidean distance. 
Indeed,\n  the value difference metric (VDM) also implemented in the class\n  :class:`~imblearn.metrics.pairwise.ValueDifferenceMetric` is used.\n* a new sample is generated where each feature value corresponds to the most\n  common category seen in the neighboring samples belonging to the same class.\n\nLet's take the following example::\n\n   >>> import numpy as np\n   >>> X = np.array([\"green\"] * 5 + [\"red\"] * 10 + [\"blue\"] * 7,\n   ...              dtype=object).reshape(-1, 1)\n   >>> y = np.array([\"apple\"] * 5 + [\"not apple\"] * 3 + [\"apple\"] * 7 +\n   ...              [\"not apple\"] * 5 + [\"apple\"] * 2, dtype=object)\n\nWe generate a dataset associating a color to being an apple or not an apple.\nWe strongly associated \"green\" and \"red\" to being an apple. The minority class\nbeing \"not apple\", we expect the newly generated data to belong to the category\n\"blue\"::\n\n   >>> from imblearn.over_sampling import SMOTEN\n   >>> sampler = SMOTEN(random_state=0)\n   >>> X_res, y_res = sampler.fit_resample(X, y)\n   >>> X_res[y.size:]\n   array([['blue'],\n           ['blue'],\n           ['blue'],\n           ['blue'],\n           ['blue'],\n           ['blue']], dtype=object)\n   >>> y_res[y.size:]\n   array(['not apple', 'not apple', 'not apple', 'not apple', 'not apple',\n          'not apple'], dtype=object)\n\nMathematical formulation\n========================\n\nSample generation\n-----------------\n\nBoth :class:`SMOTE` and :class:`ADASYN` use the same algorithm to generate new\nsamples. Considering a sample :math:`x_i`, a new sample :math:`x_{new}` will be\ngenerated considering its k nearest-neighbors (corresponding to\n``k_neighbors``). For instance, the 3 nearest-neighbors are included in the\nblue circle as illustrated in the figure below. Then, one of these\nnearest-neighbors :math:`x_{zi}` is selected and a sample is generated as\nfollows:\n\n.. 
math::\n\n   x_{new} = x_i + \\lambda \\times (x_{zi} - x_i)\n\nwhere :math:`\\lambda` is a random number in the range :math:`[0, 1]`. This\ninterpolation will create a sample on the line between :math:`x_{i}` and\n:math:`x_{zi}` as illustrated in the image below:\n\n.. image:: ./auto_examples/over-sampling/images/sphx_glr_plot_illustration_generation_sample_001.png\n   :target: ./auto_examples/over-sampling/plot_illustration_generation_sample.html\n   :scale: 60\n   :align: center\n\nSMOTE-NC slightly changes the way a new sample is generated by treating the\ncategorical features specifically. In fact, the categories of a\nnewly generated sample are decided by picking the most frequent category of the\nnearest neighbors present during the generation.\n\n.. warning::\n   Be aware that SMOTE-NC is not designed to work with only categorical data.\n\nThe other SMOTE variants and ADASYN differ from each other in how they select the\nsamples :math:`x_i` ahead of generating the new samples.\n\nThe **regular** SMOTE algorithm --- cf. to the :class:`SMOTE` object --- does not\nimpose any rule and will randomly pick up all possible :math:`x_i` available.\n\nThe **borderline** SMOTE --- cf. to the :class:`BorderlineSMOTE` with the\nparameters ``kind='borderline-1'`` and ``kind='borderline-2'`` --- will\nclassify each sample :math:`x_i` to be (i) noise (i.e. all nearest-neighbors\nare from a different class than the one of :math:`x_i`), (ii) in danger\n(i.e. at least half, but not all, of the nearest neighbors are from a different\nclass than :math:`x_i`), or (iii) safe (i.e. all nearest neighbors are from the same class\nas :math:`x_i`). **Borderline-1** and **Borderline-2** SMOTE will use the\nsamples *in danger* to generate new samples. In **Borderline-1** SMOTE,\n:math:`x_{zi}` will belong to the same class as the sample\n:math:`x_i`. On the contrary, **Borderline-2** SMOTE will consider\n:math:`x_{zi}` which can be from any class.\n\n**SVM** SMOTE --- cf. 
to :class:`SVMSMOTE` --- uses an SVM classifier to find\nsupport vectors and generate samples considering them. Note that the ``C``\nparameter of the SVM classifier controls how many support vectors are selected.\n\nFor both borderline and SVM SMOTE, a neighborhood is defined using the\nparameter ``m_neighbors`` to decide if a sample is in danger, safe, or noise.\n\n**KMeans** SMOTE --- cf. to :class:`KMeansSMOTE` --- uses a KMeans clustering\nmethod before applying SMOTE. The clustering will group samples together and\ngenerate new samples depending on the cluster density.\n\nADASYN works similarly to the regular SMOTE. However, the number of\nsamples generated for each :math:`x_i` is proportional to the number of samples\nwhich are not from the same class as :math:`x_i` in a given\nneighborhood. Therefore, more samples will be generated in areas where the\nnearest neighbor rule is not respected. The parameter ``n_neighbors`` is\nequivalent to ``k_neighbors`` in :class:`SMOTE`.\n\nMulti-class management\n----------------------\n\nAll algorithms can be used for multi-class as well as binary\nclassification.  :class:`RandomOverSampler` does not require any inter-class\ninformation during the sample generation. Therefore, each targeted class is\nresampled independently. On the contrary, both :class:`ADASYN` and\n:class:`SMOTE` need information regarding the neighbourhood of each sample used\nfor sample generation. They use a one-vs-rest approach by selecting each\ntargeted class and computing the necessary statistics against the rest of the\ndata set, which is grouped into a single class.\n"
  },
  {
    "path": "doc/references/combine.rst",
    "content": ".. _combine_ref:\n\nCombination of over- and under-sampling methods\n===============================================\n\n.. automodule:: imblearn.combine\n   :no-members:\n   :no-inherited-members:\n\n.. currentmodule:: imblearn.combine\n\n.. autosummary::\n   :toctree: generated/\n   :template: class.rst\n\n   SMOTEENN\n   SMOTETomek\n"
  },
  {
    "path": "doc/references/datasets.rst",
    "content": ".. _datasets_ref:\n\nDatasets\n========\n\n.. automodule:: imblearn.datasets\n    :no-members:\n    :no-inherited-members:\n\n.. currentmodule:: imblearn.datasets\n\n.. autosummary::\n   :toctree: generated/\n   :template: function.rst\n\n   make_imbalance\n   fetch_datasets\n"
  },
  {
    "path": "doc/references/ensemble.rst",
    "content": ".. _ensemble_ref:\n\nEnsemble methods\n================\n\n.. automodule:: imblearn.ensemble\n    :no-members:\n    :no-inherited-members:\n\n.. currentmodule:: imblearn.ensemble\n\nBoosting algorithms\n-------------------\n\n.. autosummary::\n   :toctree: generated/\n   :template: class.rst\n\n   EasyEnsembleClassifier\n   RUSBoostClassifier\n\nBagging algorithms\n------------------\n\n.. autosummary::\n   :toctree: generated/\n   :template: class.rst\n\n   BalancedBaggingClassifier\n   BalancedRandomForestClassifier\n"
  },
  {
    "path": "doc/references/index.rst",
    "content": ".. _api:\n\n#############\nAPI reference\n#############\n\nThis is the full API documentation of the `imbalanced-learn` toolbox.\n\n.. toctree::\n   :maxdepth: 3\n\n   under_sampling\n   over_sampling\n   combine\n   ensemble\n   keras\n   tensorflow\n   miscellaneous\n   pipeline\n   metrics\n   model_selection\n   datasets\n   utils\n"
  },
  {
    "path": "doc/references/keras.rst",
    "content": ".. _keras_ref:\n\nBatch generator for Keras\n=========================\n\n.. automodule:: imblearn.keras\n    :no-members:\n    :no-inherited-members:\n\n.. currentmodule:: imblearn\n\n.. autosummary::\n   :toctree: generated/\n   :template: class.rst\n\n   keras.BalancedBatchGenerator\n\n.. autosummary::\n   :toctree: generated/\n   :template: function.rst\n\n   keras.balanced_batch_generator\n"
  },
  {
    "path": "doc/references/metrics.rst",
    "content": ".. _metrics_ref:\n\nMetrics\n=======\n\n.. automodule:: imblearn.metrics\n   :no-members:\n   :no-inherited-members:\n\nClassification metrics\n----------------------\nSee the :ref:`metrics` section of the user guide for further details.\n\n.. currentmodule:: imblearn.metrics\n\n.. autosummary::\n   :toctree: generated/\n   :template: function.rst\n\n   classification_report_imbalanced\n   sensitivity_specificity_support\n   sensitivity_score\n   specificity_score\n   geometric_mean_score\n   macro_averaged_mean_absolute_error\n   make_index_balanced_accuracy\n\nPairwise metrics\n----------------\nSee the :ref:`pairwise_metrics` section of the user guide for further details.\n\n.. automodule:: imblearn.metrics.pairwise\n   :no-members:\n   :no-inherited-members:\n\n.. currentmodule:: imblearn.metrics.pairwise\n\n.. autosummary::\n   :toctree: generated/\n   :template: class.rst\n\n   ValueDifferenceMetric\n"
  },
  {
    "path": "doc/references/miscellaneous.rst",
    "content": ".. _misc_ref:\n\nMiscellaneous\n=============\n\nImbalanced-learn provides some fast-prototyping tools.\n\n.. currentmodule:: imblearn\n\n.. autosummary::\n   :toctree: generated/\n   :template: class.rst\n\n   FunctionSampler\n"
  },
  {
    "path": "doc/references/model_selection.rst",
    "content": ".. _model_selection_ref:\n\nModel selection methods\n=======================\n\n.. automodule:: imblearn.model_selection\n    :no-members:\n    :no-inherited-members:\n\nCross-validation splitters\n--------------------------\n\n.. automodule:: imblearn.model_selection._split\n   :no-members:\n   :no-inherited-members:\n\n.. currentmodule:: imblearn.model_selection\n\n.. autosummary::\n   :toctree: generated/\n   :template: class.rst\n\n   InstanceHardnessCV\n"
  },
  {
    "path": "doc/references/over_sampling.rst",
    "content": ".. _over_sampling_ref:\n\nOver-sampling methods\n=====================\n\n.. automodule:: imblearn.over_sampling\n    :no-members:\n    :no-inherited-members:\n\n.. currentmodule:: imblearn.over_sampling\n\nBasic over-sampling\n-------------------\n\n.. autosummary::\n   :toctree: generated/\n   :template: class.rst\n\n   RandomOverSampler\n\nSMOTE algorithms\n----------------\n\n.. autosummary::\n   :toctree: generated/\n   :template: class.rst\n\n   SMOTE\n   SMOTENC\n   SMOTEN\n   ADASYN\n   BorderlineSMOTE\n   KMeansSMOTE\n   SVMSMOTE\n"
  },
  {
    "path": "doc/references/pipeline.rst",
    "content": ".. _pipeline_ref:\n\nPipeline\n========\n\n.. automodule:: imblearn.pipeline\n    :no-members:\n    :no-inherited-members:\n\n.. currentmodule:: imblearn.pipeline\n\n.. autosummary::\n   :toctree: generated/\n   :template: class.rst\n\n   Pipeline\n\n.. autosummary::\n   :toctree: generated/\n   :template: function.rst\n\n   make_pipeline\n"
  },
  {
    "path": "doc/references/tensorflow.rst",
    "content": ".. _tensorflow_ref:\n\nBatch generator for TensorFlow\n==============================\n\n.. automodule:: imblearn.tensorflow\n    :no-members:\n    :no-inherited-members:\n\n.. currentmodule:: imblearn\n\n.. autosummary::\n   :toctree: generated/\n   :template: function.rst\n\n   tensorflow.balanced_batch_generator\n"
  },
  {
    "path": "doc/references/under_sampling.rst",
    "content": ".. _under_sampling_ref:\n\nUnder-sampling methods\n======================\n\n.. automodule:: imblearn.under_sampling\n    :no-members:\n    :no-inherited-members:\n\nPrototype generation\n--------------------\n\n.. automodule:: imblearn.under_sampling._prototype_generation\n   :no-members:\n   :no-inherited-members:\n\n.. currentmodule:: imblearn.under_sampling\n\n.. autosummary::\n   :toctree: generated/\n   :template: class.rst\n\n   ClusterCentroids\n\nPrototype selection\n-------------------\n\n.. automodule:: imblearn.under_sampling._prototype_selection\n   :no-members:\n   :no-inherited-members:\n\n.. currentmodule:: imblearn.under_sampling\n\n.. autosummary::\n   :toctree: generated/\n   :template: class.rst\n\n   CondensedNearestNeighbour\n   EditedNearestNeighbours\n   RepeatedEditedNearestNeighbours\n   AllKNN\n   InstanceHardnessThreshold\n   NearMiss\n   NeighbourhoodCleaningRule\n   OneSidedSelection\n   RandomUnderSampler\n   TomekLinks\n"
  },
  {
    "path": "doc/references/utils.rst",
    "content": "Utilities\n=========\n\n.. automodule:: imblearn.utils\n    :no-members:\n    :no-inherited-members:\n\n.. currentmodule:: imblearn.utils\n\nValidation checks used in samplers\n----------------------------------\n\n.. autosummary::\n   :toctree: generated/\n   :template: function.rst\n\n   estimator_checks.parametrize_with_checks\n   check_neighbors_object\n   check_sampling_strategy\n   check_target_type\n\nTesting compatibility of your own sampler\n-----------------------------------------\n\n.. automodule:: imblearn.utils.estimator_checks\n    :no-members:\n    :no-inherited-members:\n\n.. currentmodule:: imblearn.utils.estimator_checks\n\n.. autosummary::\n   :toctree: generated/\n   :template: function.rst\n\n   parametrize_with_checks\n"
  },
  {
    "path": "doc/sphinxext/LICENSE.txt",
    "content": "-------------------------------------------------------------------------------\n    The files\n    - numpydoc.py\n    - autosummary.py\n    - autosummary_generate.py\n    - docscrape.py\n    - docscrape_sphinx.py\n    - phantom_import.py\n    have the following license:\n\nCopyright (C) 2008 Stefan van der Walt <stefan@mentat.za.net>, Pauli Virtanen <pav@iki.fi>\n\nRedistribution and use in source and binary forms, with or without\nmodification, are permitted provided that the following conditions are\nmet:\n\n 1. Redistributions of source code must retain the above copyright\n    notice, this list of conditions and the following disclaimer.\n 2. Redistributions in binary form must reproduce the above copyright\n    notice, this list of conditions and the following disclaimer in\n    the documentation and/or other materials provided with the\n    distribution.\n\nTHIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR\nIMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED\nWARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE\nDISCLAIMED. 
IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT,\nINDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES\n(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR\nSERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)\nHOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,\nSTRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING\nIN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE\nPOSSIBILITY OF SUCH DAMAGE.\n\n-------------------------------------------------------------------------------\n    The files\n    - compiler_unparse.py\n    - comment_eater.py\n    - traitsdoc.py\n    have the following license:\n\nThis software is OSI Certified Open Source Software.\nOSI Certified is a certification mark of the Open Source Initiative.\n\nCopyright (c) 2006, Enthought, Inc.\nAll rights reserved.\n\nRedistribution and use in source and binary forms, with or without\nmodification, are permitted provided that the following conditions are met:\n\n * Redistributions of source code must retain the above copyright notice, this\n   list of conditions and the following disclaimer.\n * Redistributions in binary form must reproduce the above copyright notice,\n   this list of conditions and the following disclaimer in the documentation\n   and/or other materials provided with the distribution.\n * Neither the name of Enthought, Inc. nor the names of its contributors may\n   be used to endorse or promote products derived from this software without\n   specific prior written permission.\n\nTHIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS \"AS IS\" AND\nANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED\nWARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE\nDISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR\nANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES\n(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;\nLOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON\nANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT\n(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS\nSOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.\n\n\n-------------------------------------------------------------------------------\n    The files\n    - only_directives.py\n    - plot_directive.py\n    originate from Matplotlib (http://matplotlib.sf.net/) which has\n    the following license:\n\nCopyright (c) 2002-2008 John D. Hunter; All Rights Reserved.\n\n1. This LICENSE AGREEMENT is between John D. Hunter (“JDH”), and the Individual or Organization (“Licensee”) accessing and otherwise using matplotlib software in source or binary form and its associated documentation.\n\n2. Subject to the terms and conditions of this License Agreement, JDH hereby grants Licensee a nonexclusive, royalty-free, world-wide license to reproduce, analyze, test, perform and/or display publicly, prepare derivative works, distribute, and otherwise use matplotlib 0.98.3 alone or in any derivative version, provided, however, that JDH’s License Agreement and JDH’s notice of copyright, i.e., “Copyright (c) 2002-2008 John D. Hunter; All Rights Reserved” are retained in matplotlib 0.98.3 alone or in any derivative version prepared by Licensee.\n\n3. In the event Licensee prepares a derivative work that is based on or incorporates matplotlib 0.98.3 or any part thereof, and wants to make the derivative work available to others as provided herein, then Licensee hereby agrees to include in any such work a brief summary of the changes made to matplotlib 0.98.3.\n\n4. 
JDH is making matplotlib 0.98.3 available to Licensee on an “AS IS” basis. JDH MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, JDH MAKES NO AND DISCLAIMS ANY REPRESENTATION OR WARRANTY OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF MATPLOTLIB 0.98.3 WILL NOT INFRINGE ANY THIRD PARTY RIGHTS.\n\n5. JDH SHALL NOT BE LIABLE TO LICENSEE OR ANY OTHER USERS OF MATPLOTLIB 0.98.3 FOR ANY INCIDENTAL, SPECIAL, OR CONSEQUENTIAL DAMAGES OR LOSS AS A RESULT OF MODIFYING, DISTRIBUTING, OR OTHERWISE USING MATPLOTLIB 0.98.3, OR ANY DERIVATIVE THEREOF, EVEN IF ADVISED OF THE POSSIBILITY THEREOF.\n\n6. This License Agreement will automatically terminate upon a material breach of its terms and conditions.\n\n7. Nothing in this License Agreement shall be deemed to create any relationship of agency, partnership, or joint venture between JDH and Licensee. This License Agreement does not grant permission to use JDH trademarks or trade name in a trademark sense to endorse or promote products or services of Licensee, or any third party.\n\n8. By copying, installing or otherwise using matplotlib 0.98.3, Licensee agrees to be bound by the terms and conditions of this License Agreement.\n"
  },
  {
    "path": "doc/sphinxext/MANIFEST.in",
    "content": "recursive-include tests *.py\ninclude *.txt\n"
  },
  {
    "path": "doc/sphinxext/README.txt",
    "content": "=====================================\nnumpydoc -- Numpy's Sphinx extensions\n=====================================\n\nNumpy's documentation uses several custom extensions to Sphinx.  These\nare shipped in this ``numpydoc`` package, in case you want to make use\nof them in third-party projects.\n\nThe following extensions are available:\n\n  - ``numpydoc``: supports the Numpy docstring format in Sphinx, and adds\n    the code description directives ``np-function``, ``np-cfunction``, etc.\n    that support the Numpy docstring syntax.\n\n  - ``numpydoc.traitsdoc``: For gathering documentation about Traits attributes.\n\n  - ``numpydoc.plot_directives``: Adaptation of Matplotlib's ``plot::``\n    directive. Note that this implementation may still undergo severe\n    changes or eventually be deprecated.\n\n  - ``numpydoc.only_directives``: (DEPRECATED)\n\n  - ``numpydoc.autosummary``: (DEPRECATED) An ``autosummary::`` directive.\n    Available in Sphinx 0.6.2 and (to-be) 1.0 as ``sphinx.ext.autosummary``,\n    and the Sphinx 1.0 version is recommended over that included in\n    Numpydoc.\n\n\nnumpydoc\n========\n\nNumpydoc inserts a hook into Sphinx's autodoc that converts docstrings\nfollowing the Numpy/Scipy format to a form palatable to Sphinx.\n\nOptions\n-------\n\nThe following options can be set in conf.py:\n\n- numpydoc_use_plots: bool\n\n  Whether to produce ``plot::`` directives for Examples sections that\n  contain ``import matplotlib``.\n\n- numpydoc_show_class_members: bool\n\n  Whether to show all members of a class in the Methods and Attributes\n  sections automatically.\n\n- numpydoc_edit_link: bool  (DEPRECATED -- edit your HTML template instead)\n\n  Whether to insert an edit link after docstrings.\n"
  },
  {
    "path": "doc/sphinxext/github_link.py",
    "content": "import inspect\nimport os\nimport subprocess\nimport sys\nfrom functools import partial\nfrom operator import attrgetter\n\nREVISION_CMD = \"git rev-parse --short HEAD\"\n\n\ndef _get_git_revision():\n    try:\n        revision = subprocess.check_output(REVISION_CMD.split()).strip()\n    except (subprocess.CalledProcessError, OSError):\n        print(\"Failed to execute git to get revision\")\n        return None\n    return revision.decode(\"utf-8\")\n\n\ndef _linkcode_resolve(domain, info, package, url_fmt, revision):\n    \"\"\"Determine a link to online source for a class/method/function\n\n    This is called by sphinx.ext.linkcode\n\n    An example with a long-untouched module that everyone has\n    >>> _linkcode_resolve('py', {'module': 'tty',\n    ...                          'fullname': 'setraw'},\n    ...                   package='tty',\n    ...                   url_fmt='https://hg.python.org/cpython/file/'\n    ...                           '{revision}/Lib/{package}/{path}#L{lineno}',\n    ...                   
revision='xxxx')\n    'https://hg.python.org/cpython/file/xxxx/Lib/tty/tty.py#L18'\n    \"\"\"\n\n    if revision is None:\n        return\n    if domain not in (\"py\", \"pyx\"):\n        return\n    if not info.get(\"module\") or not info.get(\"fullname\"):\n        return\n\n    class_name = info[\"fullname\"].split(\".\")[0]\n    module = __import__(info[\"module\"], fromlist=[class_name])\n    obj = attrgetter(info[\"fullname\"])(module)\n\n    # Unwrap the object to get the correct source\n    # file in case that is wrapped by a decorator\n    obj = inspect.unwrap(obj)\n\n    try:\n        fn = inspect.getsourcefile(obj)\n    except Exception:\n        fn = None\n    if not fn:\n        try:\n            fn = inspect.getsourcefile(sys.modules[obj.__module__])\n        except Exception:\n            fn = None\n    if not fn:\n        return\n\n    fn = os.path.relpath(fn, start=os.path.dirname(__import__(package).__file__))\n    try:\n        lineno = inspect.getsourcelines(obj)[1]\n    except Exception:\n        lineno = \"\"\n    return url_fmt.format(revision=revision, package=package, path=fn, lineno=lineno)\n\n\ndef make_linkcode_resolve(package, url_fmt):\n    \"\"\"Returns a linkcode_resolve function for the given URL format\n\n    revision is a git commit reference (hash or name)\n\n    package is the name of the root module of the package\n\n    url_fmt is along the lines of ('https://github.com/USER/PROJECT/'\n                                   'blob/{revision}/{package}/'\n                                   '{path}#L{lineno}')\n    \"\"\"\n    revision = _get_git_revision()\n    return partial(\n        _linkcode_resolve, revision=revision, package=package, url_fmt=url_fmt\n    )\n"
  },
  {
    "path": "doc/sphinxext/sphinx_issues.py",
    "content": "\"\"\"A Sphinx extension for linking to your project's issue tracker.\n\nCopyright 2014 Steven Loria\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\nThe above copyright notice and this permission notice shall be included in\nall copies or substantial portions of the Software.\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN\nTHE SOFTWARE.\n\"\"\"\nimport re\n\nfrom docutils import nodes, utils\nfrom sphinx.util.nodes import split_explicit_title\n\n__version__ = \"1.2.0\"\n__author__ = \"Steven Loria\"\n__license__ = \"MIT\"\n\n\ndef user_role(name, rawtext, text, lineno, inliner, options=None, content=None):\n    \"\"\"Sphinx role for linking to a user profile. 
Defaults to linking to\n    Github profiles, but the profile URIS can be configured via the\n    ``issues_user_uri`` config value.\n    Examples: ::\n        :user:`sloria`\n    Anchor text also works: ::\n        :user:`Steven Loria <sloria>`\n    \"\"\"\n    options = options or {}\n    content = content or []\n    has_explicit_title, title, target = split_explicit_title(text)\n\n    target = utils.unescape(target).strip()\n    title = utils.unescape(title).strip()\n    config = inliner.document.settings.env.app.config\n    if config.issues_user_uri:\n        ref = config.issues_user_uri.format(user=target)\n    else:\n        ref = f\"https://github.com/{target}\"\n    if has_explicit_title:\n        text = title\n    else:\n        text = f\"@{target}\"\n\n    link = nodes.reference(text=text, refuri=ref, **options)\n    return [link], []\n\n\ndef cve_role(name, rawtext, text, lineno, inliner, options=None, content=None):\n    \"\"\"Sphinx role for linking to a CVE on https://cve.mitre.org.\n    Examples: ::\n        :cve:`CVE-2018-17175`\n    \"\"\"\n    options = options or {}\n    content = content or []\n    has_explicit_title, title, target = split_explicit_title(text)\n\n    target = utils.unescape(target).strip()\n    title = utils.unescape(title).strip()\n    ref = f\"https://cve.mitre.org/cgi-bin/cvename.cgi?name={target}\"\n    text = title if has_explicit_title else target\n    link = nodes.reference(text=text, refuri=ref, **options)\n    return [link], []\n\n\nclass IssueRole:\n    EXTERNAL_REPO_REGEX = re.compile(r\"^(\\w+)/(.+)([#@])([\\w]+)$\")\n\n    def __init__(\n        self,\n        uri_config_option,\n        format_kwarg,\n        github_uri_template,\n        format_text=None,\n    ):\n        self.uri_config_option = uri_config_option\n        self.format_kwarg = format_kwarg\n        self.github_uri_template = github_uri_template\n        self.format_text = format_text or self.default_format_text\n\n    @staticmethod\n    def 
default_format_text(issue_no):\n        return f\"#{issue_no}\"\n\n    def make_node(self, name, issue_no, config, options=None):\n        name_map = {\"pr\": \"pull\", \"issue\": \"issues\", \"commit\": \"commit\"}\n        options = options or {}\n        repo_match = self.EXTERNAL_REPO_REGEX.match(issue_no)\n        if repo_match:  # External repo\n            username, repo, symbol, issue = repo_match.groups()\n            if name not in name_map:\n                raise ValueError(f\"External repo linking not supported for :{name}:\")\n            path = name_map.get(name)\n            ref = \"https://github.com/{issues_github_path}/{path}/{n}\".format(\n                issues_github_path=f\"{username}/{repo}\",\n                path=path,\n                n=issue,\n            )\n            formatted_issue = self.format_text(issue).lstrip(\"#\")\n            text = \"{username}/{repo}{symbol}{formatted_issue}\".format(**locals())\n            link = nodes.reference(text=text, refuri=ref, **options)\n            return link\n\n        if issue_no not in (\"-\", \"0\"):\n            uri_template = getattr(config, self.uri_config_option, None)\n            if uri_template:\n                ref = uri_template.format(**{self.format_kwarg: issue_no})\n            elif config.issues_github_path:\n                ref = self.github_uri_template.format(\n                    issues_github_path=config.issues_github_path, n=issue_no\n                )\n            else:\n                raise ValueError(\n                    f\"Neither {self.uri_config_option} nor issues_github_path is set\"\n                )\n            issue_text = self.format_text(issue_no)\n            link = nodes.reference(text=issue_text, refuri=ref, **options)\n        else:\n            link = None\n        return link\n\n    def __call__(\n        self, name, rawtext, text, lineno, inliner, options=None, content=None\n    ):\n        options = options or {}\n        content = content or []\n   
     issue_nos = [each.strip() for each in utils.unescape(text).split(\",\")]\n        config = inliner.document.settings.env.app.config\n        ret = []\n        for i, issue_no in enumerate(issue_nos):\n            node = self.make_node(name, issue_no, config, options=options)\n            ret.append(node)\n            if i != len(issue_nos) - 1:\n                sep = nodes.raw(text=\", \", format=\"html\")\n                ret.append(sep)\n        return ret, []\n\n\n\"\"\"Sphinx role for linking to an issue. Must have\n`issues_uri` or `issues_github_path` configured in ``conf.py``.\nExamples: ::\n    :issue:`123`\n    :issue:`42,45`\n    :issue:`sloria/konch#123`\n\"\"\"\nissue_role = IssueRole(\n    uri_config_option=\"issues_uri\",\n    format_kwarg=\"issue\",\n    github_uri_template=\"https://github.com/{issues_github_path}/issues/{n}\",\n)\n\n\"\"\"Sphinx role for linking to a pull request. Must have\n`issues_pr_uri` or `issues_github_path` configured in ``conf.py``.\nExamples: ::\n    :pr:`123`\n    :pr:`42,45`\n    :pr:`sloria/konch#43`\n\"\"\"\npr_role = IssueRole(\n    uri_config_option=\"issues_pr_uri\",\n    format_kwarg=\"pr\",\n    github_uri_template=\"https://github.com/{issues_github_path}/pull/{n}\",\n)\n\n\ndef format_commit_text(sha):\n    return sha[:7]\n\n\n\"\"\"Sphinx role for linking to a commit. Must have\n`issues_pr_uri` or `issues_github_path` configured in ``conf.py``.\nExamples: ::\n    :commit:`123abc456def`\n    :commit:`sloria/konch@123abc456def`\n\"\"\"\ncommit_role = IssueRole(\n    uri_config_option=\"issues_commit_uri\",\n    format_kwarg=\"commit\",\n    github_uri_template=\"https://github.com/{issues_github_path}/commit/{n}\",\n    format_text=format_commit_text,\n)\n\n\ndef setup(app):\n    # Format template for issues URI\n    # e.g. 'https://github.com/sloria/marshmallow/issues/{issue}\n    app.add_config_value(\"issues_uri\", default=None, rebuild=\"html\")\n    # Format template for PR URI\n    # e.g. 
'https://github.com/sloria/marshmallow/pull/{pr}'\n    app.add_config_value(\"issues_pr_uri\", default=None, rebuild=\"html\")\n    # Format template for commit URI\n    # e.g. 'https://github.com/sloria/marshmallow/commit/{commit}'\n    app.add_config_value(\"issues_commit_uri\", default=None, rebuild=\"html\")\n    # Shortcut for Github, e.g. 'sloria/marshmallow'\n    app.add_config_value(\"issues_github_path\", default=None, rebuild=\"html\")\n    # Format template for user profile URI\n    # e.g. 'https://github.com/{user}'\n    app.add_config_value(\"issues_user_uri\", default=None, rebuild=\"html\")\n    app.add_role(\"issue\", issue_role)\n    app.add_role(\"pr\", pr_role)\n    app.add_role(\"user\", user_role)\n    app.add_role(\"commit\", commit_role)\n    app.add_role(\"cve\", cve_role)\n    return {\n        \"version\": __version__,\n        \"parallel_read_safe\": True,\n        \"parallel_write_safe\": True,\n    }\n"
  },
  {
    "path": "doc/under_sampling.rst",
    "content": ".. _under-sampling:\n\n==============\nUnder-sampling\n==============\n\n.. currentmodule:: imblearn.under_sampling\n\nOne way of handling imbalanced datasets is to reduce the number of observations from\nall classes but the minority class. The minority class is that with the least number\nof observations. The most well known algorithm in this group is random\nundersampling, where samples from the targeted classes are removed at random.\n\nBut there are many other algorithms to help us reduce the number of observations in the\ndataset. These algorithms can be grouped based on their undersampling strategy into:\n\n- Prototype generation methods.\n- Prototype selection methods.\n\nAnd within the latter, we find:\n\n- Controlled undersampling\n- Cleaning methods\n\nWe will discuss the different algorithms throughout this document.\n\nCheck also\n:ref:`sphx_glr_auto_examples_under-sampling_plot_comparison_under_sampling.py`.\n\n.. _cluster_centroids:\n\nPrototype generation\n====================\n\nGiven an original data set :math:`S`, prototype generation algorithms will\ngenerate a new set :math:`S'` where :math:`|S'| < |S|` and :math:`S' \\not\\subset\nS`. In other words, prototype generation techniques will reduce the number of\nsamples in the targeted classes but the remaining samples are generated --- and\nnot selected --- from the original set.\n\n:class:`ClusterCentroids` makes use of K-means to reduce the number of\nsamples. Therefore, each class will be synthesized with the centroids of the\nK-means method instead of the original samples::\n\n  >>> from collections import Counter\n  >>> from sklearn.datasets import make_classification\n  >>> X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,\n  ...                            n_redundant=0, n_repeated=0, n_classes=3,\n  ...                            n_clusters_per_class=1,\n  ...                            weights=[0.01, 0.05, 0.94],\n  ...                            
class_sep=0.8, random_state=0)\n  >>> print(sorted(Counter(y).items()))\n  [(0, 64), (1, 262), (2, 4674)]\n  >>> from imblearn.under_sampling import ClusterCentroids\n  >>> cc = ClusterCentroids(random_state=0)\n  >>> X_resampled, y_resampled = cc.fit_resample(X, y)\n  >>> print(sorted(Counter(y_resampled).items()))\n  [(0, 64), (1, 64), (2, 64)]\n\nThe figure below illustrates such under-sampling.\n\n.. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_comparison_under_sampling_001.png\n   :target: ./auto_examples/under-sampling/plot_comparison_under_sampling.html\n   :scale: 60\n   :align: center\n\n:class:`ClusterCentroids` offers an efficient way to represent the data clusters\nwith a reduced number of samples. Keep in mind that this method requires that\nyour data are grouped into clusters. In addition, the number of centroids\nshould be set such that the under-sampled clusters are representative of the\noriginal ones.\n\n.. warning::\n\n   :class:`ClusterCentroids` supports sparse matrices. However, the new samples\n   generated are not specifically sparse. Therefore, even if the resulting\n   matrix is sparse, the algorithm will be inefficient in this regard.\n\nPrototype selection\n===================\n\nPrototype selection algorithms will select samples from the original set :math:`S`,\ngenerating a dataset :math:`S'`, where :math:`|S'| < |S|` and :math:`S' \\subset S`. In\nother words, :math:`S'` is a subset of :math:`S`.\n\nPrototype selection algorithms can be divided into two groups: (i) controlled\nunder-sampling techniques and (ii) cleaning under-sampling techniques.\n\nControlled under-sampling methods reduce the number of observations in the majority\nclass or classes to an arbitrary number of samples specified by the user. 
Typically,\nthey reduce the number of observations to the number of samples observed in the\nminority class.\n\nIn contrast, cleaning under-sampling techniques \"clean\" the feature space by removing\neither \"noisy\" or \"too easy to classify\" observations, depending on the method. The\nfinal number of observations in each class varies with the cleaning method and can't be\nspecified by the user.\n\n.. _controlled_under_sampling:\n\nControlled under-sampling techniques\n------------------------------------\n\nControlled under-sampling techniques reduce the number of observations from the\ntargeted classes to a number specified by the user.\n\nRandom under-sampling\n^^^^^^^^^^^^^^^^^^^^^\n\n:class:`RandomUnderSampler` is a fast and easy way to balance the data by\nrandomly selecting a subset of data for the targeted classes::\n\n  >>> from imblearn.under_sampling import RandomUnderSampler\n  >>> rus = RandomUnderSampler(random_state=0)\n  >>> X_resampled, y_resampled = rus.fit_resample(X, y)\n  >>> print(sorted(Counter(y_resampled).items()))\n  [(0, 64), (1, 64), (2, 64)]\n\n.. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_comparison_under_sampling_002.png\n   :target: ./auto_examples/under-sampling/plot_comparison_under_sampling.html\n   :scale: 60\n   :align: center\n\n:class:`RandomUnderSampler` allows bootstrapping the data by setting\n``replacement`` to ``True``. When there are multiple classes, each targeted class is\nunder-sampled independently::\n\n  >>> import numpy as np\n  >>> print(np.vstack([tuple(row) for row in X_resampled]).shape)\n  (192, 2)\n  >>> rus = RandomUnderSampler(random_state=0, replacement=True)\n  >>> X_resampled, y_resampled = rus.fit_resample(X, y)\n  >>> print(np.vstack(np.unique([tuple(row) for row in X_resampled], axis=0)).shape)\n  (181, 2)\n\n:class:`RandomUnderSampler` handles heterogeneous data types, i.e. 
numerical,\ncategorical, dates, etc.::\n\n  >>> X_hetero = np.array([['xxx', 1, 1.0], ['yyy', 2, 2.0], ['zzz', 3, 3.0]],\n  ...                     dtype=object)\n  >>> y_hetero = np.array([0, 0, 1])\n  >>> X_resampled, y_resampled = rus.fit_resample(X_hetero, y_hetero)\n  >>> print(X_resampled)\n  [['xxx' 1 1.0]\n   ['zzz' 3 3.0]]\n  >>> print(y_resampled)\n  [0 1]\n\n:class:`RandomUnderSampler` also supports pandas dataframes as input for\nundersampling::\n\n  >>> from sklearn.datasets import fetch_openml\n  >>> df_adult, y_adult = fetch_openml(\n  ...     'adult', version=2, as_frame=True, return_X_y=True)\n  >>> df_adult.head()  # doctest: +SKIP\n  >>> df_resampled, y_resampled = rus.fit_resample(df_adult, y_adult)\n  >>> df_resampled.head()  # doctest: +SKIP\n\n:class:`NearMiss` adds some heuristic rules to select samples\n:cite:`mani2003knn`. :class:`NearMiss` implements three different heuristics,\nwhich can be selected with the parameter ``version``::\n\n  >>> from imblearn.under_sampling import NearMiss\n  >>> nm1 = NearMiss(version=1)\n  >>> X_resampled_nm1, y_resampled = nm1.fit_resample(X, y)\n  >>> print(sorted(Counter(y_resampled).items()))\n  [(0, 64), (1, 64), (2, 64)]\n\nAs stated in the next section, the :class:`NearMiss` heuristic rules are\nbased on the nearest neighbors algorithm. Therefore, the parameters ``n_neighbors``\nand ``n_neighbors_ver3`` accept classifiers derived from ``KNeighborsMixin``\nfrom scikit-learn. The former parameter is used to compute the average distance\nto the neighbors while the latter is used for the pre-selection of the samples\nof interest.\n\nMathematical formulation\n^^^^^^^^^^^^^^^^^^^^^^^^\n\nLet *positive samples* be the samples belonging to the targeted class to be\nunder-sampled. 
*Negative samples* are the samples from the minority class\n(i.e., the most under-represented class).\n\nNearMiss-1 selects the positive samples for which the average distance\nto the :math:`N` closest samples of the negative class is the smallest.\n\n.. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_illustration_nearmiss_001.png\n   :target: ./auto_examples/under-sampling/plot_illustration_nearmiss.html\n   :scale: 60\n   :align: center\n\nNearMiss-2 selects the positive samples for which the average distance to the\n:math:`N` farthest samples of the negative class is the smallest.\n\n.. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_illustration_nearmiss_002.png\n   :target: ./auto_examples/under-sampling/plot_illustration_nearmiss.html\n   :scale: 60\n   :align: center\n\nNearMiss-3 is a two-step algorithm. First, for each negative sample, its\n:math:`M` nearest neighbors are kept. Then, the positive samples selected\nare the ones for which the average distance to the :math:`N` nearest neighbors\nis the largest.\n\n.. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_illustration_nearmiss_003.png\n   :target: ./auto_examples/under-sampling/plot_illustration_nearmiss.html\n   :scale: 60\n   :align: center\n\nIn the next example, the different :class:`NearMiss` variants are applied to the\nprevious toy example. It can be seen that the decision functions obtained in\neach case are different.\n\nWhen under-sampling a specific class, NearMiss-1 can be altered by the presence\nof noise. Noise implies that samples of the targeted class will be\nselected around the noisy samples, as is the case in the illustration below for\nthe yellow class. However, in the normal case, samples next to the boundaries\nwill be selected. NearMiss-2 will not have this effect since it does not focus\non the nearest samples but rather on the farthest samples. 
We can imagine that\nthe presence of noise can also alter the sampling, mainly in the presence of\nmarginal outliers. NearMiss-3 is probably the version least\naffected by noise, owing to the first-step sample selection.\n\n.. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_comparison_under_sampling_003.png\n   :target: ./auto_examples/under-sampling/plot_comparison_under_sampling.html\n   :scale: 60\n   :align: center\n\nCleaning under-sampling techniques\n----------------------------------\n\nCleaning under-sampling methods \"clean\" the feature space by removing\neither \"noisy\" observations or observations that are \"too easy to classify\", depending\non the method. The final number of observations in each targeted class varies with the\ncleaning method and cannot be specified by the user.\n\n.. _tomek_links:\n\nTomek's links\n^^^^^^^^^^^^^\n\nA Tomek's link exists when two samples from different classes are each other's\nnearest neighbors.\n\nMathematically, a Tomek's link between two samples from different classes :math:`x`\nand :math:`y` is defined such that for any other sample :math:`z`:\n\n.. math::\n\n   d(x, y) < d(x, z) \\text{ and } d(x, y) < d(y, z)\n\nwhere :math:`d(.)` is the distance between the two samples.\n\n:class:`TomekLinks` detects and removes Tomek's links :cite:`tomek1976two`. The\nunderlying idea is that Tomek's links are noisy or hard to classify observations and\nwould not help the algorithm find a suitable discrimination boundary.\n\nIn the following figure, a Tomek's link between an observation of class :math:`+` and\nclass :math:`-` is highlighted in green:\n\n.. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_illustration_tomek_links_001.png\n   :target: ./auto_examples/under-sampling/plot_illustration_tomek_links.html\n   :scale: 60\n   :align: center\n\nWhen :class:`TomekLinks` finds a Tomek's link, it can remove either the sample from the\nmajority class or both samples. 
The parameter ``sampling_strategy`` controls which samples\nfrom the link will be removed. By default (i.e., ``sampling_strategy='auto'``), it will\nremove the sample from the majority class. Both samples, the one from the majority\nclass and the one from the minority class, can be removed by setting ``sampling_strategy`` to\n``'all'``.\n\nThe following figure illustrates this behaviour: on the left, only the sample from the\nmajority class is removed, whereas on the right, the entire Tomek's link is removed.\n\n.. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_illustration_tomek_links_002.png\n   :target: ./auto_examples/under-sampling/plot_illustration_tomek_links.html\n   :scale: 60\n   :align: center\n\n.. _edited_nearest_neighbors:\n\nEditing data using nearest neighbours\n^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\nEdited nearest neighbours\n~~~~~~~~~~~~~~~~~~~~~~~~~\n\nThe edited nearest neighbours methodology uses K-Nearest Neighbours to identify the\nneighbours of the targeted class samples, and then removes observations if any or most\nof their neighbours are from a different class :cite:`wilson1972asymptotic`.\n\n:class:`EditedNearestNeighbours` carries out the following steps:\n\n1. Train a K-Nearest Neighbours model using the entire dataset.\n2. Find each observation's K closest neighbours (only for the targeted classes).\n3. 
Remove observations if any or most of their neighbours belong to a different class.\n\nThe code below illustrates its use::\n\n  >>> sorted(Counter(y).items())\n  [(0, 64), (1, 262), (2, 4674)]\n  >>> from imblearn.under_sampling import EditedNearestNeighbours\n  >>> enn = EditedNearestNeighbours()\n  >>> X_resampled, y_resampled = enn.fit_resample(X, y)\n  >>> print(sorted(Counter(y_resampled).items()))\n  [(0, 64), (1, 213), (2, 4568)]\n\n\nTo paraphrase step 3, :class:`EditedNearestNeighbours` will retain observations from\nthe majority class when **most** or **all** of their neighbours are from the same class.\nTo control this behaviour we set ``kind_sel='mode'`` or ``kind_sel='all'``,\nrespectively. Hence, ``kind_sel='all'`` is less conservative than ``kind_sel='mode'``,\nresulting in the removal of more samples::\n\n  >>> enn = EditedNearestNeighbours(kind_sel=\"all\")\n  >>> X_resampled, y_resampled = enn.fit_resample(X, y)\n  >>> print(sorted(Counter(y_resampled).items()))\n  [(0, 64), (1, 213), (2, 4568)]\n  >>> enn = EditedNearestNeighbours(kind_sel=\"mode\")\n  >>> X_resampled, y_resampled = enn.fit_resample(X, y)\n  >>> print(sorted(Counter(y_resampled).items()))\n  [(0, 64), (1, 234), (2, 4666)]\n\nThe parameter ``n_neighbors`` accepts an integer, which is the number of\nneighbours to examine for each sample. It can also take a classifier subclassed from\n``KNeighborsMixin`` from scikit-learn. When passing a classifier, note that if you\npass a 3-Nearest Neighbors classifier, only 2 neighbours will be examined for the cleaning, as the\nthird sample is the one being examined for undersampling, since it is part of the\nsamples provided at ``fit``.\n\nRepeated Edited Nearest Neighbours\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n:class:`RepeatedEditedNearestNeighbours` extends\n:class:`EditedNearestNeighbours` by repeating the algorithm multiple times\n:cite:`tomek1976experiment`. 
Generally, repeating the algorithm will delete\nmore data::\n\n   >>> from imblearn.under_sampling import RepeatedEditedNearestNeighbours\n   >>> renn = RepeatedEditedNearestNeighbours()\n   >>> X_resampled, y_resampled = renn.fit_resample(X, y)\n   >>> print(sorted(Counter(y_resampled).items()))\n   [(0, 64), (1, 208), (2, 4551)]\n\nThe user can set the maximum number of times the edited nearest neighbours method is\nrepeated through the parameter ``max_iter``.\n\nThe repetitions will stop when:\n\n1. the maximum number of iterations is reached, or\n2. no more observations are removed, or\n3. one of the majority classes becomes a minority class, or\n4. one of the majority classes disappears during the undersampling.\n\nAll KNN\n~~~~~~~\n\n:class:`AllKNN` is a variation of the\n:class:`RepeatedEditedNearestNeighbours` where the number of neighbours evaluated at\neach round of :class:`EditedNearestNeighbours` increases. It starts by editing based on\n1-Nearest Neighbour, and it increases the neighbourhood by 1 at each iteration\n:cite:`tomek1976experiment`::\n\n  >>> from imblearn.under_sampling import AllKNN\n  >>> allknn = AllKNN()\n  >>> X_resampled, y_resampled = allknn.fit_resample(X, y)\n  >>> print(sorted(Counter(y_resampled).items()))\n  [(0, 64), (1, 220), (2, 4601)]\n\n:class:`AllKNN` stops cleaning when the maximum number of neighbours to examine, which\nis set by the user through the parameter ``n_neighbors``, is reached, or when the\nmajority class becomes the minority class.\n\nIn the example below, we see that :class:`EditedNearestNeighbours`,\n:class:`RepeatedEditedNearestNeighbours` and :class:`AllKNN` have a similar impact when\ncleaning \"noisy\" samples at the boundaries between classes.\n\n.. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_comparison_under_sampling_004.png\n   :target: ./auto_examples/under-sampling/plot_comparison_under_sampling.html\n   :scale: 60\n   :align: center\n\n.. 
_condensed_nearest_neighbors:\n\nCondensed nearest neighbors\n^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\n:class:`CondensedNearestNeighbour` uses a 1 nearest neighbor rule to\niteratively decide if a sample should be removed\n:cite:`hart1968condensed`. The algorithm runs as follows:\n\n1. Get all minority samples in a set :math:`C`.\n2. Add a sample from the targeted class (class to be under-sampled) in\n   :math:`C` and all other samples of this class in a set :math:`S`.\n3. Train a 1-Nearest Neighbour on :math:`C`.\n4. Go through the samples in set :math:`S`, sample by sample, and classify each one\n   using a 1 nearest neighbor rule (trained in 3).\n5. If the sample is misclassified, add it to :math:`C`, and go to step 6.\n6. Repeat steps 3 to 5 until all observations in :math:`S` have been examined.\n\nThe final dataset is :math:`C`, containing all observations from the minority class and\nthose from the majority that were misclassified by the successive\n1-Nearest Neighbour algorithms.\n\nThe :class:`CondensedNearestNeighbour` can be used in the following manner::\n\n  >>> from imblearn.under_sampling import CondensedNearestNeighbour\n  >>> cnn = CondensedNearestNeighbour(random_state=0)\n  >>> X_resampled, y_resampled = cnn.fit_resample(X, y)\n  >>> print(sorted(Counter(y_resampled).items()))\n  [(0, 64), (1, 24), (2, 115)]\n\n:class:`CondensedNearestNeighbour` is sensitive to noise and may add noisy samples\n(see figure later on).\n\nOne Sided Selection\n~~~~~~~~~~~~~~~~~~~\n\nIn an attempt to remove the noisy observations introduced by\n:class:`CondensedNearestNeighbour`, :class:`OneSidedSelection`\nwill first find the observations that are hard to classify, and then will use\n:class:`TomekLinks` to remove noisy samples :cite:`hart1968condensed`.\n:class:`OneSidedSelection` runs as follows:\n\n1. Get all minority samples in a set :math:`C`.\n2. 
Add a sample from the targeted class (class to be under-sampled) in\n   :math:`C` and all other samples of this class in a set :math:`S`.\n3. Train a 1-Nearest Neighbors on :math:`C`.\n4. Using a 1 nearest neighbor rule trained in 3, classify all samples in\n   set :math:`S`.\n5. Add all misclassified samples to :math:`C`.\n6. Remove Tomek Links from :math:`C`.\n\nThe final dataset is :math:`C`, containing all observations from the minority class,\nplus the observations from the majority that were added at random, plus all\nthose from the majority that were misclassified by the 1-Nearest Neighbors algorithm.\n\nNote that, unlike :class:`CondensedNearestNeighbour`, :class:`OneSidedSelection`\ndoes not train a K-Nearest Neighbors after each sample is misclassified. It uses the\n1-Nearest Neighbors from step 3 to classify all samples from the majority in 1 pass.\nThe class can be used as::\n\n  >>> from imblearn.under_sampling import OneSidedSelection\n  >>> oss = OneSidedSelection(random_state=0)\n  >>> X_resampled, y_resampled = oss.fit_resample(X, y)\n  >>> print(sorted(Counter(y_resampled).items()))\n  [(0, 64), (1, 174), (2, 4404)]\n\nOur implementation offers the possibility to set the number of observations\nto put at random in the set :math:`C` through the parameter ``n_seeds_S``.\n\n:class:`NeighbourhoodCleaningRule` focuses on cleaning the data rather than\ncondensing them :cite:`laurikkala2001improving`. Therefore, it uses the\nunion of the samples to be rejected by :class:`EditedNearestNeighbours`\nand by the output of a 3 nearest neighbors classifier. The class can be used as::\n\n  >>> from imblearn.under_sampling import NeighbourhoodCleaningRule\n  >>> ncr = NeighbourhoodCleaningRule(n_neighbors=11)\n  >>> X_resampled, y_resampled = ncr.fit_resample(X, y)\n  >>> print(sorted(Counter(y_resampled).items()))\n  [(0, 64), (1, 193), (2, 4535)]\n\n.. 
image:: ./auto_examples/under-sampling/images/sphx_glr_plot_comparison_under_sampling_005.png\n   :target: ./auto_examples/under-sampling/plot_comparison_under_sampling.html\n   :scale: 60\n   :align: center\n\n.. _instance_hardness_threshold:\n\nAdditional undersampling techniques\n-----------------------------------\n\nInstance hardness threshold\n^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\n**Instance Hardness** is a measure of how difficult it is to classify an instance or\nobservation correctly. In other words, hard instances are observations that are hard to\nclassify correctly.\n\nFundamentally, instances that are hard to classify correctly are those for which the\nlearning algorithm or classifier produces a low probability of predicting the correct\nclass label.\n\nIf we removed these hard instances from the dataset, the logic goes, we would help the\nclassifier better identify the different classes :cite:`smith2014instance`.\n\n:class:`InstanceHardnessThreshold` trains a classifier on the data and then removes the\nsamples with lower probabilities :cite:`smith2014instance`. Or in other words, it\nretains the observations with the higher class probabilities.\n\nIn our implementation, :class:`InstanceHardnessThreshold` is (almost) a controlled\nunder-sampling method: it will retain a specific number of observations of the target\nclass(es), which is specified by the user (see caveat below).\n\nThe class can be used as::\n\n  >>> from sklearn.linear_model import LogisticRegression\n  >>> from imblearn.under_sampling import InstanceHardnessThreshold\n  >>> iht = InstanceHardnessThreshold(random_state=0,\n  ...                                 estimator=LogisticRegression())\n  >>> X_resampled, y_resampled = iht.fit_resample(X, y)\n  >>> print(sorted(Counter(y_resampled).items()))\n  [(0, 64), (1, 64), (2, 64)]\n\n:class:`InstanceHardnessThreshold` has 2 important parameters. 
The parameter\n``estimator`` accepts any scikit-learn classifier with a method ``predict_proba``.\nThis classifier will be used to identify the hard instances. The training is performed\nwith cross-validation, which can be specified through the parameter ``cv``.\n\n.. note::\n\n   :class:`InstanceHardnessThreshold` could almost be considered as a\n   controlled under-sampling method. However, due to the probability outputs, it\n   is not always possible to get the specified number of samples.\n\nThe figure below shows examples of instance hardness undersampling on a toy dataset.\n\n.. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_comparison_under_sampling_006.png\n   :target: ./auto_examples/under-sampling/plot_comparison_under_sampling.html\n   :scale: 60\n   :align: center\n"
  },
  {
    "path": "doc/user_guide.rst",
    "content": ".. title:: User guide: contents\n\n.. _user_guide:\n\n==========\nUser Guide\n==========\n\n.. Ensure that the references will be alphabetically collected last\n.. Check https://github.com/mcmtroffaes/sphinxcontrib-bibtex/issues/113\n\n.. toctree::\n   :numbered:\n\n   introduction.rst\n   over_sampling.rst\n   under_sampling.rst\n   combine.rst\n   ensemble.rst\n   miscellaneous.rst\n   metrics.rst\n   model_selection.rst\n   common_pitfalls.rst\n   Dataset loading utilities <datasets/index.rst>\n   developers_utils.rst\n   zzz_references.rst\n"
  },
  {
    "path": "doc/whats_new/v0.1.rst",
    "content": ".. _changes_0_1:\n\nVersion 0.1\n===========\n\n**December 26, 2016**\n\nChangelog\n---------\n\nAPI\n~~~\n\n- First release of the stable API. By :user;`Fernando Nogueira <fmfn>`,\n  :user:`Guillaume Lemaitre <glemaitre>`, :user:`Christos Aridas <chkoar>`,\n  and :user:`Dayvid Oliveira <dvro>`.\n\nNew methods\n~~~~~~~~~~~\n\n* Under-sampling\n    1. Random majority under-sampling with replacement\n    2. Extraction of majority-minority Tomek links\n    3. Under-sampling with Cluster Centroids\n    4. NearMiss-(1 & 2 & 3)\n    5. Condensend Nearest Neighbour\n    6. One-Sided Selection\n    7. Neighboorhood Cleaning Rule\n    8. Edited Nearest Neighbours\n    9. Instance Hardness Threshold\n    10. Repeated Edited Nearest Neighbours\n\n* Over-sampling\n    1. Random minority over-sampling with replacement\n    2. SMOTE - Synthetic Minority Over-sampling Technique\n    3. bSMOTE(1 & 2) - Borderline SMOTE of types 1 and 2\n    4. SVM SMOTE - Support Vectors SMOTE\n    5. ADASYN - Adaptive synthetic sampling approach for imbalanced learning\n\n* Over-sampling followed by under-sampling\n    1. SMOTE + Tomek links\n    2. SMOTE + ENN\n\n* Ensemble sampling\n    1. EasyEnsemble\n    2. BalanceCascade\n"
  },
  {
    "path": "doc/whats_new/v0.10.rst",
    "content": ".. _changes_0_10:\n\nVersion 0.10.1\n==============\n\n**December 28, 2022**\n\nChangelog\n---------\n\nBug fixes\n.........\n\n- Fix a regression in over-sampler where the string `minority` was rejected as\n  an unvalid sampling strategy.\n  :pr:`964` by :user:`Prakhyath Bhandary <Prakhyath07>`.\n\nVersion 0.10.0\n==============\n\n**December 9, 2022**\n\nChangelog\n---------\n\nBug fixes\n.........\n\n- Make sure that :class:`~imblearn.utils._docstring.Substitution` is\n  working with `python -OO` that replace `__doc__` by `None`.\n  :pr:`953` bu :user:`Guillaume Lemaitre <glemaitre>`.\n\nCompatibility\n.............\n\n- Maintenance release for be compatible with scikit-learn >= 1.0.2.\n  :pr:`946`, :pr:`947`, :pr:`949` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Add support for automatic parameters validation as in scikit-learn >= 1.2.\n  :pr:`955` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Add support for `feature_names_in_` as well as `get_feature_names_out` for\n  all samplers.\n  :pr:`959` by :user:`Guillaume Lemaitre <glemaitre>`.\n\nDeprecation\n...........\n\n- The parameter `n_jobs` has been deprecated from the classes\n  :class:`~imblearn.over_sampling.ADASYN`,\n  :class:`~imblearn.over_sampling.BorderlineSMOTE`,\n  :class:`~imblearn.over_sampling.SMOTE`,\n  :class:`~imblearn.over_sampling.SMOTENC`,\n  :class:`~imblearn.over_sampling.SMOTEN`, and\n  :class:`~imblearn.over_sampling.SVMSMOTE`. Instead, pass a nearest neighbors\n  estimator where `n_jobs` is set.\n  :pr:`887` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- The parameter `base_estimator` is deprecated and will be removed in version\n  0.12. 
It impacts the following classes:\n  :class:`~imblearn.ensemble.BalancedBaggingClassifier`,\n  :class:`~imblearn.ensemble.EasyEnsembleClassifier`,\n  :class:`~imblearn.ensemble.RUSBoostClassifier`.\n  :pr:`946` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n\nEnhancements\n............\n\n- Add support for accepting compatible `NearestNeighbors` objects by\n  duck-typing only. For instance, this allows cuML instances to be accepted.\n  :pr:`858` by :user:`NV-jpt <NV-jpt>` and\n  :user:`Guillaume Lemaitre <glemaitre>`.\n"
  },
  {
    "path": "doc/whats_new/v0.11.rst",
    "content": ".. _changes_0_11:\n\nVersion 0.11.0\n==============\n\n**July 8, 2023**\n\nChangelog\n---------\n\nBug fixes\n.........\n\n- Fix a bug in :func:`~imblearn.metrics.classification_report_imbalanced` where the\n  parameter `target_names` was not taken into account when `output_dict=True`.\n  :pr:`989` by :user:`AYY7 <AYY7>`.\n\n- :class:`~imblearn.over_sampling.SMOTENC` now handles mix types of data type such as\n  `bool` and `pd.category` by delegating the conversion to scikit-learn encoder.\n  :pr:`1002` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Handle sparse matrices in :class:`~imblearn.over_sampling.SMOTEN` and raise a warning\n  since it requires a conversion to dense matrices.\n  :pr:`1003` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Remove spurious warning raised when minority class get over-sampled more than the\n  number of sample in the majority class.\n  :pr:`1007` by :user:`Guillaume Lemaitre <glemaitre>`.\n\nCompatibility\n.............\n\n- Maintenance release for being compatible with scikit-learn >= 1.3.0.\n  :pr:`999` by :user:`Guillaume Lemaitre <glemaitre>`.\n\nDeprecation\n...........\n\n- The fitted attribute `ohe_` in :class:`~imblearn.over_sampling.SMOTENC` is deprecated\n  and will be removed in version 0.13. Use `categorical_encoder_` instead.\n  :pr:`1000` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- The default of the parameters `sampling_strategy`, `bootstrap` and\n  `replacement` will change in\n  :class:`~imblearn.ensemble.BalancedRandomForestClassifier` to follow the\n  implementation of the original paper. 
These changes will take effect in\n  version 0.13.\n  :pr:`1006` by :user:`Guillaume Lemaitre <glemaitre>`.\n\nEnhancements\n............\n\n- :class:`~imblearn.over_sampling.SMOTENC` now accepts a parameter `categorical_encoder`\n  that allows specifying a :class:`~sklearn.preprocessing.OneHotEncoder` with custom\n  parameters.\n  :pr:`1000` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- :class:`~imblearn.over_sampling.SMOTEN` now accepts a parameter `categorical_encoder`\n  that allows specifying a :class:`~sklearn.preprocessing.OrdinalEncoder` with custom\n  parameters. A new fitted parameter `categorical_encoder_` is exposed to access the\n  fitted encoder.\n  :pr:`1001` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- :class:`~imblearn.under_sampling.RandomUnderSampler` and\n  :class:`~imblearn.over_sampling.RandomOverSampler` (when `shrinkage is not\n  None`) now accept any data types and will not attempt any data conversion.\n  :pr:`1004` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- :class:`~imblearn.over_sampling.SMOTENC` now supports passing an array-like of `str`\n  as the `categorical_features` parameter.\n  :pr:`1008` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- :class:`~imblearn.over_sampling.SMOTENC` now supports automatic categorical inference\n  when `categorical_features` is set to `\"auto\"`.\n  :pr:`1009` by :user:`Guillaume Lemaitre <glemaitre>`.\n"
  },
  {
    "path": "doc/whats_new/v0.12.rst",
    "content": ".. _changes_0_12:\n\nVersion 0.12.4\n==============\n\n**October 4, 2024**\n\nChangelog\n---------\n\nCompatibility\n.............\n\n- Compatibility with NumPy 2.0+\n  :pr:`1097` by :user:`Guillaume Lemaitre <glemaitre>`.\n\nVersion 0.12.3\n==============\n\n**May 28, 2024**\n\nChangelog\n---------\n\nCompatibility\n.............\n\n- Compatibility with scikit-learn 1.5\n  :pr:`1074` and :pr:`1084` by :user:`Guillaume Lemaitre <glemaitre>`.\n\nVersion 0.12.2\n==============\n\n**March 31, 2024**\n\nChangelog\n---------\n\nBug fixes\n.........\n\n- Fix the way we check for a specific Python version in the test suite.\n  :pr:`1075` by :user:`Guillaume Lemaitre <glemaitre>`.\n\nVersion 0.12.1\n==============\n\n**March 31, 2024**\n\nChangelog\n---------\n\nBug fixes\n.........\n\n- Fix a bug in :class:`~imblearn.under_sampling.InstanceHardnessThreshold` where\n  `estimator` could not be a :class:`~sklearn.pipeline.Pipeline` object.\n  :pr:`1049` by :user:`Gonenc Mogol <gmogol>`.\n\nCompatibility\n.............\n\n- Do not use `distutils` in tests due to deprecation.\n  :pr:`1065` by :user:`Michael R. 
Crusoe <mr-c>`.\n\n- Fix the scikit-learn import in tests to be compatible with version 1.4.1.post1.\n  :pr:`1073` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Fix test to be compatible with Python 3.13.\n  :pr:`1073` by :user:`Guillaume Lemaitre <glemaitre>`.\n\nVersion 0.12.0\n==============\n\n**January 24, 2024**\n\nChangelog\n---------\n\nBug fixes\n.........\n\n- Fix a bug in :class:`~imblearn.over_sampling.SMOTENC` where the entries of the\n  one-hot encoding should be divided by `sqrt(2)` and not `2`, taking into account that\n  they are plugged into an Euclidean distance computation.\n  :pr:`1014` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Raise an informative error message when all support vectors are tagged as noise in\n  :class:`~imblearn.over_sampling.SVMSMOTE`.\n  :pr:`1016` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Fix a bug in :class:`~imblearn.over_sampling.SMOTENC` where the median of standard\n  deviation of the continuous features was only computed on the minority class. 
Now,\n  this statistic is computed for each class that is up-sampled.\n  :pr:`1015` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Fix a bug in :class:`~imblearn.over_sampling.SMOTENC` such that the case where\n  the median of standard deviation of the continuous features is null is handled\n  in the multiclass case as well.\n  :pr:`1015` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Fix a bug in :class:`~imblearn.over_sampling.BorderlineSMOTE` version 2 where samples\n  should be generated from the whole dataset and not only from the minority class.\n  :pr:`1023` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Fix a bug in :class:`~imblearn.under_sampling.NeighbourhoodCleaningRule` where\n  `kind_sel=\"all\"` was not working as explained in the literature.\n  :pr:`1012` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Fix a bug in :class:`~imblearn.under_sampling.NeighbourhoodCleaningRule` where the\n  `threshold_cleaning` ratio was multiplied by the total number of samples instead of\n  the number of samples in the minority class.\n  :pr:`1012` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Fix a bug in :class:`~imblearn.under_sampling.RandomUnderSampler` and\n  :class:`~imblearn.over_sampling.RandomOverSampler` where a column containing only\n  NaT was not handled correctly.\n  :pr:`1059` by :user:`Guillaume Lemaitre <glemaitre>`.\n\nCompatibility\n.............\n\n- :class:`~imblearn.ensemble.BalancedRandomForestClassifier` now supports missing values\n  and monotonic constraints if scikit-learn >= 1.4 is installed.\n\n- :class:`~imblearn.pipeline.Pipeline` supports metadata routing if scikit-learn >= 1.4\n  is installed.\n\n- Compatibility with scikit-learn 1.4.\n  :pr:`1058` by :user:`Guillaume Lemaitre <glemaitre>`.\n\nDeprecations\n............\n\n- Deprecate `estimator_` argument in favor of `estimators_` for the classes\n  :class:`~imblearn.under_sampling.CondensedNearestNeighbour` and\n  :class:`~imblearn.under_sampling.OneSidedSelection`. 
`estimator_` will be removed\n  in 0.14.\n  :pr:`1011` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Deprecate `kind_sel` in :class:`~imblearn.under_sampling.NeighbourhoodCleaningRule`.\n  It will be removed in 0.14. The parameter does not have any effect.\n  :pr:`1012` by :user:`Guillaume Lemaitre <glemaitre>`.\n\nEnhancements\n............\n\n- Allow outputting a dataframe in sparse format if one is provided as input.\n  :pr:`1059` by :user:`ts2095 <ts2095>`.\n"
  },
  {
    "path": "doc/whats_new/v0.13.rst",
    "content": ".. _changes_0_13:\n\nVersion 0.13.0\n==============\n\n**December 20, 2024**\n\nChangelog\n---------\n\nBug fixes\n.........\n\n- Fix `get_metadata_routing` in :class:`~imblearn.pipeline.Pipeline` such that one\n  can use a sampler with metadata routing.\n  :pr:`1115` by :user:`Guillaume Lemaitre <glemaitre>`.\n\nCompatibility\n.............\n\n- Compatibility with scikit-learn 1.6\n  :pr:`1109` by :user:`Guillaume Lemaitre <glemaitre>`.\n\nDeprecations\n............\n\n- :class:`~imblearn.pipeline.Pipeline` now uses\n  :func:`~sklearn.utils.check_is_fitted` instead of\n  :func:`~sklearn.utils.check_fitted` to check if the pipeline is fitted. In 0.15, it\n  will raise an error instead of a warning.\n  :pr:`1109` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- `algorithm` parameter in :class:`~imblearn.ensemble.RUSBoostClassifier` is now\n  deprecated and will be removed in 0.14.\n  :pr:`1109` by :user:`Guillaume Lemaitre <glemaitre>`.\n"
  },
  {
    "path": "doc/whats_new/v0.14.rst",
    "content": ".. _changes_0_14:\n\nVersion 0.14.1\n==============\n\n**December 21, 2025**\n\nChangelog\n---------\n\nMaintenance\n...........\n\n- Compatibility with scikit-learn 1.8\n  :pr:`1158` by :user:`Guillaume Lemaitre <glemaitre>` and\n  :user:`stratakis <stratakis>`.\n\nVersion 0.14.0\n==============\n\n**August 14, 2025**\n\nChangelog\n---------\n\nBug fixes\n.........\n\nEnhancements\n............\n\n- Add :class:`~imblearn.model_selection.InstanceHardnessCV` to split data and ensure\n  that samples are distributed in folds based on their instance hardness.\n  :pr:`1125` by :user:`Frits Hermans <fritshermans>`.\n\nCompatibility\n.............\n\n- Compatibility with scikit-learn 1.7\n  :pr:`1137`, :pr:`1145`, :pr:`1146` by :user:`Guillaume Lemaitre <glemaitre>`.\n"
  },
  {
    "path": "doc/whats_new/v0.15.rst",
    "content": ".. _changes_0_15:\n\nVersion 0.15.dev0 (In development)\n==================================\n\n**TBD**\n\nChangelog\n---------\n\nBug fixes\n.........\n\nEnhancements\n............\n\nCompatibility\n.............\n\nDeprecations\n............\n"
  },
  {
    "path": "doc/whats_new/v0.2.rst",
    "content": ".. _changes_0_2:\n\nVersion 0.2\n===========\n\n**January 1, 2017**\n\nChangelog\n---------\n\nBug fixes\n~~~~~~~~~\n\n- Fixed a bug in :class:`under_sampling.NearMiss` which was not picking the\n  right samples during under sampling for the method 3. By :user:`Guillaume\n  Lemaitre <glemaitre>`.\n\n- Fixed a bug in :class:`ensemble.EasyEnsemble`, correction of the\n  `random_state` generation. By :user:`Guillaume Lemaitre <glemaitre>` and\n  :user:`Christos Aridas <chkoar>`.\n\n- Fixed a bug in :class:`under_sampling.RepeatedEditedNearestNeighbours`, add\n  additional stopping criterion to avoid that the minority class become a\n  majority class or that a class disappear. By :user:`Guillaume Lemaitre\n  <glemaitre>`.\n\n- Fixed a bug in :class:`under_sampling.AllKNN`, add stopping criteria to avoid\n  that the minority class become a majority class or that a class disappear. By\n  :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Fixed a bug in :class:`under_sampling.CondensedNeareastNeigbour`, correction\n  of the list of indices returned. By :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Fixed a bug in :class:`ensemble.BalanceCascade`, solve the issue to obtain a\n  single array if desired. By :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Fixed a bug in :class:`pipeline.Pipeline`, solve to embed `Pipeline` in other\n  `Pipeline`. :issue:`231` by :user:`Christos Aridas <chkoar>`.\n\n- Fixed a bug in :class:`pipeline.Pipeline`, solve the issue to put to sampler\n  in the same `Pipeline`. :issue:`188` by :user:`Christos Aridas <chkoar>`.\n\n- Fixed a bug in :class:`under_sampling.CondensedNeareastNeigbour`, correction\n  of the shape of `sel_x` when only one sample is selected. By\n  :user:`Aliaksei Halachkin <honeyext>`.\n\n- Fixed a bug in :class:`under_sampling.NeighbourhoodCleaningRule`, selecting\n  neighbours instead of minority class misclassified samples. 
:issue:`230` by\n  :user:`Aleksandr Loskutov <loskutyan>`.\n\n- Fixed a bug in :class:`over_sampling.ADASYN`, correction of the creation of a\n  new sample so that the new sample lies between the minority sample and the\n  nearest neighbour. :issue:`235` by :user:`Rafael Wampfler <Eichnof>`.\n\nNew features\n~~~~~~~~~~~~\n\n- Added AllKNN under sampling technique. By :user:`Dayvid Oliveira <dvro>`.\n\n- Added a module `metrics` implementing some specific scoring function for the\n  problem of balancing. :issue:`204` by :user:`Guillaume Lemaitre <glemaitre>`\n  and :user:`Christos Aridas <chkoar>`.\n\nEnhancement\n~~~~~~~~~~~\n\n- Added support for bumpversion. By :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Validate the type of target in binary samplers. A warning is raised for the\n  moment. By :user:`Guillaume Lemaitre <glemaitre>` and :user:`Christos Aridas\n  <chkoar>`.\n\n- Change from `cross_validation` module to `model_selection` module for\n  `sklearn` deprecation cycle. By :user:`Dayvid Oliveira <dvro>` and\n  :user:`Christos Aridas <chkoar>`.\n\nAPI changes summary\n~~~~~~~~~~~~~~~~~~~\n\n- `size_ngh` has been deprecated in :class:`combine.SMOTEENN`. Use\n  `n_neighbors` instead. By :user:`Guillaume Lemaitre <glemaitre>`,\n  :user:`Christos Aridas <chkoar>`, and :user:`Dayvid Oliveira <dvro>`.\n\n- `size_ngh` has been deprecated in\n  :class:`under_sampling.EditedNearestNeighbors`. Use `n_neighbors` instead. By\n  :user:`Guillaume Lemaitre <glemaitre>`, :user:`Christos Aridas <chkoar>`,\n  and :user:`Dayvid Oliveira <dvro>`.\n\n- `size_ngh` has been deprecated in\n  :class:`under_sampling.CondensedNeareastNeigbour`. Use `n_neighbors`\n  instead. By :user:`Guillaume Lemaitre <glemaitre>`,\n  :user:`Christos Aridas <chkoar>`, and\n  :user:`Dayvid Oliveira <dvro>`.\n\n- `size_ngh` has been deprecated in\n  :class:`under_sampling.OneSidedSelection`. Use `n_neighbors` instead. 
By\n  :user:`Guillaume Lemaitre <glemaitre>`, :user:`Christos Aridas <chkoar>`,\n  and :user:`Dayvid Oliveira <dvro>`.\n\n- `size_ngh` has been deprecated in\n  :class:`under_sampling.NeighbourhoodCleaningRule`. Use `n_neighbors`\n  instead. By :user:`Guillaume Lemaitre <glemaitre>`,\n  :user:`Christos Aridas <chkoar>`, and\n  :user:`Dayvid Oliveira <dvro>`.\n\n- `size_ngh` has been deprecated in\n  :class:`under_sampling.RepeatedEditedNearestNeighbours`. Use `n_neighbors`\n  instead. By :user:`Guillaume Lemaitre <glemaitre>`,\n  :user:`Christos Aridas <chkoar>`, and\n  :user:`Dayvid Oliveira <dvro>`.\n\n- `size_ngh` has been deprecated in :class:`under_sampling.AllKNN`. Use\n  `n_neighbors` instead. By :user:`Guillaume Lemaitre <glemaitre>`,\n  :user:`Christos Aridas <chkoar>`, and :user:`Dayvid Oliveira <dvro>`.\n\n- Two base classes :class:`BaseBinaryclassSampler` and\n  :class:`BaseMulticlassSampler` have been created to handle the target type\n  and raise warning in case of abnormality.\n  By :user:`Guillaume Lemaitre <glemaitre>` and :user:`Christos Aridas <chkoar>`.\n\n- Move `random_state` to be assigned in the :class:`SamplerMixin`\n  initialization. By :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Provide estimators instead of parameters in :class:`combine.SMOTEENN` and\n  :class:`combine.SMOTETomek`. Therefore, the list of parameters have been\n  deprecated. By :user:`Guillaume Lemaitre <glemaitre>` and\n  :user:`Christos Aridas <chkoar>`.\n\n- `k` has been deprecated in :class:`over_sampling.ADASYN`. Use `n_neighbors`\n  instead. :issue:`183` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- `k` and `m` have been deprecated in :class:`over_sampling.SMOTE`. Use\n  `k_neighbors` and `m_neighbors` instead. 
:issue:`182` by :user:`Guillaume\n  Lemaitre <glemaitre>`.\n\n- `n_neighbors` accepts a `KNeighborsMixin`-based object for\n  :class:`under_sampling.EditedNearestNeighbors`,\n  :class:`under_sampling.CondensedNeareastNeigbour`,\n  :class:`under_sampling.NeighbourhoodCleaningRule`,\n  :class:`under_sampling.RepeatedEditedNearestNeighbours`, and\n  :class:`under_sampling.AllKNN`. :issue:`109` by :user:`Guillaume Lemaitre\n  <glemaitre>`.\n\nDocumentation changes\n~~~~~~~~~~~~~~~~~~~~~\n\n- Replace some remaining `UnbalancedDataset` occurrences.\n  By :user:`Francois Magimel <Linkid>`.\n\n- Added doctest in the documentation. By :user:`Guillaume Lemaitre\n  <glemaitre>`.\n"
  },
  {
    "path": "doc/whats_new/v0.3.rst",
    "content": ".. _changes_0_3:\n\nVersion 0.3\n===========\n\n**February 22, 2018**\n\nChangelog\n---------\n\nTesting\n~~~~~~~\n- Pytest is used instead of nosetests. :issue:`321` by :user:`Joan Massich\n  <massich>`.\n\nDocumentation\n~~~~~~~~~~~~~\n\n- Added a User Guide and extended some examples. :issue:`295` by\n  :user:`Guillaume Lemaitre <glemaitre>`.\n\nBug fixes\n~~~~~~~~~\n\n- Fixed a bug in :func:`utils.check_ratio` such that an error is raised when\n  the number of samples required is negative. :issue:`312` by :user:`Guillaume\n  Lemaitre <glemaitre>`.\n\n- Fixed a bug in :class:`under_sampling.NearMiss` version 3. The indices\n  returned were wrong. :issue:`312` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Fixed bug for :class:`ensemble.BalanceCascade` and :class:`combine.SMOTEENN`\n  and :class:`SMOTETomek`. :issue:`295` by :user:`Guillaume Lemaitre\n  <glemaitre>`.\n\n- Fixed bug for `check_ratio` to be able to pass arguments when `ratio` is a\n  callable. :issue:`307` by :user:`Guillaume Lemaitre <glemaitre>`.\n\nNew features\n~~~~~~~~~~~~\n\n- Turn off steps in :class:`pipeline.Pipeline` using the `None`\n  object. By :user:`Christos Aridas <chkoar>`.\n\n- Add a fetching function :func:`datasets.fetch_datasets` in order to get some\n  imbalanced datasets useful for benchmarking. :issue:`249` by :user:`Guillaume\n  Lemaitre <glemaitre>`.\n\nEnhancement\n~~~~~~~~~~~\n\n- All samplers accepts sparse matrices with defaulting on CSR\n  type. :issue:`316` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- :func:`datasets.make_imbalance` take a ratio similarly to other samplers. It\n  supports multiclass. :issue:`312` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- All the unit tests have been factorized and a :func:`utils.check_estimators`\n  has been derived from scikit-learn. By :user:`Guillaume Lemaitre\n  <glemaitre>`.\n\n- Script for automatic build of conda packages and uploading. 
:issue:`242` by\n  :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Remove seaborn dependency and improve the examples. :issue:`264` by\n  :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Adapt all classes to multi-class resampling. :issue:`290` by :user:`Guillaume\n  Lemaitre <glemaitre>`.\n\nAPI changes summary\n~~~~~~~~~~~~~~~~~~~\n\n- `__init__` has been removed from the :class:`base.SamplerMixin` to create a\n  real mixin class. :issue:`242` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Creation of a module :mod:`exceptions` to handle consistent raising of\n  errors. :issue:`242` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Creation of a module ``utils.validation`` to check recurrent\n  patterns. :issue:`242` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Move the under-sampling methods into the ``prototype_selection`` and\n  ``prototype_generation`` submodules to make a clearer\n  distinction. :issue:`277` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Change ``ratio`` such that it can adapt to multiple class\n  problems. :issue:`290` by :user:`Guillaume Lemaitre <glemaitre>`.\n\nDeprecation\n~~~~~~~~~~~\n\n- Deprecation of the use of ``min_c_`` in\n  :func:`datasets.make_imbalance`. :issue:`312` by :user:`Guillaume Lemaitre\n  <glemaitre>`.\n\n- Deprecation of the use of float in :func:`datasets.make_imbalance` for the\n  ratio parameter. :issue:`290` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Deprecate the use of float as ratio in favor of dictionary, string, or\n  callable. :issue:`290` by :user:`Guillaume Lemaitre <glemaitre>`.\n"
  },
  {
    "path": "doc/whats_new/v0.4.rst",
    "content": ".. _changes_0_4:\n\nVersion 0.4.2\n=============\n\n**October 21, 2018**\n\nChangelog\n---------\n\nBug fixes\n.........\n\n- Fix a bug in :class:`imblearn.over_sampling.SMOTENC` in which the median\n  of the standard deviation was used instead of half of the median of the\n  standard deviation.\n  By :user:`Guillaume Lemaitre <glemaitre>` in :issue:`491`.\n\n- Raise an error when passing a target which is not supported, i.e. regression\n  target or multilabel targets. Imbalanced-learn does not support this case.\n  By :user:`Guillaume Lemaitre <glemaitre>` in :issue:`490`.\n\n- Fix a bug in :class:`imblearn.over_sampling.SMOTENC` in which sparse\n  matrices were densified during ``inverse_transform``.\n  By :user:`Guillaume Lemaitre <glemaitre>` in :issue:`495`.\n\n- Fix a bug in :class:`imblearn.over_sampling.SMOTENC` in which the tie\n  breaking was sampled incorrectly.\n  By :user:`Guillaume Lemaitre <glemaitre>` in :issue:`497`.\n\nVersion 0.4\n===========\n\n**October 12, 2018**\n\n.. warning::\n\n    Version 0.4 is the last version of imbalanced-learn to support Python 2.7\n    and Python 3.4. Imbalanced-learn 0.5 will require Python 3.5 or higher.\n\nHighlights\n----------\n\nThis release brings a set of new features as well as some API changes to\nstrengthen the foundation of imbalanced-learn.\n\nAs new features, two new modules :mod:`imblearn.keras` and\n:mod:`imblearn.tensorflow` have been added in which imbalanced-learn samplers\ncan be used to generate balanced mini-batches.\n\nThe module :mod:`imblearn.ensemble` has been consolidated with new classifiers:\n:class:`imblearn.ensemble.BalancedRandomForestClassifier`,\n:class:`imblearn.ensemble.EasyEnsembleClassifier`,\n:class:`imblearn.ensemble.RUSBoostClassifier`.\n\nSupport for strings has been added in\n:class:`imblearn.over_sampling.RandomOverSampler` and\n:class:`imblearn.under_sampling.RandomUnderSampler`. 
In addition, a new class\n:class:`imblearn.over_sampling.SMOTENC` allows to generate sample with data\nsets containing both continuous and categorical features.\n\nThe :class:`imblearn.over_sampling.SMOTE` has been simplified and break down\nto 2 additional classes:\n:class:`imblearn.over_sampling.SVMSMOTE` and\n:class:`imblearn.over_sampling.BorderlineSMOTE`.\n\nThere is also some changes regarding the API:\nthe parameter ``sampling_strategy`` has been introduced to replace the\n``ratio`` parameter. In addition, the ``return_indices`` argument has been\ndeprecated and all samplers will exposed a ``sample_indices_`` whenever this is\npossible.\n\nChangelog\n---------\n\nAPI\n...\n\n- Replace the parameter ``ratio`` by ``sampling_strategy``. :issue:`411` by\n  :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Enable to use a ``float`` with binary classification for\n  ``sampling_strategy``. :issue:`411` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Enable to use a ``list`` for the cleaning methods to specify the class to\n  sample. :issue:`411` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Replace ``fit_sample`` by ``fit_resample``. An alias is still available for\n  backward compatibility. 
In addition, ``sample`` has been removed to avoid\n  resampling on different set of data.\n  :issue:`462` by :user:`Guillaume Lemaitre <glemaitre>`.\n\nNew features\n............\n\n- Add a :mod:`keras` and :mod:`tensorflow` modules to create balanced\n  mini-batches generator.\n  :issue:`409` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Add :class:`imblearn.ensemble.EasyEnsembleClassifier` which create a bag of\n  AdaBoost classifier trained on balanced bootstrap samples.\n  :issue:`455` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Add :class:`imblearn.ensemble.BalancedRandomForestClassifier` which balanced\n  each bootstrap provided to each tree of the forest.\n  :issue:`459` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Add :class:`imblearn.ensemble.RUSBoostClassifier` which applied a random\n  under-sampling stage before each boosting iteration of AdaBoost.\n  :issue:`469` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Add :class:`imblern.over_sampling.SMOTENC` which generate synthetic samples\n  on data set with heterogeneous data type (continuous and categorical\n  features).\n  :issue:`412` by :user:`Denis Dudnik <ddudnik>` and\n  :user:`Guillaume Lemaitre <glemaitre>`.\n\nEnhancement\n...........\n\n- Add a documentation node to create a balanced random forest from a balanced\n  bagging classifier. :issue:`372` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Document the metrics to evaluate models on imbalanced dataset. :issue:`367`\n  by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Add support for one-vs-all encoded target to support keras. 
:issue:`409` by\n  :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Adding specific class for borderline and SVM SMOTE using\n  :class:`BorderlineSMOTE` and :class:`SVMSMOTE`.\n  :issue:`440` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Allow :class:`imblearn.over_sampling.RandomOverSampler` can return indices\n  using the attributes ``return_indices``.\n  :issue:`439` by :user:`Hugo Gascon<hgascon>` and\n  :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Allow :class:`imblearn.under_sampling.RandomUnderSampler` and\n  :class:`imblearn.over_sampling.RandomOverSampler` to sample object array\n  containing strings.\n  :issue:`451` by :user:`Guillaume Lemaitre <glemaitre>`.\n\nBug fixes\n.........\n\n- Fix bug in :func:`metrics.classification_report_imbalanced` for which\n  `y_pred` and `y_true` where inversed. :issue:`394` by :user:`Ole Silvig\n  <klizter>.`\n\n- Fix bug in ADASYN to consider only samples from the current class when\n  generating new samples. :issue:`354` by :user:`Guillaume Lemaitre\n  <glemaitre>`.\n\n- Fix bug which allow for sorted behavior of ``sampling_strategy`` dictionary\n  and thus to obtain a deterministic results when using the same random state.\n  :issue:`447` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Force to clone scikit-learn estimator passed as attributes to samplers.\n  :issue:`446` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Fix bug which was not preserving the dtype of X and y when generating\n  samples.\n  :issue:`450` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Add the option to pass a ``Memory`` object to :func:`make_pipeline` like\n  in :class:`pipeline.Pipeline` class.\n  :issue:`458` by :user:`Christos Aridas <chkoar>`.\n\nMaintenance\n...........\n\n- Remove deprecated parameters in 0.2 - :issue:`331` by :user:`Guillaume\n  Lemaitre <glemaitre>`.\n\n- Make some modules private.\n  :issue:`452` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Upgrade requirements to scikit-learn 0.20.\n  :issue:`379` by 
:user:`Guillaume Lemaitre <glemaitre>`.\n\n- Catch deprecation warning in testing.\n  :issue:`441` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Refactor and impose `pytest` style tests.\n  :issue:`470` by :user:`Guillaume Lemaitre <glemaitre>`.\n\nDocumentation\n.............\n\n- Remove some docstring which are not necessary.\n  :issue:`454` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Fix the documentation of the ``sampling_strategy`` parameters when used as a\n  float.\n  :issue:`480` by :user:`Guillaume Lemaitre <glemaitre>`.\n\nDeprecation\n...........\n\n- Deprecate ``ratio`` in favor of ``sampling_strategy``. :issue:`411` by\n  :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Deprecate the use of a ``dict`` for cleaning methods. a ``list`` should be\n  used. :issue:`411` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Deprecate ``random_state`` in :class:`imblearn.under_sampling.NearMiss`,\n  :class:`imblearn.under_sampling.EditedNearestNeighbors`,\n  :class:`imblearn.under_sampling.RepeatedEditedNearestNeighbors`,\n  :class:`imblearn.under_sampling.AllKNN`,\n  :class:`imblearn.under_sampling.NeighbourhoodCleaningRule`,\n  :class:`imblearn.under_sampling.InstanceHardnessThreshold`,\n  :class:`imblearn.under_sampling.CondensedNearestNeighbours`.\n\n- Deprecate ``kind``, ``out_step``, ``svm_estimator``, ``m_neighbors`` in\n  :class:`imblearn.over_sampling.SMOTE`. User should use\n  :class:`imblearn.over_sampling.SVMSMOTE` and\n  :class:`imblearn.over_sampling.BorderlineSMOTE`.\n  :issue:`440` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Deprecate :class:`imblearn.ensemble.EasyEnsemble` in favor of meta-estimator\n  :class:`imblearn.ensemble.EasyEnsembleClassifier` which follow the exact\n  algorithm described in the literature.\n  :issue:`455` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Deprecate :class:`imblearn.ensemble.BalanceCascade`.\n  :issue:`472` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Deprecate ``return_indices`` in all samplers. 
Instead, an attribute\n  ``sample_indices_`` is created whenever the sampler is selecting a subset of\n  the original samples.\n  :issue:`474` by :user:`Guillaume Lemaitre <glemaitre>`.\n"
  },
  {
    "path": "doc/whats_new/v0.5.rst",
    "content": ".. _changes_0_5:\n\nVersion 0.5.0\n=============\n\n**June 28, 2019**\n\nChangelog\n---------\n\nChanged models\n..............\n\nThe following models or function might give different results even if the\nsame data ``X`` and ``y`` are the same.\n\n* :class:`imblearn.ensemble.RUSBoostClassifier` default estimator changed from\n  :class:`sklearn.tree.DecisionTreeClassifier` with full depth to a decision\n  stump (i.e., tree with ``max_depth=1``).\n\nDocumentation\n.............\n\n- Correct the definition of the ratio when using a ``float`` in sampling\n  strategy for the over-sampling and under-sampling.\n  :issue:`525` by :user:`Ariel Rossanigo <arielrossanigo>`.\n\n- Add :class:`imblearn.over_sampling.BorderlineSMOTE` and\n  :class:`imblearn.over_sampling.SVMSMOTE` in the API documenation.\n  :issue:`530` by :user:`Guillaume Lemaitre <glemaitre>`.\n\nEnhancement\n...........\n\n- Add Parallelisation for SMOTEENN and SMOTETomek.\n  :pr:`547` by :user:`Michael Hsieh <Microsheep>`.\n\n- Add :class:`imblearn.utils._show_versions`. Updated the contribution guide\n  and issue template showing how to print system and dependency information\n  from the command line. :pr:`557` by :user:`Alexander L. Hayes <batflyer>`.\n\n- Add :class:`imblearn.over_sampling.KMeansSMOTE` which is an over-sampler\n  clustering points before to apply SMOTE.\n  :pr:`435` by :user:`Stephan Heijl <StephanHeijl>`.\n\nMaintenance\n...........\n\n- Make it possible to ``import imblearn`` and access submodule.\n  :pr:`500` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Remove support for Python 2, remove deprecation warning from\n  scikit-learn 0.21.\n  :pr:`576` by :user:`Guillaume Lemaitre <glemaitre>`.\n\nBug\n...\n\n- Fix wrong usage of :class:`keras.layers.BatchNormalization` in\n  ``porto_seguro_keras_under_sampling.py`` example. 
The batch normalization\n  was moved before the activation function and the bias was removed from the\n  dense layer.\n  :pr:`531` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Fix a bug that converted sparse matrices to COO format when stacking them in\n  :class:`imblearn.over_sampling.SMOTENC`. This bug only occurred with old\n  SciPy versions.\n  :pr:`539` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Fix bug in :class:`imblearn.pipeline.Pipeline` where None could be the final\n  estimator.\n  :pr:`554` by :user:`Oliver Rausch <orausch>`.\n\n- Fix bug in :class:`imblearn.over_sampling.SVMSMOTE` and\n  :class:`imblearn.over_sampling.BorderlineSMOTE` where the default parameter\n  of ``n_neighbors`` was not set properly.\n  :pr:`578` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Fix bug by changing the default depth in\n  :class:`imblearn.ensemble.RUSBoostClassifier` to get a decision stump as a\n  weak learner as in the original paper.\n  :pr:`545` by :user:`Christos Aridas <chkoar>`.\n\n- Allow importing ``keras`` directly from ``tensorflow`` in\n  :mod:`imblearn.keras`.\n  :pr:`531` by :user:`Guillaume Lemaitre <glemaitre>`.\n"
  },
  {
    "path": "doc/whats_new/v0.6.rst",
    "content": ".. _changes_0_6_2:\n\nVersion 0.6.2\n==============\n\n**February 16, 2020**\n\nThis is a bug-fix release to resolve some issues regarding the handling the\ninput and the output format of the arrays.\n\nChangelog\n---------\n\n- Allow column vectors to be passed as targets.\n  :pr:`673` by :user:`Christos Aridas <chkoar>`.\n\n- Better input/output handling for pandas, numpy and plain lists.\n  :pr:`681` by :user:`Christos Aridas <chkoar>`.\n\n.. _changes_0_6_1:\n\nVersion 0.6.1\n==============\n\n**December 7, 2019**\n\nThis is a bug-fix release to primarily resolve some packaging issues in version\n0.6.0. It also includes minor documentation improvements and some bug fixes.\n\nChangelog\n---------\n\nBug fixes\n.........\n\n- Fix a bug in :class:`imblearn.ensemble.BalancedRandomForestClassifier`\n  leading to a wrong number of samples used during fitting due `max_samples`\n  and therefore a bad computation of the OOB score.\n  :pr:`656` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n.. _changes_0_6:\n\nVersion 0.6.0\n=============\n\n**December 5, 2019**\n\nChangelog\n---------\n\nChanged models\n..............\n\nThe following models might give some different sampling due to changes in\nscikit-learn:\n\n- :class:`imblearn.under_sampling.ClusterCentroids`\n- :class:`imblearn.under_sampling.InstanceHardnessThreshold`\n\nThe following samplers will give different results due to change linked to\nthe random state internal usage:\n\n- :class:`imblearn.over_sampling.ADASYN`\n- :class:`imblearn.over_sampling.SMOTENC`\n\nBug fixes\n.........\n\n- :class:`imblearn.under_sampling.InstanceHardnessThreshold` now take into\n  account the `random_state` and will give deterministic results. 
In addition,\n  `cross_val_predict` is used to take advantage of the parallelism.\n  :pr:`599` by :user:`Shihab Shahriar Khan <Shihab-Shahriar>`.\n\n- Fix a bug in :class:`imblearn.ensemble.BalancedRandomForestClassifier`\n  leading to a wrong computation of the OOB score.\n  :pr:`656` by :user:`Guillaume Lemaitre <glemaitre>`.\n\nMaintenance\n...........\n\n- Update imports from scikit-learn after some modules have been made private.\n  The following imports have been changed:\n  :class:`sklearn.ensemble._base._set_random_states`,\n  :class:`sklearn.ensemble._forest._parallel_build_trees`,\n  :class:`sklearn.metrics._classification._check_targets`,\n  :class:`sklearn.metrics._classification._prf_divide`,\n  :class:`sklearn.utils.Bunch`,\n  :class:`sklearn.utils._safe_indexing`,\n  :class:`sklearn.utils._testing.assert_allclose`,\n  :class:`sklearn.utils._testing.assert_array_equal`,\n  :class:`sklearn.utils._testing.SkipTest`.\n  :pr:`617` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Synchronize :mod:`imblearn.pipeline` with :mod:`sklearn.pipeline`.\n  :pr:`620` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Synchronize :class:`imblearn.ensemble.BalancedRandomForestClassifier` and add\n  parameters `max_samples` and `ccp_alpha`.\n  :pr:`621` by :user:`Guillaume Lemaitre <glemaitre>`.\n\nEnhancement\n...........\n\n- :class:`imblearn.under_sampling.RandomUnderSampler`,\n  :class:`imblearn.over_sampling.RandomOverSampler`, and\n  :func:`imblearn.datasets.make_imbalance` accept a Pandas DataFrame as input\n  and will output a Pandas DataFrame. 
Similarly, they accept a Pandas Series as input and\n  will output a Pandas Series.\n  :pr:`636` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- :class:`imblearn.FunctionSampler` accepts a parameter ``validate`` controlling\n  whether the inputs ``X`` and ``y`` are checked.\n  :pr:`637` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- :class:`imblearn.under_sampling.RandomUnderSampler` and\n  :class:`imblearn.over_sampling.RandomOverSampler` can resample when\n  non-finite values are present in ``X``.\n  :pr:`643` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- All samplers will output a Pandas DataFrame if a Pandas DataFrame was given\n  as an input.\n  :pr:`644` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- The samples generation in\n  :class:`imblearn.over_sampling.ADASYN`,\n  :class:`imblearn.over_sampling.SMOTE`,\n  :class:`imblearn.over_sampling.BorderlineSMOTE`,\n  :class:`imblearn.over_sampling.SVMSMOTE`,\n  :class:`imblearn.over_sampling.KMeansSMOTE`,\n  :class:`imblearn.over_sampling.SMOTENC` is now vectorized, giving\n  an additional speed-up when `X` is sparse.\n  :pr:`596` and :pr:`649` by :user:`Matt Eding <MattEding>`.\n\nDeprecation\n...........\n\n- The following classes have been removed after 2 deprecation cycles:\n  `ensemble.BalanceCascade` and `ensemble.EasyEnsemble`.\n  :pr:`617` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- The following functions have been removed after 2 deprecation cycles:\n  `utils.check_ratio`.\n  :pr:`617` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- The parameters `ratio` and `return_indices` have been removed from all\n  samplers.\n  :pr:`617` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- The parameters `m_neighbors`, `out_step`, `kind`, `svm_estimator`\n  have been removed from the :class:`imblearn.over_sampling.SMOTE`.\n  :pr:`617` by :user:`Guillaume Lemaitre <glemaitre>`.\n"
  },
  {
    "path": "doc/whats_new/v0.7.rst",
    "content": ".. _changes_0_7:\n\nVersion 0.7.0\n=============\n\n**June 9, 2020**\n\nChangelog\n---------\n\nMaintenance\n...........\n\n- Ensure that :class:`imblearn.pipeline.Pipeline` is working when `memory`\n  is activated and `joblib==0.11`.\n  :pr:`687` by :user:`Christos Aridas <chkoar>`.\n\n- Refactor common test to use the dev tools from `scikit-learn` 0.23.\n  :pr:`710` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Remove `FutureWarning` issued by `scikit-learn` 0.23.\n  :pr:`710` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Impose keywords only argument as in `scikit-learn`.\n  :pr:`721` by :user:`Guillaume Lemaitre <glemaitre>`.\n\nChanged models\n..............\n\nThe following models might give some different results due to changes:\n\n- :class:`imblearn.ensemble.BalancedRandomForestClassifier`\n\nBug fixes\n.........\n\n- Change the default value `min_samples_leaf` to be consistent with\n  scikit-learn.\n  :pr:`711` by :user:`zerolfx <zerolfx>`.\n\n- Fix a bug due to change in `scikit-learn` 0.23 in\n  :class:`imblearn.metrics.make_index_balanced_accuracy`. 
The function was\n  unusable.\n  :pr:`710` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Raise a proper error message when only numerical or categorical features\n  are given in :class:`imblearn.over_sampling.SMOTENC`.\n  :pr:`720` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Fix a bug when the median of the standard deviation is zero in\n  :class:`imblearn.over_sampling.SMOTENC`.\n  :pr:`675` by :user:`bganglia <bganglia>`.\n\nEnhancements\n............\n\n- The classifiers implemented in imbalanced-learn,\n  :class:`imblearn.ensemble.BalancedBaggingClassifier`,\n  :class:`imblearn.ensemble.BalancedRandomForestClassifier`,\n  :class:`imblearn.ensemble.EasyEnsembleClassifier`, and\n  :class:`imblearn.ensemble.RUSBoostClassifier`, accept `sampling_strategy`\n  with the same keys as in `y` without the need to encode `y` in advance.\n  :pr:`718` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Lazy import `keras` module when importing `imblearn.keras`.\n  :pr:`719` by :user:`Guillaume Lemaitre <glemaitre>`.\n\nDeprecation\n...........\n\n- Deprecation of the parameter `n_jobs` in\n  :class:`imblearn.under_sampling.ClusterCentroids` since it was used by\n  :class:`sklearn.cluster.KMeans` which deprecated it.\n  :pr:`710` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Deprecation of passing keyword arguments by position similarly to\n  `scikit-learn`.\n  :pr:`721` by :user:`Guillaume Lemaitre <glemaitre>`.\n"
  },
  {
    "path": "doc/whats_new/v0.8.rst",
    "content": ".. _changes_0_8:\n\nVersion 0.8.1\n=============\n\n**September 29, 2021**\n\nChangelog\n---------\n\nMaintenance\n...........\n\n- Make `imbalanced-learn` compatible with `scikit-learn` 1.0.\n  :pr:`864` by :user:`Guillaume Lemaitre <glemaitre>`.\n\nVersion 0.8.0\n=============\n\n**February 18, 2021**\n\nChangelog\n---------\n\nNew features\n............\n\n- Add the function\n  :func:`imblearn.metrics.macro_averaged_mean_absolute_error` returning the\n  average across classes of the MAE. This metric is used in ordinal\n  classification.\n  :pr:`780` by :user:`Aurélien Massiot <AurelienMassiot>`.\n\n- Add the class :class:`imblearn.metrics.pairwise.ValueDifferenceMetric` to\n  compute pairwise distances between samples containing only categorical\n  values.\n  :pr:`796` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Add the class :class:`imblearn.over_sampling.SMOTEN` to over-sample data\n  only containing categorical features.\n  :pr:`802` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Add the possibility to pass any type of samplers in\n  :class:`imblearn.ensemble.BalancedBaggingClassifier` unlocking the\n  implementation of methods based on resampled bagging.\n  :pr:`808` by :user:`Guillaume Lemaitre <glemaitre>`.\n\nEnhancements\n............\n\n- Add option `output_dict` in\n  :func:`imblearn.metrics.classification_report_imbalanced` to return a\n  dictionary instead of a string.\n  :pr:`770` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Added an option to generate smoothed bootstrap in\n  :class:`imblearn.over_sampling.RandomOverSampler`. It is controlled by the\n  parameter `shrinkage`. 
This method is also known as Random Over-Sampling\n  Examples (ROSE).\n  :pr:`754` by :user:`Andrea Lorenzon <andrealorenzon>` and\n  :user:`Guillaume Lemaitre <glemaitre>`.\n\nBug fixes\n.........\n\n- Fix a bug in :class:`imblearn.under_sampling.ClusterCentroids` where\n  `voting=\"hard\"` could have led to selecting a sample from any class instead\n  of the targeted class.\n  :pr:`769` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Fix a bug in :class:`imblearn.FunctionSampler` where validation was performed\n  even with `validate=False` when calling `fit`.\n  :pr:`790` by :user:`Guillaume Lemaitre <glemaitre>`.\n\nMaintenance\n...........\n\n- Remove requirements files in favour of adding the packages in the\n  `extras_require` within the `setup.py` file.\n  :pr:`816` by :user:`Guillaume Lemaitre <glemaitre>`.\n\n- Change the website template to use `pydata-sphinx-theme`.\n  :pr:`801` by :user:`Guillaume Lemaitre <glemaitre>`.\n\nDeprecation\n...........\n\n- The context manager :func:`imblearn.utils.testing.warns` is deprecated in 0.8\n  and will be removed in 1.0.\n  :pr:`815` by :user:`Guillaume Lemaitre <glemaitre>`.\n"
  },
  {
    "path": "doc/whats_new/v0.9.rst",
    "content": ".. _changes_0_9:\n\nVersion 0.9.1\n=============\n\n**May 16, 2022**\n\nChangelog\n---------\n\nThis release provides fixes that make `imbalanced-learn` work with the\nlatest release (`1.1.0`) of `scikit-learn`.\n\nVersion 0.9.0\n=============\n\n**January 11, 2022**\n\nChangelog\n---------\n\nThis release mainly provides fixes that make `imbalanced-learn` work\nwith the latest release (`1.0.2`) of `scikit-learn`.\n"
  },
  {
    "path": "doc/whats_new.rst",
    "content": ".. currentmodule:: imblearn\n\n===============\nRelease history\n===============\n\n.. include:: whats_new/v0.15.rst\n\n.. include:: whats_new/v0.14.rst\n\n.. include:: whats_new/v0.13.rst\n\n.. include:: whats_new/v0.12.rst\n\n.. include:: whats_new/v0.11.rst\n\n.. include:: whats_new/v0.10.rst\n\n.. include:: whats_new/v0.9.rst\n\n.. include:: whats_new/v0.8.rst\n\n.. include:: whats_new/v0.7.rst\n\n.. include:: whats_new/v0.6.rst\n\n.. include:: whats_new/v0.5.rst\n\n.. include:: whats_new/v0.4.rst\n\n.. include:: whats_new/v0.3.rst\n\n.. include:: whats_new/v0.2.rst\n\n.. include:: whats_new/v0.1.rst\n"
  },
  {
    "path": "doc/zzz_references.rst",
    "content": "==========\nReferences\n==========\n\n.. bibliography:: bibtex/refs.bib\n"
  },
  {
    "path": "examples/README.txt",
    "content": ".. _general_examples:\n\nExamples\n--------\n\nGeneral-purpose and introductory examples for the `imbalanced-learn` toolbox.\n"
  },
  {
    "path": "examples/api/README.txt",
    "content": ".. _api_usage:\n\nExamples showing API imbalanced-learn usage\n-------------------------------------------\n\nExamples that show some details regarding the API of imbalanced-learn.\n"
  },
  {
    "path": "examples/api/plot_sampling_strategy_usage.py",
    "content": "\"\"\"\n====================================================\nHow to use ``sampling_strategy`` in imbalanced-learn\n====================================================\n\nThis example shows the different usage of the parameter ``sampling_strategy``\nfor the different family of samplers (i.e. over-sampling, under-sampling. or\ncleaning methods).\n\n\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# License: MIT\n\n# %%\nprint(__doc__)\nimport seaborn as sns\n\nsns.set_context(\"poster\")\n\n# %% [markdown]\n# Create an imbalanced dataset\n# ----------------------------\n#\n# First, we will create an imbalanced data set from a the iris data set.\n\n# %%\nfrom sklearn.datasets import load_iris\n\nfrom imblearn.datasets import make_imbalance\n\niris = load_iris(as_frame=True)\n\nsampling_strategy = {0: 10, 1: 20, 2: 47}\nX, y = make_imbalance(iris.data, iris.target, sampling_strategy=sampling_strategy)\n\n# %%\nimport matplotlib.pyplot as plt\n\nfig, axs = plt.subplots(ncols=2, figsize=(10, 5))\nautopct = \"%.2f\"\niris.target.value_counts().plot.pie(autopct=autopct, ax=axs[0])\naxs[0].set_title(\"Original\")\ny.value_counts().plot.pie(autopct=autopct, ax=axs[1])\naxs[1].set_title(\"Imbalanced\")\nfig.tight_layout()\n\n# %% [markdown]\n# Using ``sampling_strategy`` in resampling algorithms\n# ====================================================\n#\n# `sampling_strategy` as a `float`\n# --------------------------------\n#\n# `sampling_strategy` can be given a `float`. 
For **under-sampling\n# methods**, it corresponds to the ratio :math:`\\alpha_{us}` defined by\n# :math:`N_{rM} = \\alpha_{us} \\times N_{m}` where :math:`N_{rM}` and\n# :math:`N_{m}` are the number of samples in the majority class after\n# resampling and the number of samples in the minority class, respectively.\n\n# %%\n\n# select only 2 classes since the ratio make sense in this case\nbinary_mask = y.isin([0, 1])\nbinary_y = y[binary_mask]\nbinary_X = X[binary_mask]\n\n# %%\nfrom imblearn.under_sampling import RandomUnderSampler\n\nsampling_strategy = 0.8\nrus = RandomUnderSampler(sampling_strategy=sampling_strategy)\nX_res, y_res = rus.fit_resample(binary_X, binary_y)\nax = y_res.value_counts().plot.pie(autopct=autopct)\n_ = ax.set_title(\"Under-sampling\")\n\n# %% [markdown]\n# For **over-sampling methods**, it correspond to the ratio\n# :math:`\\alpha_{os}` defined by :math:`N_{rm} = \\alpha_{os} \\times N_{M}`\n# where :math:`N_{rm}` and :math:`N_{M}` are the number of samples in the\n# minority class after resampling and the number of samples in the majority\n# class, respectively.\n\n# %%\nfrom imblearn.over_sampling import RandomOverSampler\n\nros = RandomOverSampler(sampling_strategy=sampling_strategy)\nX_res, y_res = ros.fit_resample(binary_X, binary_y)\nax = y_res.value_counts().plot.pie(autopct=autopct)\n_ = ax.set_title(\"Over-sampling\")\n\n# %% [markdown]\n# `sampling_strategy` as a `str`\n# -------------------------------\n#\n# `sampling_strategy` can be given as a string which specify the class\n# targeted by the resampling. 
With under- and over-sampling, the number of\n# samples will be equalized.\n#\n# Note that we are using multiple classes from now on.\n\n# %%\nsampling_strategy = \"not minority\"\n\nfig, axs = plt.subplots(ncols=2, figsize=(10, 5))\nrus = RandomUnderSampler(sampling_strategy=sampling_strategy)\nX_res, y_res = rus.fit_resample(X, y)\ny_res.value_counts().plot.pie(autopct=autopct, ax=axs[0])\naxs[0].set_title(\"Under-sampling\")\n\nsampling_strategy = \"not majority\"\nros = RandomOverSampler(sampling_strategy=sampling_strategy)\nX_res, y_res = ros.fit_resample(X, y)\ny_res.value_counts().plot.pie(autopct=autopct, ax=axs[1])\n_ = axs[1].set_title(\"Over-sampling\")\n\n# %% [markdown]\n# With **cleaning method**, the number of samples in each class will not be\n# equalized even if targeted.\n\n# %%\nfrom imblearn.under_sampling import TomekLinks\n\nsampling_strategy = \"not minority\"\ntl = TomekLinks(sampling_strategy=sampling_strategy)\nX_res, y_res = tl.fit_resample(X, y)\nax = y_res.value_counts().plot.pie(autopct=autopct)\n_ = ax.set_title(\"Cleaning\")\n\n# %% [markdown]\n# `sampling_strategy` as a `dict`\n# -------------------------------\n#\n# When `sampling_strategy` is a `dict`, the keys correspond to the targeted\n# classes. The values correspond to the desired number of samples for each\n# targeted class. This is working for both **under- and over-sampling**\n# algorithms but not for the **cleaning algorithms**. 
Use a `list` instead.\n\n# %%\nfig, axs = plt.subplots(ncols=2, figsize=(10, 5))\n\nsampling_strategy = {0: 10, 1: 15, 2: 20}\nrus = RandomUnderSampler(sampling_strategy=sampling_strategy)\nX_res, y_res = rus.fit_resample(X, y)\ny_res.value_counts().plot.pie(autopct=autopct, ax=axs[0])\naxs[0].set_title(\"Under-sampling\")\n\nsampling_strategy = {0: 25, 1: 35, 2: 47}\nros = RandomOverSampler(sampling_strategy=sampling_strategy)\nX_res, y_res = ros.fit_resample(X, y)\ny_res.value_counts().plot.pie(autopct=autopct, ax=axs[1])\n_ = axs[1].set_title(\"Over-sampling\")\n\n# %% [markdown]\n# `sampling_strategy` as a `list`\n# -------------------------------\n#\n# When `sampling_strategy` is a `list`, the list contains the targeted\n# classes. It is used only for **cleaning methods** and raises an error\n# otherwise.\n\n# %%\nsampling_strategy = [0, 1, 2]\ntl = TomekLinks(sampling_strategy=sampling_strategy)\nX_res, y_res = tl.fit_resample(X, y)\nax = y_res.value_counts().plot.pie(autopct=autopct)\n_ = ax.set_title(\"Cleaning\")\n\n# %% [markdown]\n# `sampling_strategy` as a callable\n# ---------------------------------\n#\n# When `sampling_strategy` is a callable, it is a function taking `y` and\n# returning a `dict`. The keys correspond to the targeted classes. The values\n# correspond to the desired number of samples for each class.\n\n\n# %%\ndef ratio_multiplier(y):\n    from collections import Counter\n\n    multiplier = {1: 0.7, 2: 0.95}\n    target_stats = Counter(y)\n    for key, value in target_stats.items():\n        if key in multiplier:\n            target_stats[key] = int(value * multiplier[key])\n    return target_stats\n\n\nX_res, y_res = RandomUnderSampler(sampling_strategy=ratio_multiplier).fit_resample(X, y)\nax = y_res.value_counts().plot.pie(autopct=autopct)\nax.set_title(\"Under-sampling\")\nplt.show()\n"
  },
  {
    "path": "examples/applications/README.txt",
    "content": ".. _realword_examples:\n\nExamples based on real world datasets\n-------------------------------------\n\nExamples which use real-world datasets.\n"
  },
  {
    "path": "examples/applications/plot_impact_imbalanced_classes.py",
    "content": "\"\"\"\n==========================================================\nFitting model on imbalanced datasets and how to fight bias\n==========================================================\n\nThis example illustrates the problem induced by learning on datasets having\nimbalanced classes. Subsequently, we compare different approaches alleviating\nthese negative effects.\n\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# License: MIT\n\n# %%\nprint(__doc__)\n\n# %% [markdown]\n# Problem definition\n# ------------------\n#\n# We are dropping the following features:\n#\n# - \"fnlwgt\": this feature was created while studying the \"adult\" dataset.\n#   Thus, we will not use this feature which is not acquired during the survey.\n# - \"education-num\": it is encoding the same information than \"education\".\n#   Thus, we are removing one of these 2 features.\n\n# %%\nfrom sklearn.datasets import fetch_openml\n\ndf, y = fetch_openml(\"adult\", version=2, as_frame=True, return_X_y=True)\ndf = df.drop(columns=[\"fnlwgt\", \"education-num\"])\n\n# %% [markdown]\n# The \"adult\" dataset as a class ratio of about 3:1\n\n# %%\nclasses_count = y.value_counts()\nclasses_count\n\n# %% [markdown]\n# This dataset is only slightly imbalanced. 
To better highlight the effect of\n# learning from an imbalanced dataset, we will increase its ratio to 30:1\n\n# %%\nfrom imblearn.datasets import make_imbalance\n\nratio = 30\ndf_res, y_res = make_imbalance(\n    df,\n    y,\n    sampling_strategy={classes_count.idxmin(): classes_count.max() // ratio},\n)\ny_res.value_counts()\n\n# %% [markdown]\n# We will perform a cross-validation evaluation to get an estimate of the test\n# score.\n#\n# As a baseline, we could use a classifier which will always predict the\n# majority class independently of the features provided.\n\nfrom sklearn.dummy import DummyClassifier\n\n# %%\nfrom sklearn.model_selection import cross_validate\n\ndummy_clf = DummyClassifier(strategy=\"most_frequent\")\nscoring = [\"accuracy\", \"balanced_accuracy\"]\ncv_result = cross_validate(dummy_clf, df_res, y_res, scoring=scoring)\nprint(f\"Accuracy score of a dummy classifier: {cv_result['test_accuracy'].mean():.3f}\")\n\n# %% [markdown]\n# Instead of using the accuracy, we can use the balanced accuracy which will\n# take into account the balancing issue.\n\n# %%\nprint(\n    \"Balanced accuracy score of a dummy classifier: \"\n    f\"{cv_result['test_balanced_accuracy'].mean():.3f}\"\n)\n\n# %% [markdown]\n# Strategies to learn from an imbalanced dataset\n# ----------------------------------------------\n# We will use a dictionary and a list to continuously store the results of\n# our experiments and show them as a pandas dataframe.\n\n# %%\nindex = []\nscores = {\"Accuracy\": [], \"Balanced accuracy\": []}\n\n# %% [markdown]\n# Dummy baseline\n# ..............\n#\n# Before to train a real machine learning model, we can store the results\n# obtained with our :class:`~sklearn.dummy.DummyClassifier`.\n\n# %%\nimport pandas as pd\n\nindex += [\"Dummy classifier\"]\ncv_result = cross_validate(dummy_clf, df_res, y_res, scoring=scoring)\nscores[\"Accuracy\"].append(cv_result[\"test_accuracy\"].mean())\nscores[\"Balanced 
accuracy\"].append(cv_result[\"test_balanced_accuracy\"].mean())\n\ndf_scores = pd.DataFrame(scores, index=index)\ndf_scores\n\n# %% [markdown]\n# Linear classifier baseline\n# ..........................\n#\n# We will create a machine learning pipeline using a\n# :class:`~sklearn.linear_model.LogisticRegression` classifier. In this regard,\n# we will need to one-hot encode the categorical columns and standardized the\n# numerical columns before to inject the data into the\n# :class:`~sklearn.linear_model.LogisticRegression` classifier.\n#\n# First, we define our numerical and categorical pipelines.\n\n# %%\nfrom sklearn.impute import SimpleImputer\nfrom sklearn.pipeline import make_pipeline\nfrom sklearn.preprocessing import OneHotEncoder, StandardScaler\n\nnum_pipe = make_pipeline(\n    StandardScaler(), SimpleImputer(strategy=\"mean\", add_indicator=True)\n)\ncat_pipe = make_pipeline(\n    SimpleImputer(strategy=\"constant\", fill_value=\"missing\"),\n    OneHotEncoder(handle_unknown=\"ignore\"),\n)\n\n# %% [markdown]\n# Then, we can create a preprocessor which will dispatch the categorical\n# columns to the categorical pipeline and the numerical columns to the\n# numerical pipeline\n\n# %%\nfrom sklearn.compose import make_column_selector as selector\nfrom sklearn.compose import make_column_transformer\n\npreprocessor_linear = make_column_transformer(\n    (num_pipe, selector(dtype_include=\"number\")),\n    (cat_pipe, selector(dtype_include=\"category\")),\n    n_jobs=2,\n)\n\n# %% [markdown]\n# Finally, we connect our preprocessor with our\n# :class:`~sklearn.linear_model.LogisticRegression`. 
We can then evaluate our\n# model.\n\n# %%\nfrom sklearn.linear_model import LogisticRegression\n\nlr_clf = make_pipeline(preprocessor_linear, LogisticRegression(max_iter=1000))\n\n# %%\nindex += [\"Logistic regression\"]\ncv_result = cross_validate(lr_clf, df_res, y_res, scoring=scoring)\nscores[\"Accuracy\"].append(cv_result[\"test_accuracy\"].mean())\nscores[\"Balanced accuracy\"].append(cv_result[\"test_balanced_accuracy\"].mean())\n\ndf_scores = pd.DataFrame(scores, index=index)\ndf_scores\n\n# %% [markdown]\n# We can see that our linear model is learning slightly better than our dummy\n# baseline. However, it is impacted by the class imbalance.\n#\n# We can verify that something similar is happening with a tree-based model\n# such as :class:`~sklearn.ensemble.RandomForestClassifier`. With this type of\n# classifier, we will not need to scale the numerical data, and we will only\n# need to ordinal encode the categorical data.\n\nfrom sklearn.ensemble import RandomForestClassifier\n\n# %%\nfrom sklearn.preprocessing import OrdinalEncoder\n\nnum_pipe = SimpleImputer(strategy=\"mean\", add_indicator=True)\ncat_pipe = make_pipeline(\n    SimpleImputer(strategy=\"constant\", fill_value=\"missing\"),\n    OrdinalEncoder(handle_unknown=\"use_encoded_value\", unknown_value=-1),\n)\n\npreprocessor_tree = make_column_transformer(\n    (num_pipe, selector(dtype_include=\"number\")),\n    (cat_pipe, selector(dtype_include=\"category\")),\n    n_jobs=2,\n)\n\nrf_clf = make_pipeline(\n    preprocessor_tree, RandomForestClassifier(random_state=42, n_jobs=2)\n)\n\n# %%\nindex += [\"Random forest\"]\ncv_result = cross_validate(rf_clf, df_res, y_res, scoring=scoring)\nscores[\"Accuracy\"].append(cv_result[\"test_accuracy\"].mean())\nscores[\"Balanced accuracy\"].append(cv_result[\"test_balanced_accuracy\"].mean())\n\ndf_scores = pd.DataFrame(scores, index=index)\ndf_scores\n\n# %% [markdown]\n# The :class:`~sklearn.ensemble.RandomForestClassifier` is as well affected by\n# the 
class imbalanced, slightly less than the linear model. Now, we will\n# present different approach to improve the performance of these 2 models.\n#\n# Use `class_weight`\n# ..................\n#\n# Most of the models in `scikit-learn` have a parameter `class_weight`. This\n# parameter will affect the computation of the loss in linear model or the\n# criterion in the tree-based model to penalize differently a false\n# classification from the minority and majority class. We can set\n# `class_weight=\"balanced\"` such that the weight applied is inversely\n# proportional to the class frequency. We test this parametrization in both\n# linear model and tree-based model.\n\n# %%\nlr_clf.set_params(logisticregression__class_weight=\"balanced\")\n\nindex += [\"Logistic regression with balanced class weights\"]\ncv_result = cross_validate(lr_clf, df_res, y_res, scoring=scoring)\nscores[\"Accuracy\"].append(cv_result[\"test_accuracy\"].mean())\nscores[\"Balanced accuracy\"].append(cv_result[\"test_balanced_accuracy\"].mean())\n\ndf_scores = pd.DataFrame(scores, index=index)\ndf_scores\n\n# %%\nrf_clf.set_params(randomforestclassifier__class_weight=\"balanced\")\n\nindex += [\"Random forest with balanced class weights\"]\ncv_result = cross_validate(rf_clf, df_res, y_res, scoring=scoring)\nscores[\"Accuracy\"].append(cv_result[\"test_accuracy\"].mean())\nscores[\"Balanced accuracy\"].append(cv_result[\"test_balanced_accuracy\"].mean())\n\ndf_scores = pd.DataFrame(scores, index=index)\ndf_scores\n\n# %% [markdown]\n# We can see that using `class_weight` was really effective for the linear\n# model, alleviating the issue of learning from imbalanced classes. 
However,\n# the :class:`~sklearn.ensemble.RandomForestClassifier` is still biased toward\n# the majority class, mainly due to the criterion which is not suited enough to\n# fight the class imbalance.\n#\n# Resample the training set during learning\n# .........................................\n#\n# Another way is to resample the training set by under-sampling or\n# over-sampling some of the samples. `imbalanced-learn` provides some samplers\n# to do such processing.\n\n# %%\nfrom imblearn.pipeline import make_pipeline as make_pipeline_with_sampler\nfrom imblearn.under_sampling import RandomUnderSampler\n\nlr_clf = make_pipeline_with_sampler(\n    preprocessor_linear,\n    RandomUnderSampler(random_state=42),\n    LogisticRegression(max_iter=1000),\n)\n\n# %%\nindex += [\"Under-sampling + Logistic regression\"]\ncv_result = cross_validate(lr_clf, df_res, y_res, scoring=scoring)\nscores[\"Accuracy\"].append(cv_result[\"test_accuracy\"].mean())\nscores[\"Balanced accuracy\"].append(cv_result[\"test_balanced_accuracy\"].mean())\n\ndf_scores = pd.DataFrame(scores, index=index)\ndf_scores\n\n# %%\nrf_clf = make_pipeline_with_sampler(\n    preprocessor_tree,\n    RandomUnderSampler(random_state=42),\n    RandomForestClassifier(random_state=42, n_jobs=2),\n)\n\n# %%\nindex += [\"Under-sampling + Random forest\"]\ncv_result = cross_validate(rf_clf, df_res, y_res, scoring=scoring)\nscores[\"Accuracy\"].append(cv_result[\"test_accuracy\"].mean())\nscores[\"Balanced accuracy\"].append(cv_result[\"test_balanced_accuracy\"].mean())\n\ndf_scores = pd.DataFrame(scores, index=index)\ndf_scores\n\n# %% [markdown]\n# Applying a random under-sampler before the training of the linear model or\n# random forest, allows to not focus on the majority class at the cost of\n# making more mistake for samples in the majority class (i.e. 
decreased\n# accuracy).\n#\n# We could apply any type of samplers and find which sampler is working best\n# on the current dataset.\n#\n# Instead, we will present another way by using classifiers which will apply\n# sampling internally.\n#\n# Use of specific balanced algorithms from imbalanced-learn\n# .........................................................\n#\n# We already showed that random under-sampling can be effective on decision\n# tree. However, instead of under-sampling once the dataset, one could\n# under-sample the original dataset before to take a bootstrap sample. This is\n# the base of the :class:`imblearn.ensemble.BalancedRandomForestClassifier` and\n# :class:`~imblearn.ensemble.BalancedBaggingClassifier`.\n\n# %%\nfrom imblearn.ensemble import BalancedRandomForestClassifier\n\nrf_clf = make_pipeline(\n    preprocessor_tree,\n    BalancedRandomForestClassifier(\n        sampling_strategy=\"all\",\n        replacement=True,\n        bootstrap=False,\n        random_state=42,\n        n_jobs=2,\n    ),\n)\n\n# %%\nindex += [\"Balanced random forest\"]\ncv_result = cross_validate(rf_clf, df_res, y_res, scoring=scoring)\nscores[\"Accuracy\"].append(cv_result[\"test_accuracy\"].mean())\nscores[\"Balanced accuracy\"].append(cv_result[\"test_balanced_accuracy\"].mean())\n\ndf_scores = pd.DataFrame(scores, index=index)\ndf_scores\n\n# %% [markdown]\n# The performance with the\n# :class:`~imblearn.ensemble.BalancedRandomForestClassifier` is better than\n# applying a single random under-sampling. 
We will use a gradient-boosting\n# classifier within a :class:`~imblearn.ensemble.BalancedBaggingClassifier`.\n\nfrom sklearn.ensemble import HistGradientBoostingClassifier\n\nfrom imblearn.ensemble import BalancedBaggingClassifier\n\nbag_clf = make_pipeline(\n    preprocessor_tree,\n    BalancedBaggingClassifier(\n        estimator=HistGradientBoostingClassifier(random_state=42),\n        n_estimators=10,\n        random_state=42,\n        n_jobs=2,\n    ),\n)\n\nindex += [\"Balanced bag of histogram gradient boosting\"]\ncv_result = cross_validate(bag_clf, df_res, y_res, scoring=scoring)\nscores[\"Accuracy\"].append(cv_result[\"test_accuracy\"].mean())\nscores[\"Balanced accuracy\"].append(cv_result[\"test_balanced_accuracy\"].mean())\n\ndf_scores = pd.DataFrame(scores, index=index)\ndf_scores\n\n# %% [markdown]\n# This last approach is the most effective. The different under-sampling allows\n# to bring some diversity for the different GBDT to learn and not focus on a\n# portion of the majority class.\n"
  },
  {
    "path": "examples/applications/plot_multi_class_under_sampling.py",
    "content": "\"\"\"\n=============================================\nMulticlass classification with under-sampling\n=============================================\n\nSome balancing methods allow for balancing datasets with multiple classes.\nWe provide an example to illustrate the use of those methods which do\nnot differ from the binary case.\n\n\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# License: MIT\n\nfrom collections import Counter\n\nfrom sklearn.datasets import load_iris\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.preprocessing import StandardScaler\n\nfrom imblearn.datasets import make_imbalance\nfrom imblearn.metrics import classification_report_imbalanced\nfrom imblearn.pipeline import make_pipeline\nfrom imblearn.under_sampling import NearMiss\n\nprint(__doc__)\n\nRANDOM_STATE = 42\n\n# Load the iris dataset and make it imbalanced\niris = load_iris()\nX, y = make_imbalance(\n    iris.data,\n    iris.target,\n    sampling_strategy={0: 25, 1: 50, 2: 50},\n    random_state=RANDOM_STATE,\n)\n\nX_train, X_test, y_train, y_test = train_test_split(X, y, random_state=RANDOM_STATE)\n\nprint(f\"Training target statistics: {Counter(y_train)}\")\nprint(f\"Testing target statistics: {Counter(y_test)}\")\n\n# Create a pipeline\npipeline = make_pipeline(NearMiss(version=2), StandardScaler(), LogisticRegression())\npipeline.fit(X_train, y_train)\n\n# Classify and report the results\nprint(classification_report_imbalanced(y_test, pipeline.predict(X_test)))\n"
  },
  {
    "path": "examples/applications/plot_outlier_rejections.py",
    "content": "\"\"\"\n===============================================================\nCustomized sampler to implement an outlier rejections estimator\n===============================================================\n\nThis example illustrates the use of a custom sampler to implement an outlier\nrejections estimator. It can be used easily within a pipeline in which the\nnumber of samples can vary during training, which usually is a limitation of\nthe current scikit-learn pipeline.\n\n\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# License: MIT\n\nimport matplotlib.pyplot as plt\nimport numpy as np\nfrom sklearn.datasets import make_blobs, make_moons\nfrom sklearn.ensemble import IsolationForest\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.metrics import classification_report\n\nfrom imblearn import FunctionSampler\nfrom imblearn.pipeline import make_pipeline\n\nprint(__doc__)\n\nrng = np.random.RandomState(42)\n\n\ndef plot_scatter(X, y, title):\n    \"\"\"Function to plot some data as a scatter plot.\"\"\"\n    plt.figure()\n    plt.scatter(X[y == 1, 0], X[y == 1, 1], label=\"Class #1\")\n    plt.scatter(X[y == 0, 0], X[y == 0, 1], label=\"Class #0\")\n    plt.legend()\n    plt.title(title)\n\n\n##############################################################################\n# Toy data generation\n##############################################################################\n\n##############################################################################\n# We are generating some non Gaussian data set contaminated with some unform\n# noise.\n\nmoons, _ = make_moons(n_samples=500, noise=0.05)\nblobs, _ = make_blobs(\n    n_samples=500, centers=[(-0.75, 2.25), (1.0, 2.0)], cluster_std=0.25\n)\noutliers = rng.uniform(low=-3, high=3, size=(500, 2))\nX_train = np.vstack([moons, blobs, outliers])\ny_train = np.hstack(\n    [\n        np.ones(moons.shape[0], dtype=np.int8),\n        np.zeros(blobs.shape[0], dtype=np.int8),\n    
    rng.randint(0, 2, size=outliers.shape[0], dtype=np.int8),\n    ]\n)\n\nplot_scatter(X_train, y_train, \"Training dataset\")\n\n##############################################################################\n# We will generate some cleaned test data without outliers.\n\nmoons, _ = make_moons(n_samples=50, noise=0.05)\nblobs, _ = make_blobs(\n    n_samples=50, centers=[(-0.75, 2.25), (1.0, 2.0)], cluster_std=0.25\n)\nX_test = np.vstack([moons, blobs])\ny_test = np.hstack(\n    [np.ones(moons.shape[0], dtype=np.int8), np.zeros(blobs.shape[0], dtype=np.int8)]\n)\n\nplot_scatter(X_test, y_test, \"Testing dataset\")\n\n##############################################################################\n# How to use the :class:`~imblearn.FunctionSampler`\n##############################################################################\n\n##############################################################################\n# We first define a function which will use\n# :class:`~sklearn.ensemble.IsolationForest` to eliminate some outliers from\n# our dataset during training. 
The function passed to the\n# :class:`~imblearn.FunctionSampler` will be called when using the method\n# ``fit_resample``.\n\n\ndef outlier_rejection(X, y):\n    \"\"\"This will be our function used to resample our dataset.\"\"\"\n    model = IsolationForest(max_samples=100, contamination=0.4, random_state=rng)\n    model.fit(X)\n    y_pred = model.predict(X)\n    return X[y_pred == 1], y[y_pred == 1]\n\n\nreject_sampler = FunctionSampler(func=outlier_rejection)\nX_inliers, y_inliers = reject_sampler.fit_resample(X_train, y_train)\nplot_scatter(X_inliers, y_inliers, \"Training data without outliers\")\n\n##############################################################################\n# Integrate it within a pipeline\n##############################################################################\n\n##############################################################################\n# By eliminating outliers before the training, the classifier will be less\n# affected during the prediction.\n\npipe = make_pipeline(\n    FunctionSampler(func=outlier_rejection),\n    LogisticRegression(random_state=rng),\n)\ny_pred = pipe.fit(X_train, y_train).predict(X_test)\nprint(classification_report(y_test, y_pred))\n\nclf = LogisticRegression(random_state=rng)\ny_pred = clf.fit(X_train, y_train).predict(X_test)\nprint(classification_report(y_test, y_pred))\n\nplt.show()\n"
  },
  {
    "path": "examples/applications/plot_over_sampling_benchmark_lfw.py",
    "content": "\"\"\"\n==========================================================\nBenchmark over-sampling methods in a face recognition task\n==========================================================\n\nIn this face recognition example two faces are used from the LFW\n(Faces in the Wild) dataset. Several implemented over-sampling\nmethods are used in conjunction with a 3NN classifier in order\nto examine the improvement of the classifier's output quality\nby using an over-sampler.\n\"\"\"\n\n# Authors: Christos Aridas\n#          Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# License: MIT\n\n# %%\nprint(__doc__)\n\nimport seaborn as sns\n\nsns.set_context(\"poster\")\n\n# %% [markdown]\n# Load the dataset\n# ----------------\n#\n# We will use a dataset containing image from know person where we will\n# build a model to recognize the person on the image. We will make this problem\n# a binary problem by taking picture of only George W. Bush and Bill Clinton.\n\n# %%\nimport numpy as np\nfrom sklearn.datasets import fetch_lfw_people\n\ndata = fetch_lfw_people()\ngeorge_bush_id = 1871  # Photos of George W. Bush\nbill_clinton_id = 531  # Photos of Bill Clinton\nclasses = [george_bush_id, bill_clinton_id]\nclasses_name = np.array([\"B. Clinton\", \"G.W. 
Bush\"], dtype=object)\n\n# %%\nmask_photos = np.isin(data.target, classes)\nX, y = data.data[mask_photos], data.target[mask_photos]\ny = (y == george_bush_id).astype(np.int8)\ny = classes_name[y]\n\n# %% [markdown]\n# We can check the ratio between the two classes.\n\n# %%\nimport matplotlib.pyplot as plt\nimport pandas as pd\n\nclass_distribution = pd.Series(y).value_counts(normalize=True)\nax = class_distribution.plot.barh()\nax.set_title(\"Class distribution\")\npos_label = class_distribution.idxmin()\nplt.tight_layout()\nprint(f\"The positive label considered as the minority class is {pos_label}\")\n\n# %% [markdown]\n# We see that we have an imbalanced classification problem with ~95% of the\n# data belonging to the class G.W. Bush.\n#\n# Compare over-sampling approaches\n# --------------------------------\n#\n# We will use different over-sampling approaches and use a kNN classifier\n# to check if we can recognize the 2 presidents. The evaluation will be\n# performed through cross-validation and we will plot the mean ROC curve.\n#\n# We will create different pipelines and evaluate them.\n\nfrom sklearn.neighbors import KNeighborsClassifier\n\nfrom imblearn import FunctionSampler\nfrom imblearn.over_sampling import ADASYN, SMOTE, RandomOverSampler\nfrom imblearn.pipeline import make_pipeline\n\nclassifier = KNeighborsClassifier(n_neighbors=3)\n\npipeline = [\n    make_pipeline(FunctionSampler(), classifier),\n    make_pipeline(RandomOverSampler(random_state=42), classifier),\n    make_pipeline(ADASYN(random_state=42), classifier),\n    make_pipeline(SMOTE(random_state=42), classifier),\n]\n\n# %%\nfrom sklearn.model_selection import StratifiedKFold\n\ncv = StratifiedKFold(n_splits=3)\n\n# %% [markdown]\n# We will compute the mean ROC curve for each pipeline using a different splits\n# provided by the :class:`~sklearn.model_selection.StratifiedKFold`\n# cross-validation.\n\n# %%\nfrom sklearn.metrics import RocCurveDisplay, auc, roc_curve\n\ndisp = []\nfor 
model in pipeline:\n    # compute the mean fpr/tpr to get the mean ROC curve\n    mean_tpr, mean_fpr = 0.0, np.linspace(0, 1, 100)\n    for train, test in cv.split(X, y):\n        model.fit(X[train], y[train])\n        y_proba = model.predict_proba(X[test])\n\n        pos_label_idx = np.flatnonzero(model.classes_ == pos_label)[0]\n        fpr, tpr, thresholds = roc_curve(\n            y[test], y_proba[:, pos_label_idx], pos_label=pos_label\n        )\n        mean_tpr += np.interp(mean_fpr, fpr, tpr)\n        mean_tpr[0] = 0.0\n\n    mean_tpr /= cv.get_n_splits(X, y)\n    mean_tpr[-1] = 1.0\n    mean_auc = auc(mean_fpr, mean_tpr)\n\n    # Create a display that we will reuse to make the aggregated plots for\n    # all methods\n    disp.append(\n        RocCurveDisplay(\n            fpr=mean_fpr,\n            tpr=mean_tpr,\n            roc_auc=mean_auc,\n            name=f\"{model[0].__class__.__name__}\",\n        )\n    )\n\n# %% [markdown]\n# In the previous cell, we created the different mean ROC curve and we can plot\n# them on the same plot.\n\n# %%\nfig, ax = plt.subplots(figsize=(9, 9))\nfor d in disp:\n    d.plot(ax=ax, curve_kwargs={\"linestyle\": \"--\"})\nax.plot([0, 1], [0, 1], linestyle=\"--\", color=\"k\")\nax.axis(\"square\")\nfig.suptitle(\"Comparison of over-sampling methods \\nwith a 3NN classifier\")\nax.set_xlim([0, 1])\nax.set_ylim([0, 1])\nsns.despine(offset=10, ax=ax)\nplt.legend(loc=\"lower right\", fontsize=16)\nplt.tight_layout()\nplt.show()\n\n# %% [markdown]\n# We see that for this task, methods that are generating new samples with some\n# interpolation (i.e. ADASYN and SMOTE) perform better than random\n# over-sampling or no resampling.\n"
  },
  {
    "path": "examples/applications/plot_topic_classication.py",
    "content": "\"\"\"\n=================================================\nExample of topic classification in text documents\n=================================================\n\nThis example shows how to balance the text data before to train a classifier.\n\nNote that for this example, the data are slightly imbalanced but it can happen\nthat for some data sets, the imbalanced ratio is more significant.\n\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# License: MIT\n\n# %%\nprint(__doc__)\n\n# %% [markdown]\n# Setting the data set\n# --------------------\n#\n# We use a part of the 20 newsgroups data set by loading 4 topics. Using the\n# scikit-learn loader, the data are split into a training and a testing set.\n#\n# Note the class \\#3 is the minority class and has almost twice less samples\n# than the majority class.\n\n# %%\nfrom sklearn.datasets import fetch_20newsgroups\n\ncategories = [\n    \"alt.atheism\",\n    \"talk.religion.misc\",\n    \"comp.graphics\",\n    \"sci.space\",\n]\nnewsgroups_train = fetch_20newsgroups(subset=\"train\", categories=categories)\nnewsgroups_test = fetch_20newsgroups(subset=\"test\", categories=categories)\n\nX_train = newsgroups_train.data\nX_test = newsgroups_test.data\n\ny_train = newsgroups_train.target\ny_test = newsgroups_test.target\n\n# %%\nfrom collections import Counter\n\nprint(f\"Training class distributions summary: {Counter(y_train)}\")\nprint(f\"Test class distributions summary: {Counter(y_test)}\")\n\n# %% [markdown]\n# The usual scikit-learn pipeline\n# -------------------------------\n#\n# You might usually use scikit-learn pipeline by combining the TF-IDF\n# vectorizer to feed a multinomial naive bayes classifier. 
A classification\n# report summarizes the results on the testing set.\n#\n# As expected, the recall of the class \\#3 is low, mainly due to the class\n# imbalance.\n\n# %%\nfrom sklearn.feature_extraction.text import TfidfVectorizer\nfrom sklearn.naive_bayes import MultinomialNB\nfrom sklearn.pipeline import make_pipeline\n\nmodel = make_pipeline(TfidfVectorizer(), MultinomialNB())\nmodel.fit(X_train, y_train)\ny_pred = model.predict(X_test)\n\n# %%\nfrom imblearn.metrics import classification_report_imbalanced\n\nprint(classification_report_imbalanced(y_test, y_pred))\n\n# %% [markdown]\n# Balancing the class before classification\n# -----------------------------------------\n#\n# To improve the prediction of the class \\#3, it could be interesting to\n# balance the classes before training the naive Bayes classifier. Therefore,\n# we will use a :class:`~imblearn.under_sampling.RandomUnderSampler` to\n# equalize the number of samples in all the classes before the training.\n#\n# It is also important to note that we are using the\n# :func:`~imblearn.pipeline.make_pipeline` function implemented in\n# imbalanced-learn to properly handle the samplers.\n\n# %%\nfrom imblearn.pipeline import make_pipeline as make_pipeline_imb\nfrom imblearn.under_sampling import RandomUnderSampler\n\nmodel = make_pipeline_imb(TfidfVectorizer(), RandomUnderSampler(), MultinomialNB())\n\nmodel.fit(X_train, y_train)\ny_pred = model.predict(X_test)\n\n# %% [markdown]\n# Although the results are almost identical, it can be seen that the resampling\n# corrected the poor recall of the class \\#3 at the cost of reducing\n# the other metrics for the other classes. However, the overall results are\n# slightly better.\n\n# %%\nprint(classification_report_imbalanced(y_test, y_pred))\n"
  },
  {
    "path": "examples/applications/porto_seguro_keras_under_sampling.py",
    "content": "\"\"\"\n==========================================================\nPorto Seguro: balancing samples in mini-batches with Keras\n==========================================================\n\nThis example compares two strategies to train a neural-network on the Porto\nSeguro Kaggle data set [1]_. The data set is imbalanced and we show that\nbalancing each mini-batch allows to improve performance and reduce the training\ntime.\n\nReferences\n----------\n\n.. [1] https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/data\n\n\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# License: MIT\n\nprint(__doc__)\n\n###############################################################################\n# Data loading\n###############################################################################\n\nfrom collections import Counter\n\nimport numpy as np\nimport pandas as pd\n\n###############################################################################\n# First, you should download the Porto Seguro data set from Kaggle. 
See the\n# link in the introduction.\n\ntraining_data = pd.read_csv(\"./input/train.csv\")\ntesting_data = pd.read_csv(\"./input/test.csv\")\n\ny_train = training_data[[\"id\", \"target\"]].set_index(\"id\")\nX_train = training_data.drop([\"target\"], axis=1).set_index(\"id\")\nX_test = testing_data.set_index(\"id\")\n\n###############################################################################\n# The data set is imbalanced and it will have an effect on the fitting.\n\nprint(f\"The data set is imbalanced: {Counter(y_train['target'])}\")\n\n###############################################################################\n# Define the pre-processing pipeline\n###############################################################################\n\nfrom sklearn.compose import ColumnTransformer\nfrom sklearn.impute import SimpleImputer\nfrom sklearn.pipeline import make_pipeline\nfrom sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler\n\n\ndef convert_float64(X):\n    return X.astype(np.float64)\n\n\n###############################################################################\n# We want to standard scale the numerical features while we want to one-hot\n# encode the categorical features. 
In this regard, we make use of the\n# :class:`~sklearn.compose.ColumnTransformer`.\n\nnumerical_columns = [\n    name for name in X_train.columns if \"_calc_\" in name and \"_bin\" not in name\n]\nnumerical_pipeline = make_pipeline(\n    FunctionTransformer(func=convert_float64, validate=False), StandardScaler()\n)\n\ncategorical_columns = [name for name in X_train.columns if \"_cat\" in name]\ncategorical_pipeline = make_pipeline(\n    SimpleImputer(missing_values=-1, strategy=\"most_frequent\"),\n    OneHotEncoder(categories=\"auto\"),\n)\n\npreprocessor = ColumnTransformer(\n    [\n        (\"numerical_preprocessing\", numerical_pipeline, numerical_columns),\n        (\n            \"categorical_preprocessing\",\n            categorical_pipeline,\n            categorical_columns,\n        ),\n    ],\n    remainder=\"drop\",\n)\n\n# Create an environment variable to avoid using the GPU. This can be changed.\nimport os\n\nos.environ[\"CUDA_VISIBLE_DEVICES\"] = \"-1\"\n\nfrom tensorflow.keras.layers import Activation, BatchNormalization, Dense, Dropout\n\n###############################################################################\n# Create a neural-network\n###############################################################################\nfrom tensorflow.keras.models import Sequential\n\n\ndef make_model(n_features):\n    model = Sequential()\n    model.add(Dense(200, input_shape=(n_features,), kernel_initializer=\"glorot_normal\"))\n    model.add(BatchNormalization())\n    model.add(Activation(\"relu\"))\n    model.add(Dropout(0.5))\n    model.add(Dense(100, kernel_initializer=\"glorot_normal\", use_bias=False))\n    model.add(BatchNormalization())\n    model.add(Activation(\"relu\"))\n    model.add(Dropout(0.25))\n    model.add(Dense(50, kernel_initializer=\"glorot_normal\", use_bias=False))\n    model.add(BatchNormalization())\n    model.add(Activation(\"relu\"))\n    model.add(Dropout(0.15))\n    model.add(Dense(25, kernel_initializer=\"glorot_normal\", 
use_bias=False))\n    model.add(BatchNormalization())\n    model.add(Activation(\"relu\"))\n    model.add(Dropout(0.1))\n    model.add(Dense(1, activation=\"sigmoid\"))\n\n    model.compile(loss=\"binary_crossentropy\", optimizer=\"adam\", metrics=[\"accuracy\"])\n\n    return model\n\n\n###############################################################################\n# We create a decorator to report the computation time.\n\nimport time\nfrom functools import wraps\n\n\ndef timeit(f):\n    @wraps(f)\n    def wrapper(*args, **kwds):\n        start_time = time.time()\n        result = f(*args, **kwds)\n        elapsed_time = time.time() - start_time\n        print(f\"Elapsed computation time: {elapsed_time:.3f} secs\")\n        return (elapsed_time, result)\n\n    return wrapper\n\n\n###############################################################################\n# The first model will be trained using the ``fit`` method and with imbalanced\n# mini-batches.\nimport tensorflow\nfrom sklearn.metrics import roc_auc_score\nfrom sklearn.utils.fixes import parse_version\n\ntf_version = parse_version(tensorflow.__version__)\n\n\n@timeit\ndef fit_predict_imbalanced_model(X_train, y_train, X_test, y_test):\n    model = make_model(X_train.shape[1])\n    model.fit(X_train, y_train, epochs=2, verbose=1, batch_size=1000)\n    if tf_version < parse_version(\"2.6\"):\n        # predict_proba was removed in tensorflow 2.6\n        predict_method = \"predict_proba\"\n    else:\n        predict_method = \"predict\"\n    y_pred = getattr(model, predict_method)(X_test, batch_size=1000)\n    return roc_auc_score(y_test, y_pred)\n\n\n###############################################################################\n# On the contrary, we will use imbalanced-learn to create a generator that\n# yields balanced mini-batches.\n\nfrom imblearn.keras import BalancedBatchGenerator\n\n\n@timeit\ndef fit_predict_balanced_model(X_train, y_train, X_test, y_test):\n    model = 
make_model(X_train.shape[1])\n    training_generator = BalancedBatchGenerator(\n        X_train, y_train, batch_size=1000, random_state=42\n    )\n    model.fit(training_generator, epochs=5, verbose=1)\n    y_pred = model.predict(X_test, batch_size=1000)\n    return roc_auc_score(y_test, y_pred)\n\n\n###############################################################################\n# Classification loop\n###############################################################################\n\n###############################################################################\n# We will perform a 10-fold cross-validation and train the neural-network with\n# the two different strategies previously presented.\n\nfrom sklearn.model_selection import StratifiedKFold\n\nskf = StratifiedKFold(n_splits=10)\n\ncv_results_imbalanced = []\ncv_time_imbalanced = []\ncv_results_balanced = []\ncv_time_balanced = []\nfor train_idx, valid_idx in skf.split(X_train, y_train):\n    X_local_train = preprocessor.fit_transform(X_train.iloc[train_idx])\n    y_local_train = y_train.iloc[train_idx].values.ravel()\n    X_local_test = preprocessor.transform(X_train.iloc[valid_idx])\n    y_local_test = y_train.iloc[valid_idx].values.ravel()\n\n    elapsed_time, roc_auc = fit_predict_imbalanced_model(\n        X_local_train, y_local_train, X_local_test, y_local_test\n    )\n    cv_time_imbalanced.append(elapsed_time)\n    cv_results_imbalanced.append(roc_auc)\n\n    elapsed_time, roc_auc = fit_predict_balanced_model(\n        X_local_train, y_local_train, X_local_test, y_local_test\n    )\n    cv_time_balanced.append(elapsed_time)\n    cv_results_balanced.append(roc_auc)\n\n###############################################################################\n# Plot of the results and computation time\n###############################################################################\n\ndf_results = pd.DataFrame(\n    {\n        \"Balanced model\": cv_results_balanced,\n        \"Imbalanced model\": 
cv_results_imbalanced,\n    }\n)\ndf_results = df_results.unstack().reset_index()\n\ndf_time = pd.DataFrame(\n    {\"Balanced model\": cv_time_balanced, \"Imbalanced model\": cv_time_imbalanced}\n)\ndf_time = df_time.unstack().reset_index()\n\nimport matplotlib.pyplot as plt\nimport seaborn as sns\n\nplt.figure()\nsns.boxplot(y=\"level_0\", x=0, data=df_time)\nsns.despine(top=True, right=True, left=True)\nplt.xlabel(\"time [s]\")\nplt.ylabel(\"\")\nplt.title(\"Computation time difference using a random under-sampling\")\n\nplt.figure()\nsns.boxplot(y=\"level_0\", x=0, data=df_results, whis=10.0)\nsns.despine(top=True, right=True, left=True)\nax = plt.gca()\nax.xaxis.set_major_formatter(plt.FuncFormatter(lambda x, pos: \"%i%%\" % (100 * x)))\nplt.xlabel(\"ROC-AUC\")\nplt.ylabel(\"\")\nplt.title(\"Difference in terms of ROC-AUC using a random under-sampling\")\n"
  },
  {
    "path": "examples/combine/README.txt",
    "content": ".. _combine_examples:\n\nExamples using combine class methods\n====================================\n\nCombine methods mix over- and under-sampling methods. Generally, SMOTE is used for over-sampling while some cleaning methods (i.e., ENN and Tomek links) are used to under-sample.\n"
  },
  {
    "path": "examples/combine/plot_comparison_combine.py",
    "content": "\"\"\"\n==================================================\nCompare sampler combining over- and under-sampling\n==================================================\n\nThis example shows the effect of applying an under-sampling algorithms after\nSMOTE over-sampling. In the literature, Tomek's link and edited nearest\nneighbours are the two methods which have been used and are available in\nimbalanced-learn.\n\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# License: MIT\n\n# %%\nprint(__doc__)\n\nimport matplotlib.pyplot as plt\nimport seaborn as sns\n\nsns.set_context(\"poster\")\n\n\n# %% [markdown]\n# Dataset generation\n# ------------------\n#\n# We will create an imbalanced dataset with a couple of samples. We will use\n# :func:`~sklearn.datasets.make_classification` to generate this dataset.\n\n# %%\nfrom sklearn.datasets import make_classification\n\nX, y = make_classification(\n    n_samples=100,\n    n_features=2,\n    n_informative=2,\n    n_redundant=0,\n    n_repeated=0,\n    n_classes=3,\n    n_clusters_per_class=1,\n    weights=[0.1, 0.2, 0.7],\n    class_sep=0.8,\n    random_state=0,\n)\n\n# %%\n_, ax = plt.subplots(figsize=(6, 6))\n_ = ax.scatter(X[:, 0], X[:, 1], c=y, alpha=0.8, edgecolor=\"k\")\n\n# %% [markdown]\n# The following function will be used to plot the sample space after resampling\n# to illustrate the characteristic of an algorithm.\n\n# %%\nfrom collections import Counter\n\n\ndef plot_resampling(X, y, sampler, ax):\n    \"\"\"Plot the resampled dataset using the sampler.\"\"\"\n    X_res, y_res = sampler.fit_resample(X, y)\n    ax.scatter(X_res[:, 0], X_res[:, 1], c=y_res, alpha=0.8, edgecolor=\"k\")\n    sns.despine(ax=ax, offset=10)\n    ax.set_title(f\"Decision function for {sampler.__class__.__name__}\")\n    return Counter(y_res)\n\n\n# %% [markdown]\n# The following function will be used to plot the decision function of a\n# classifier given some data.\n\n# %%\nimport numpy as np\n\n\ndef 
plot_decision_function(X, y, clf, ax):\n    \"\"\"Plot the decision function of the classifier and the original data.\"\"\"\n    plot_step = 0.02\n    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1\n    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1\n    xx, yy = np.meshgrid(\n        np.arange(x_min, x_max, plot_step), np.arange(y_min, y_max, plot_step)\n    )\n\n    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])\n    Z = Z.reshape(xx.shape)\n    ax.contourf(xx, yy, Z, alpha=0.4)\n    ax.scatter(X[:, 0], X[:, 1], alpha=0.8, c=y, edgecolor=\"k\")\n    ax.set_title(f\"Decision function for {clf[0].__class__.__name__}\")\n\n\n# %% [markdown]\n# :class:`~imblearn.over_sampling.SMOTE` generates new samples. However,\n# this method of over-sampling does not have any knowledge regarding the\n# underlying distribution. Therefore, some noisy samples can be generated, e.g.\n# when the different classes cannot be well separated. Hence, it can be\n# beneficial to apply an under-sampling algorithm to clean the noisy samples.\n# Two methods are usually used in the literature: (i) Tomek's link and (ii)\n# edited nearest neighbours cleaning methods. Imbalanced-learn provides two\n# ready-to-use samplers: :class:`~imblearn.combine.SMOTETomek` and\n# :class:`~imblearn.combine.SMOTEENN`. In general,\n# :class:`~imblearn.combine.SMOTEENN` cleans more noisy data than\n# :class:`~imblearn.combine.SMOTETomek`.\n\n# %%\nfrom sklearn.linear_model import LogisticRegression\n\nfrom imblearn.combine import SMOTEENN, SMOTETomek\nfrom imblearn.over_sampling import SMOTE\nfrom imblearn.pipeline import make_pipeline\n\nsamplers = [SMOTE(random_state=0), SMOTEENN(random_state=0), SMOTETomek(random_state=0)]\n\nfig, axs = plt.subplots(3, 2, figsize=(15, 25))\nfor ax, sampler in zip(axs, samplers):\n    clf = make_pipeline(sampler, LogisticRegression()).fit(X, y)\n    plot_decision_function(X, y, clf, ax[0])\n    plot_resampling(X, y, sampler, ax[1])\nfig.tight_layout()\n\nplt.show()\n"
  },
  {
    "path": "examples/datasets/README.txt",
    "content": ".. _dataset_examples:\n\nDataset examples\n-----------------------\n\nExamples concerning the :mod:`imblearn.datasets` module.\n"
  },
  {
    "path": "examples/datasets/plot_make_imbalance.py",
    "content": "\"\"\"\n============================\nCreate an imbalanced dataset\n============================\n\nAn illustration of the :func:`~imblearn.datasets.make_imbalance` function to\ncreate an imbalanced dataset from a balanced dataset. We show the ability of\n:func:`~imblearn.datasets.make_imbalance` of dealing with Pandas DataFrame.\n\"\"\"\n\n# Authors: Dayvid Oliveira\n#          Christos Aridas\n#          Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# License: MIT\n\n# %%\nprint(__doc__)\n\nimport seaborn as sns\n\nsns.set_context(\"poster\")\n\n# %% [markdown]\n# Generate the dataset\n# --------------------\n#\n# First, we will generate a dataset and convert it to a\n# :class:`~pandas.DataFrame` with arbitrary column names. We will plot the\n# original dataset.\n\n# %%\nimport matplotlib.pyplot as plt\nimport pandas as pd\nfrom sklearn.datasets import make_moons\n\nX, y = make_moons(n_samples=200, shuffle=True, noise=0.5, random_state=10)\nX = pd.DataFrame(X, columns=[\"feature 1\", \"feature 2\"])\nax = X.plot.scatter(\n    x=\"feature 1\",\n    y=\"feature 2\",\n    c=y,\n    colormap=\"viridis\",\n    colorbar=False,\n)\nsns.despine(ax=ax, offset=10)\nplt.tight_layout()\n\n# %% [markdown]\n# Make a dataset imbalanced\n# -------------------------\n#\n# Now, we will show the helpers :func:`~imblearn.datasets.make_imbalance`\n# that is useful to random select a subset of samples. 
It will impact the\n# class distribution as specified by the parameters.\n\n# %%\nfrom collections import Counter\n\n\ndef ratio_func(y, multiplier, minority_class):\n    target_stats = Counter(y)\n    return {minority_class: int(multiplier * target_stats[minority_class])}\n\n\n# %%\nfrom imblearn.datasets import make_imbalance\n\nfig, axs = plt.subplots(nrows=2, ncols=3, figsize=(15, 10))\n\nX.plot.scatter(\n    x=\"feature 1\",\n    y=\"feature 2\",\n    c=y,\n    ax=axs[0, 0],\n    colormap=\"viridis\",\n    colorbar=False,\n)\naxs[0, 0].set_title(\"Original set\")\nsns.despine(ax=axs[0, 0], offset=10)\n\nmultipliers = [0.9, 0.75, 0.5, 0.25, 0.1]\nfor ax, multiplier in zip(axs.ravel()[1:], multipliers):\n    X_resampled, y_resampled = make_imbalance(\n        X,\n        y,\n        sampling_strategy=ratio_func,\n        **{\"multiplier\": multiplier, \"minority_class\": 1},\n    )\n    X_resampled.plot.scatter(\n        x=\"feature 1\",\n        y=\"feature 2\",\n        c=y_resampled,\n        ax=ax,\n        colormap=\"viridis\",\n        colorbar=False,\n    )\n    ax.set_title(f\"Sampling ratio = {multiplier}\")\n    sns.despine(ax=ax, offset=10)\n\nplt.tight_layout()\nplt.show()\n"
  },
  {
    "path": "examples/ensemble/README.txt",
    "content": ".. _ensemble_examples:\n\nExample using ensemble class methods\n====================================\n\nUnder-sampling methods imply that samples of the majority class are lost during the balancing procedure.\nEnsemble methods offer an alternative that uses most of the samples.\nIn fact, an ensemble of balanced sets is created and later used to train any classifier.\n"
  },
  {
    "path": "examples/ensemble/plot_bagging_classifier.py",
    "content": "\"\"\"\n=================================\nBagging classifiers using sampler\n=================================\n\nIn this example, we show how\n:class:`~imblearn.ensemble.BalancedBaggingClassifier` can be used to create a\nlarge variety of classifiers by giving different samplers.\n\nWe will give several examples that have been published in the passed year.\n\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# License: MIT\n\n# %%\nprint(__doc__)\n\n# %% [markdown]\n# Generate an imbalanced dataset\n# ------------------------------\n#\n# For this example, we will create a synthetic dataset using the function\n# :func:`~sklearn.datasets.make_classification`. The problem will be a toy\n# classification problem with a ratio of 1:9 between the two classes.\n\n# %%\nfrom sklearn.datasets import make_classification\n\nX, y = make_classification(\n    n_samples=10_000,\n    n_features=10,\n    weights=[0.1, 0.9],\n    class_sep=0.5,\n    random_state=0,\n)\n\n# %%\nimport pandas as pd\n\npd.Series(y).value_counts(normalize=True)\n\n# %% [markdown]\n# In the following sections, we will show a couple of algorithms that have\n# been proposed over the years. We intend to illustrate how one can reuse the\n# :class:`~imblearn.ensemble.BalancedBaggingClassifier` by passing different\n# sampler.\n\nfrom sklearn.ensemble import BaggingClassifier\n\n# %%\nfrom sklearn.model_selection import cross_validate\n\nebb = BaggingClassifier()\ncv_results = cross_validate(ebb, X, y, scoring=\"balanced_accuracy\")\n\nprint(f\"{cv_results['test_score'].mean():.3f} +/- {cv_results['test_score'].std():.3f}\")\n\n# %% [markdown]\n# Exactly Balanced Bagging and Over-Bagging\n# -----------------------------------------\n#\n# The :class:`~imblearn.ensemble.BalancedBaggingClassifier` can use in\n# conjunction with a :class:`~imblearn.under_sampling.RandomUnderSampler` or\n# :class:`~imblearn.over_sampling.RandomOverSampler`. 
These methods are\n# referred to as Exactly Balanced Bagging and Over-Bagging, respectively, and\n# were first proposed in [1]_.\n\n# %%\nfrom imblearn.ensemble import BalancedBaggingClassifier\nfrom imblearn.under_sampling import RandomUnderSampler\n\n# Exactly Balanced Bagging\nebb = BalancedBaggingClassifier(sampler=RandomUnderSampler())\ncv_results = cross_validate(ebb, X, y, scoring=\"balanced_accuracy\")\n\nprint(f\"{cv_results['test_score'].mean():.3f} +/- {cv_results['test_score'].std():.3f}\")\n\n# %%\nfrom imblearn.over_sampling import RandomOverSampler\n\n# Over-bagging\nover_bagging = BalancedBaggingClassifier(sampler=RandomOverSampler())\ncv_results = cross_validate(over_bagging, X, y, scoring=\"balanced_accuracy\")\n\nprint(f\"{cv_results['test_score'].mean():.3f} +/- {cv_results['test_score'].std():.3f}\")\n\n# %% [markdown]\n# SMOTE-Bagging\n# -------------\n#\n# Instead of using a :class:`~imblearn.over_sampling.RandomOverSampler` that\n# makes a bootstrap, an alternative is to use\n# :class:`~imblearn.over_sampling.SMOTE` as an over-sampler. This is known as\n# SMOTE-Bagging [2]_.\n\n# %%\nfrom imblearn.over_sampling import SMOTE\n\n# SMOTE-Bagging\nsmote_bagging = BalancedBaggingClassifier(sampler=SMOTE())\ncv_results = cross_validate(smote_bagging, X, y, scoring=\"balanced_accuracy\")\n\nprint(f\"{cv_results['test_score'].mean():.3f} +/- {cv_results['test_score'].std():.3f}\")\n\n# %% [markdown]\n# Roughly Balanced Bagging\n# ------------------------\n#\n# While using a :class:`~imblearn.under_sampling.RandomUnderSampler` or\n# :class:`~imblearn.over_sampling.RandomOverSampler` will create exactly the\n# desired number of samples, it does not follow the statistical spirit of\n# the bagging framework. 
The authors of [3]_ propose to use a negative\n# binomial distribution to compute the number of samples of the majority\n# class to be selected and then perform a random under-sampling.\n#\n# Here, we illustrate this method by implementing a function in charge of\n# resampling and use the :class:`~imblearn.FunctionSampler` to integrate it\n# within a :class:`~imblearn.pipeline.Pipeline` and\n# :func:`~sklearn.model_selection.cross_validate`.\n\n# %%\nfrom collections import Counter\n\nimport numpy as np\n\nfrom imblearn import FunctionSampler\n\n\ndef roughly_balanced_bagging(X, y, replace=False):\n    \"\"\"Implementation of Roughly Balanced Bagging for binary problem.\"\"\"\n    # find the minority and majority classes\n    class_counts = Counter(y)\n    majority_class = max(class_counts, key=class_counts.get)\n    minority_class = min(class_counts, key=class_counts.get)\n\n    # compute the number of samples to draw from the majority class using\n    # a negative binomial distribution\n    n_minority_class = class_counts[minority_class]\n    n_majority_resampled = np.random.negative_binomial(n=n_minority_class, p=0.5)\n\n    # draw randomly with or without replacement\n    majority_indices = np.random.choice(\n        np.flatnonzero(y == majority_class),\n        size=n_majority_resampled,\n        replace=replace,\n    )\n    minority_indices = np.random.choice(\n        np.flatnonzero(y == minority_class),\n        size=n_minority_class,\n        replace=replace,\n    )\n    indices = np.hstack([majority_indices, minority_indices])\n\n    return X[indices], y[indices]\n\n\n# Roughly Balanced Bagging\nrbb = BalancedBaggingClassifier(\n    sampler=FunctionSampler(func=roughly_balanced_bagging, kw_args={\"replace\": True})\n)\ncv_results = cross_validate(rbb, X, y, scoring=\"balanced_accuracy\")\n\nprint(f\"{cv_results['test_score'].mean():.3f} +/- {cv_results['test_score'].std():.3f}\")\n\n\n# %% [markdown]\n# .. topic:: References:\n#\n#    .. [1] R. 
Maclin, and D. Opitz. \"An empirical evaluation of bagging and\n#           boosting.\" AAAI/IAAI 1997 (1997): 546-551.\n#\n#    .. [2] S. Wang, and X. Yao. \"Diversity analysis on imbalanced data sets by\n#           using ensemble models.\" 2009 IEEE symposium on computational\n#           intelligence and data mining. IEEE, 2009.\n#\n#    .. [3] S. Hido, H. Kashima, and Y. Takahashi. \"Roughly balanced bagging\n#           for imbalanced data.\" Statistical Analysis and Data Mining: The ASA\n#           Data Science Journal 2.5‐6 (2009): 412-426.\n"
  },
  {
    "path": "examples/ensemble/plot_comparison_ensemble_classifier.py",
    "content": "\"\"\"\n=============================================\nCompare ensemble classifiers using resampling\n=============================================\n\nEnsemble classifiers have shown to improve classification performance compare\nto single learner. However, they will be affected by class imbalance. This\nexample shows the benefit of balancing the training set before to learn\nlearners. We are making the comparison with non-balanced ensemble methods.\n\nWe make a comparison using the balanced accuracy and geometric mean which are\nmetrics widely used in the literature to evaluate models learned on imbalanced\nset.\n\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# License: MIT\n\n# %%\nprint(__doc__)\n\n# %% [markdown]\n# Load an imbalanced dataset\n# --------------------------\n#\n# We will load the UCI SatImage dataset which has an imbalanced ratio of 9.3:1\n# (number of majority sample for a minority sample). The data are then split\n# into training and testing.\n\nfrom sklearn.model_selection import train_test_split\n\n# %%\nfrom imblearn.datasets import fetch_datasets\n\nsatimage = fetch_datasets()[\"satimage\"]\nX, y = satimage.data, satimage.target\nX_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)\n\n# %% [markdown]\n# Classification using a single decision tree\n# -------------------------------------------\n#\n# We train a decision tree classifier which will be used as a baseline for the\n# rest of this example.\n#\n# The results are reported in terms of balanced accuracy and geometric mean\n# which are metrics widely used in the literature to validate model trained on\n# imbalanced set.\n\n# %%\nfrom sklearn.tree import DecisionTreeClassifier\n\ntree = DecisionTreeClassifier()\ntree.fit(X_train, y_train)\ny_pred_tree = tree.predict(X_test)\n\n# %%\nfrom sklearn.metrics import balanced_accuracy_score\n\nfrom imblearn.metrics import geometric_mean_score\n\nprint(\"Decision tree classifier 
performance:\")\nprint(\n    f\"Balanced accuracy: {balanced_accuracy_score(y_test, y_pred_tree):.2f} - \"\n    f\"Geometric mean {geometric_mean_score(y_test, y_pred_tree):.2f}\"\n)\n\n# %%\nimport seaborn as sns\nfrom sklearn.metrics import ConfusionMatrixDisplay\n\nsns.set_context(\"poster\")\n\ndisp = ConfusionMatrixDisplay.from_estimator(tree, X_test, y_test, colorbar=False)\n_ = disp.ax_.set_title(\"Decision tree\")\n\n# %% [markdown]\n# Classification using bagging classifier with and without sampling\n# -----------------------------------------------------------------\n#\n# Instead of using a single tree, we will check if an ensemble of decision tree\n# can actually alleviate the issue induced by the class imbalancing. First, we\n# will use a bagging classifier and its counter part which internally uses a\n# random under-sampling to balanced each bootstrap sample.\n\n# %%\nfrom sklearn.ensemble import BaggingClassifier\n\nfrom imblearn.ensemble import BalancedBaggingClassifier\n\nbagging = BaggingClassifier(n_estimators=50, random_state=0)\nbalanced_bagging = BalancedBaggingClassifier(n_estimators=50, random_state=0)\n\nbagging.fit(X_train, y_train)\nbalanced_bagging.fit(X_train, y_train)\n\ny_pred_bc = bagging.predict(X_test)\ny_pred_bbc = balanced_bagging.predict(X_test)\n\n# %% [markdown]\n# Balancing each bootstrap sample allows to increase significantly the balanced\n# accuracy and the geometric mean.\n\n# %%\nprint(\"Bagging classifier performance:\")\nprint(\n    f\"Balanced accuracy: {balanced_accuracy_score(y_test, y_pred_bc):.2f} - \"\n    f\"Geometric mean {geometric_mean_score(y_test, y_pred_bc):.2f}\"\n)\nprint(\"Balanced Bagging classifier performance:\")\nprint(\n    f\"Balanced accuracy: {balanced_accuracy_score(y_test, y_pred_bbc):.2f} - \"\n    f\"Geometric mean {geometric_mean_score(y_test, y_pred_bbc):.2f}\"\n)\n\n# %%\nimport matplotlib.pyplot as plt\n\nfig, axs = plt.subplots(ncols=2, figsize=(10, 
5))\nConfusionMatrixDisplay.from_estimator(\n    bagging, X_test, y_test, ax=axs[0], colorbar=False\n)\naxs[0].set_title(\"Bagging\")\n\nConfusionMatrixDisplay.from_estimator(\n    balanced_bagging, X_test, y_test, ax=axs[1], colorbar=False\n)\naxs[1].set_title(\"Balanced Bagging\")\n\nfig.tight_layout()\n\n# %% [markdown]\n# Classification using random forest classifier with and without sampling\n# -----------------------------------------------------------------------\n#\n# Random forest is another popular ensemble method and it usually\n# outperforms bagging. Here, we use a vanilla random forest and its balanced\n# counterpart in which each bootstrap sample is balanced.\n\n# %%\nfrom sklearn.ensemble import RandomForestClassifier\n\nfrom imblearn.ensemble import BalancedRandomForestClassifier\n\nrf = RandomForestClassifier(n_estimators=50, random_state=0)\nbrf = BalancedRandomForestClassifier(\n    n_estimators=50,\n    sampling_strategy=\"all\",\n    replacement=True,\n    bootstrap=False,\n    random_state=0,\n)\n\nrf.fit(X_train, y_train)\nbrf.fit(X_train, y_train)\n\ny_pred_rf = rf.predict(X_test)\ny_pred_brf = brf.predict(X_test)\n\n# %% [markdown]\n# Similarly to the previous experiment, the balanced classifier outperforms the\n# classifier which learns from imbalanced bootstrap samples. 
In addition, random\n# forest outperforms the bagging classifier.\n\n# %%\nprint(\"Random Forest classifier performance:\")\nprint(\n    f\"Balanced accuracy: {balanced_accuracy_score(y_test, y_pred_rf):.2f} - \"\n    f\"Geometric mean {geometric_mean_score(y_test, y_pred_rf):.2f}\"\n)\nprint(\"Balanced Random Forest classifier performance:\")\nprint(\n    f\"Balanced accuracy: {balanced_accuracy_score(y_test, y_pred_brf):.2f} - \"\n    f\"Geometric mean {geometric_mean_score(y_test, y_pred_brf):.2f}\"\n)\n\n# %%\nfig, axs = plt.subplots(ncols=2, figsize=(10, 5))\nConfusionMatrixDisplay.from_estimator(rf, X_test, y_test, ax=axs[0], colorbar=False)\naxs[0].set_title(\"Random forest\")\n\nConfusionMatrixDisplay.from_estimator(brf, X_test, y_test, ax=axs[1], colorbar=False)\naxs[1].set_title(\"Balanced random forest\")\n\nfig.tight_layout()\n\n# %% [markdown]\n# Boosting classifier\n# -------------------\n#\n# In the same manner, easy ensemble classifier is a bag of balanced AdaBoost\n# classifier. 
However, it will be slower to train than random forest and will\n# achieve worse performance.\n\n# %%\nfrom sklearn.ensemble import AdaBoostClassifier\n\nfrom imblearn.ensemble import EasyEnsembleClassifier, RUSBoostClassifier\n\nestimator = AdaBoostClassifier(n_estimators=10)\neec = EasyEnsembleClassifier(n_estimators=10, estimator=estimator)\neec.fit(X_train, y_train)\ny_pred_eec = eec.predict(X_test)\n\nrusboost = RUSBoostClassifier(n_estimators=10, estimator=estimator)\nrusboost.fit(X_train, y_train)\ny_pred_rusboost = rusboost.predict(X_test)\n\n# %%\nprint(\"Easy ensemble classifier performance:\")\nprint(\n    f\"Balanced accuracy: {balanced_accuracy_score(y_test, y_pred_eec):.2f} - \"\n    f\"Geometric mean {geometric_mean_score(y_test, y_pred_eec):.2f}\"\n)\nprint(\"RUSBoost classifier performance:\")\nprint(\n    f\"Balanced accuracy: {balanced_accuracy_score(y_test, y_pred_rusboost):.2f} - \"\n    f\"Geometric mean {geometric_mean_score(y_test, y_pred_rusboost):.2f}\"\n)\n\n# %%\nfig, axs = plt.subplots(ncols=2, figsize=(10, 5))\n\nConfusionMatrixDisplay.from_estimator(eec, X_test, y_test, ax=axs[0], colorbar=False)\naxs[0].set_title(\"Easy Ensemble\")\nConfusionMatrixDisplay.from_estimator(\n    rusboost, X_test, y_test, ax=axs[1], colorbar=False\n)\naxs[1].set_title(\"RUSBoost classifier\")\n\nfig.tight_layout()\nplt.show()\n"
  },
  {
    "path": "examples/evaluation/README.txt",
    "content": ".. _evaluation_examples:\n\nEvaluation examples\n-------------------\n\nExamples illustrating how classification using imbalanced dataset can be done.\n"
  },
  {
    "path": "examples/evaluation/plot_classification_report.py",
    "content": "\"\"\"\n=============================================\nEvaluate classification by compiling a report\n=============================================\n\nSpecific metrics have been developed to evaluate classifier which has been\ntrained using imbalanced data. :mod:`imblearn` provides a classification report\nsimilar to :mod:`sklearn`, with additional metrics specific to imbalanced\nlearning problem.\n\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# License: MIT\n\n\nfrom sklearn import datasets\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.preprocessing import StandardScaler\n\nfrom imblearn import over_sampling as os\nfrom imblearn import pipeline as pl\nfrom imblearn.metrics import classification_report_imbalanced\n\nprint(__doc__)\n\nRANDOM_STATE = 42\n\n# Generate a dataset\nX, y = datasets.make_classification(\n    n_classes=2,\n    class_sep=2,\n    weights=[0.1, 0.9],\n    n_informative=10,\n    n_redundant=1,\n    flip_y=0,\n    n_features=20,\n    n_clusters_per_class=4,\n    n_samples=5000,\n    random_state=RANDOM_STATE,\n)\n\npipeline = pl.make_pipeline(\n    StandardScaler(),\n    os.SMOTE(random_state=RANDOM_STATE),\n    LogisticRegression(max_iter=10_000),\n)\n\n# Split the data\nX_train, X_test, y_train, y_test = train_test_split(X, y, random_state=RANDOM_STATE)\n\n# Train the classifier with balancing\npipeline.fit(X_train, y_train)\n\n# Test the classifier and get the prediction\ny_pred_bal = pipeline.predict(X_test)\n\n# Show the classification report\nprint(classification_report_imbalanced(y_test, y_pred_bal))\n"
  },
  {
    "path": "examples/evaluation/plot_metrics.py",
    "content": "\"\"\"\n=======================================\nMetrics specific to imbalanced learning\n=======================================\n\nSpecific metrics have been developed to evaluate classifier which\nhas been trained using imbalanced data. :mod:`imblearn` provides mainly\ntwo additional metrics which are not implemented in :mod:`sklearn`: (i)\ngeometric mean and (ii) index balanced accuracy.\n\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# License: MIT\n\n# %%\nprint(__doc__)\n\nRANDOM_STATE = 42\n\n# %% [markdown]\n# First, we will generate some imbalanced dataset.\n\n# %%\nfrom sklearn.datasets import make_classification\n\nX, y = make_classification(\n    n_classes=3,\n    class_sep=2,\n    weights=[0.1, 0.9],\n    n_informative=10,\n    n_redundant=1,\n    flip_y=0,\n    n_features=20,\n    n_clusters_per_class=4,\n    n_samples=5000,\n    random_state=RANDOM_STATE,\n)\n\n# %% [markdown]\n# We will split the data into a training and testing set.\n\n# %%\nfrom sklearn.model_selection import train_test_split\n\nX_train, X_test, y_train, y_test = train_test_split(\n    X, y, stratify=y, random_state=RANDOM_STATE\n)\n\n# %% [markdown]\n# We will create a pipeline made of a :class:`~imblearn.over_sampling.SMOTE`\n# over-sampler followed by a :class:`~sklearn.linear_model.LogisticRegression`\n# classifier.\n\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.preprocessing import StandardScaler\n\nfrom imblearn.over_sampling import SMOTE\n\n# %%\nfrom imblearn.pipeline import make_pipeline\n\nmodel = make_pipeline(\n    StandardScaler(),\n    SMOTE(random_state=RANDOM_STATE),\n    LogisticRegression(max_iter=10_000, random_state=RANDOM_STATE),\n)\n\n# %% [markdown]\n# Now, we will train the model on the training set and get the prediction\n# associated with the testing set. 
Be aware that the resampling will happen\n# only when calling `fit`: the number of samples in `y_pred` is the same as\n# in `y_test`.\n\n# %%\nmodel.fit(X_train, y_train)\ny_pred = model.predict(X_test)\n\n# %% [markdown]\n# The geometric mean corresponds to the square root of the product of the\n# sensitivity and specificity. Combining the two metrics should account for\n# the balancing of the dataset.\n\n# %%\nfrom imblearn.metrics import geometric_mean_score\n\nprint(f\"The geometric mean is {geometric_mean_score(y_test, y_pred):.3f}\")\n\n# %% [markdown]\n# The index balanced accuracy can transform any metric to be used in\n# imbalanced learning problems.\n\n# %%\nfrom imblearn.metrics import make_index_balanced_accuracy\n\nalpha = 0.1\ngeo_mean = make_index_balanced_accuracy(alpha=alpha, squared=True)(geometric_mean_score)\n\nprint(\n    f\"The IBA using alpha={alpha} and the geometric mean: \"\n    f\"{geo_mean(y_test, y_pred):.3f}\"\n)\n\n# %%\nalpha = 0.5\ngeo_mean = make_index_balanced_accuracy(alpha=alpha, squared=True)(geometric_mean_score)\n\nprint(\n    f\"The IBA using alpha={alpha} and the geometric mean: \"\n    f\"{geo_mean(y_test, y_pred):.3f}\"\n)\n"
  },
  {
    "path": "examples/model_selection/README.txt",
    "content": ".. _model_selection_examples:\n\nModel Selection\n---------------\n\nExamples related to the selection of balancing methods.\n"
  },
  {
    "path": "examples/model_selection/plot_instance_hardness_cv.py",
    "content": "\"\"\"\n====================================================\nDistribute hard-to-classify datapoints over CV folds\n====================================================\n\n'Instance hardness' refers to the difficulty to classify an instance. The way\nhard-to-classify instances are distributed over train and test sets has\nsignificant effect on the test set performance metrics. In this example we\nshow how to deal with this problem. We are making the comparison with normal\n:class:`~sklearn.model_selection.StratifiedKFold` cross-validation splitter.\n\"\"\"\n\n# Authors: Frits Hermans, https://fritshermans.github.io\n# License: MIT\n\n# %%\nprint(__doc__)\n\n# %%\n# Create an imbalanced dataset with instance hardness\n# ---------------------------------------------------\n#\n# We create an imbalanced dataset with using scikit-learn's\n# :func:`~sklearn.datasets.make_blobs` function and set the class imbalance ratio to\n# 5%.\nimport numpy as np\nfrom matplotlib import pyplot as plt\nfrom sklearn.datasets import make_blobs\n\nX, y = make_blobs(n_samples=[950, 50], centers=((-3, 0), (3, 0)), random_state=10)\n_ = plt.scatter(X[:, 0], X[:, 1], c=y)\n\n# %%\n# To introduce instance hardness in our dataset, we add some hard to classify samples:\nX_hard, y_hard = make_blobs(\n    n_samples=10, centers=((3, 0), (-3, 0)), cluster_std=1, random_state=10\n)\nX, y = np.vstack((X, X_hard)), np.hstack((y, y_hard))\n_ = plt.scatter(X[:, 0], X[:, 1], c=y)\n\n# %%\n# Compare cross validation scores using `StratifiedKFold` and `InstanceHardnessCV`\n# --------------------------------------------------------------------------------\n#\n# Now, we want to assess a linear predictive model. Therefore, we should use\n# cross-validation. 
The most important concept with cross-validation is to create\n# training and test splits that are representative of the data in production to have\n# statistical results that one can expect in production.\n#\n# By applying a standard :class:`~sklearn.model_selection.StratifiedKFold`\n# cross-validation splitter, we do not control in which fold the hard-to-classify\n# samples will be.\n#\n# The :class:`~imblearn.model_selection.InstanceHardnessCV` splitter makes it\n# possible to control the distribution of the hard-to-classify samples over the folds.\n#\n# Let's run an experiment to compare the results that we get with both splitters.\n# We use a :class:`~sklearn.linear_model.LogisticRegression` classifier and\n# :func:`~sklearn.model_selection.cross_validate` to calculate the cross validation\n# scores. We use average precision for scoring.\nimport pandas as pd\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.model_selection import StratifiedKFold, cross_validate\n\nfrom imblearn.model_selection import InstanceHardnessCV\n\nlogistic_regression = LogisticRegression()\n\nresults = {}\nfor cv in (\n    StratifiedKFold(n_splits=5, shuffle=True, random_state=10),\n    InstanceHardnessCV(estimator=LogisticRegression()),\n):\n    result = cross_validate(\n        logistic_regression,\n        X,\n        y,\n        cv=cv,\n        scoring=\"average_precision\",\n    )\n    results[cv.__class__.__name__] = result[\"test_score\"]\nresults = pd.DataFrame(results)\n\n# %%\nax = results.plot.box(vert=False, whis=[0, 100])\n_ = ax.set(\n    xlabel=\"Average precision\",\n    title=\"Cross validation scores with different splitters\",\n    xlim=(0, 1),\n)\n\n# %%\n# The boxplot shows that the :class:`~imblearn.model_selection.InstanceHardnessCV`\n# splitter results in less variation of average precision than\n# :class:`~sklearn.model_selection.StratifiedKFold` splitter. 
When doing\n# hyperparameter tuning or feature selection using a wrapper method (like\n# :class:`~sklearn.feature_selection.RFECV`) this will give more stable results.\n"
  },
  {
    "path": "examples/model_selection/plot_validation_curve.py",
    "content": "\"\"\"\n==========================\nPlotting Validation Curves\n==========================\n\nIn this example the impact of the :class:`~imblearn.over_sampling.SMOTE`'s\n`k_neighbors` parameter is examined. In the plot you can see the validation\nscores of a SMOTE-CART classifier for different values of the\n:class:`~imblearn.over_sampling.SMOTE`'s `k_neighbors` parameter.\n\"\"\"\n\n# Authors: Christos Aridas\n#          Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# License: MIT\n\n# %%\nprint(__doc__)\n\nimport seaborn as sns\n\nsns.set_context(\"poster\")\n\n\nRANDOM_STATE = 42\n\n# %% [markdown]\n# Let's first generate a dataset with imbalanced class distribution.\n\n# %%\nfrom sklearn.datasets import make_classification\n\nX, y = make_classification(\n    n_classes=2,\n    class_sep=2,\n    weights=[0.1, 0.9],\n    n_informative=10,\n    n_redundant=1,\n    flip_y=0,\n    n_features=20,\n    n_clusters_per_class=4,\n    n_samples=5000,\n    random_state=RANDOM_STATE,\n)\n\n# %% [markdown]\n# We will use an over-sampler :class:`~imblearn.over_sampling.SMOTE` followed\n# by a :class:`~sklearn.tree.DecisionTreeClassifier`. The aim will be to\n# search which `k_neighbors` parameter is the most adequate with the dataset\n# that we generated.\n\nfrom sklearn.tree import DecisionTreeClassifier\n\n# %%\nfrom imblearn.over_sampling import SMOTE\nfrom imblearn.pipeline import make_pipeline\n\nmodel = make_pipeline(\n    SMOTE(random_state=RANDOM_STATE), DecisionTreeClassifier(random_state=RANDOM_STATE)\n)\n\n# %% [markdown]\n# We can use the :class:`~sklearn.model_selection.validation_curve` to inspect\n# the impact of varying the parameter `k_neighbors`. 
In this case, we need\n# to use a scorer to evaluate the generalization performance during the\n# cross-validation.\n\n# %%\nfrom sklearn.metrics import cohen_kappa_score, make_scorer\nfrom sklearn.model_selection import validation_curve\n\nscorer = make_scorer(cohen_kappa_score)\nparam_range = range(1, 11)\ntrain_scores, test_scores = validation_curve(\n    model,\n    X,\n    y,\n    param_name=\"smote__k_neighbors\",\n    param_range=param_range,\n    cv=3,\n    scoring=scorer,\n)\n\n# %%\ntrain_scores_mean = train_scores.mean(axis=1)\ntrain_scores_std = train_scores.std(axis=1)\ntest_scores_mean = test_scores.mean(axis=1)\ntest_scores_std = test_scores.std(axis=1)\n\n# %% [markdown]\n# We can now plot the results of the cross-validation for the different\n# parameter values that we tried.\n\n# %%\nimport matplotlib.pyplot as plt\n\nfig, ax = plt.subplots(figsize=(7, 7))\nax.plot(param_range, test_scores_mean, label=\"SMOTE\")\nax.fill_between(\n    param_range,\n    test_scores_mean + test_scores_std,\n    test_scores_mean - test_scores_std,\n    alpha=0.2,\n)\nidx_max = test_scores_mean.argmax()\nax.scatter(\n    param_range[idx_max],\n    test_scores_mean[idx_max],\n    label=(\n        r\"Cohen Kappa:\"\n        rf\" ${test_scores_mean[idx_max]:.2f}\\pm{test_scores_std[idx_max]:.2f}$\"\n    ),\n)\n\nfig.suptitle(\"Validation Curve with SMOTE-CART\")\nax.set_xlabel(\"Number of neighbors\")\nax.set_ylabel(\"Cohen's kappa\")\n\n# make nice plotting\nsns.despine(ax=ax, offset=10)\nax.set_xlim([1, 10])\nax.set_ylim([0.4, 0.8])\nax.legend(loc=\"lower right\", fontsize=16)\nplt.tight_layout()\nplt.show()\n"
  },
  {
    "path": "examples/over-sampling/README.txt",
    "content": ".. _over_sampling_examples:\n\nExample using over-sampling class methods\n=========================================\n\nData balancing can be performed by over-sampling such that new samples are generated in the minority class to reach a given balancing ratio.\n"
  },
  {
    "path": "examples/over-sampling/plot_comparison_over_sampling.py",
    "content": "\"\"\"\n==============================\nCompare over-sampling samplers\n==============================\n\nThe following example attends to make a qualitative comparison between the\ndifferent over-sampling algorithms available in the imbalanced-learn package.\n\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# License: MIT\n\n# %%\nprint(__doc__)\n\nimport matplotlib.pyplot as plt\nimport seaborn as sns\n\nsns.set_context(\"poster\")\n\n# %% [markdown]\n# The following function will be used to create toy dataset. It uses the\n# :func:`~sklearn.datasets.make_classification` from scikit-learn but fixing\n# some parameters.\n\n\n# %%\nfrom sklearn.datasets import make_classification\n\n\ndef create_dataset(\n    n_samples=1000,\n    weights=(0.01, 0.01, 0.98),\n    n_classes=3,\n    class_sep=0.8,\n    n_clusters=1,\n):\n    return make_classification(\n        n_samples=n_samples,\n        n_features=2,\n        n_informative=2,\n        n_redundant=0,\n        n_repeated=0,\n        n_classes=n_classes,\n        n_clusters_per_class=n_clusters,\n        weights=list(weights),\n        class_sep=class_sep,\n        random_state=0,\n    )\n\n\n# %% [markdown]\n# The following function will be used to plot the sample space after resampling\n# to illustrate the specificities of an algorithm.\n\n\n# %%\ndef plot_resampling(X, y, sampler, ax, title=None):\n    X_res, y_res = sampler.fit_resample(X, y)\n    ax.scatter(X_res[:, 0], X_res[:, 1], c=y_res, alpha=0.8, edgecolor=\"k\")\n    if title is None:\n        title = f\"Resampling with {sampler.__class__.__name__}\"\n    ax.set_title(title)\n    sns.despine(ax=ax, offset=10)\n\n\n# %% [markdown]\n# The following function will be used to plot the decision function of a\n# classifier given some data.\n\n\n# %%\nimport numpy as np\n\n\ndef plot_decision_function(X, y, clf, ax, title=None):\n    plot_step = 0.02\n    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1\n    y_min, y_max = X[:, 
1].min() - 1, X[:, 1].max() + 1\n    xx, yy = np.meshgrid(\n        np.arange(x_min, x_max, plot_step), np.arange(y_min, y_max, plot_step)\n    )\n\n    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])\n    Z = Z.reshape(xx.shape)\n    ax.contourf(xx, yy, Z, alpha=0.4)\n    ax.scatter(X[:, 0], X[:, 1], alpha=0.8, c=y, edgecolor=\"k\")\n    if title is not None:\n        ax.set_title(title)\n\n\n# %% [markdown]\n# Illustration of the influence of the balancing ratio\n# ----------------------------------------------------\n#\n# We will first illustrate the influence of the balancing ratio on some toy\n# data using a logistic regression classifier which is a linear model.\n\n# %%\nfrom sklearn.linear_model import LogisticRegression\n\nclf = LogisticRegression()\n\n# %% [markdown]\n# We will fit the model and show its decision boundary to illustrate the impact\n# of dealing with imbalanced classes.\n\n# %%\nfig, axs = plt.subplots(nrows=2, ncols=2, figsize=(15, 12))\n\nweights_arr = (\n    (0.01, 0.01, 0.98),\n    (0.01, 0.05, 0.94),\n    (0.2, 0.1, 0.7),\n    (0.33, 0.33, 0.33),\n)\nfor ax, weights in zip(axs.ravel(), weights_arr):\n    X, y = create_dataset(n_samples=300, weights=weights)\n    clf.fit(X, y)\n    plot_decision_function(X, y, clf, ax, title=f\"weight={weights}\")\n    fig.suptitle(f\"Decision function of {clf.__class__.__name__}\")\nfig.tight_layout()\n\n# %% [markdown]\n# The greater the difference between the number of samples in each class, the\n# poorer the classification results.\n#\n# Random over-sampling to balance the data set\n# --------------------------------------------\n#\n# Random over-sampling can be used to repeat some samples and balance the\n# number of samples between the classes. It can be seen that with this trivial\n# approach the decision boundary is already less biased toward the majority\n# class. 
The class :class:`~imblearn.over_sampling.RandomOverSampler`\n# implements such a strategy.\n\nfrom imblearn.over_sampling import RandomOverSampler\n\n# %%\nfrom imblearn.pipeline import make_pipeline\n\nX, y = create_dataset(n_samples=100, weights=(0.05, 0.25, 0.7))\n\nfig, axs = plt.subplots(nrows=1, ncols=2, figsize=(15, 7))\n\nclf.fit(X, y)\nplot_decision_function(X, y, clf, axs[0], title=\"Without resampling\")\n\nsampler = RandomOverSampler(random_state=0)\nmodel = make_pipeline(sampler, clf).fit(X, y)\nplot_decision_function(X, y, model, axs[1], f\"Using {model[0].__class__.__name__}\")\n\nfig.suptitle(f\"Decision function of {clf.__class__.__name__}\")\nfig.tight_layout()\n\n# %% [markdown]\n# By default, random over-sampling generates a bootstrap. The parameter\n# `shrinkage` allows adding a small perturbation to the generated data\n# to generate a smoothed bootstrap instead. The plot below shows the difference\n# between the two data generation strategies.\n\n# %%\nfig, axs = plt.subplots(nrows=1, ncols=2, figsize=(15, 7))\n\nsampler.set_params(shrinkage=None)\nplot_resampling(X, y, sampler, ax=axs[0], title=\"Normal bootstrap\")\n\nsampler.set_params(shrinkage=0.3)\nplot_resampling(X, y, sampler, ax=axs[1], title=\"Smoothed bootstrap\")\n\nfig.suptitle(f\"Resampling with {sampler.__class__.__name__}\")\nfig.tight_layout()\n\n# %% [markdown]\n# It looks like more samples are generated with smoothed bootstrap. 
This is due\n# to the fact that the generated samples are not superimposed on the\n# original samples.\n#\n# More advanced over-sampling using ADASYN and SMOTE\n# --------------------------------------------------\n#\n# Instead of repeating the same samples when over-sampling or perturbing the\n# generated bootstrap samples, one can use some specific heuristic instead.\n# :class:`~imblearn.over_sampling.ADASYN` and\n# :class:`~imblearn.over_sampling.SMOTE` can be used in this case.\n\n# %%\nfrom imblearn import FunctionSampler  # to use an identity sampler\nfrom imblearn.over_sampling import ADASYN, SMOTE\n\nX, y = create_dataset(n_samples=150, weights=(0.1, 0.2, 0.7))\n\nfig, axs = plt.subplots(nrows=2, ncols=2, figsize=(15, 15))\n\nsamplers = [\n    FunctionSampler(),\n    RandomOverSampler(random_state=0),\n    SMOTE(random_state=0),\n    ADASYN(random_state=0),\n]\n\nfor ax, sampler in zip(axs.ravel(), samplers):\n    title = \"Original dataset\" if isinstance(sampler, FunctionSampler) else None\n    plot_resampling(X, y, sampler, ax, title=title)\nfig.tight_layout()\n\n# %% [markdown]\n# The following plot illustrates the difference between\n# :class:`~imblearn.over_sampling.ADASYN` and\n# :class:`~imblearn.over_sampling.SMOTE`.\n# :class:`~imblearn.over_sampling.ADASYN` will focus on the samples which are\n# difficult to classify with a nearest-neighbors rule while regular\n# :class:`~imblearn.over_sampling.SMOTE` will not make any distinction.\n# Therefore, the decision function differs depending on the algorithm.\n\nX, y = create_dataset(n_samples=150, weights=(0.05, 0.25, 0.7))\n\nfig, axs = plt.subplots(nrows=1, ncols=3, figsize=(20, 6))\n\nmodels = {\n    \"Without sampler\": clf,\n    \"ADASYN sampler\": make_pipeline(ADASYN(random_state=0), clf),\n    \"SMOTE sampler\": make_pipeline(SMOTE(random_state=0), clf),\n}\n\nfor ax, (title, model) in zip(axs, models.items()):\n    model.fit(X, y)\n    plot_decision_function(X, y, model, ax=ax, 
title=title)\n\nfig.suptitle(f\"Decision function using a {clf.__class__.__name__}\")\nfig.tight_layout()\n\n# %% [markdown]\n# These sampling particularities can give rise to some specific\n# issues, as illustrated below.\n\n# %%\nX, y = create_dataset(n_samples=5000, weights=(0.01, 0.05, 0.94), class_sep=0.8)\n\nsamplers = [SMOTE(random_state=0), ADASYN(random_state=0)]\n\nfig, axs = plt.subplots(nrows=2, ncols=2, figsize=(15, 15))\nfor ax, sampler in zip(axs, samplers):\n    model = make_pipeline(sampler, clf).fit(X, y)\n    plot_decision_function(\n        X, y, clf, ax[0], title=f\"Decision function with {sampler.__class__.__name__}\"\n    )\n    plot_resampling(X, y, sampler, ax[1])\n\nfig.suptitle(\"Particularities of over-sampling with SMOTE and ADASYN\")\nfig.tight_layout()\n\n# %% [markdown]\n# SMOTE proposes several variants by identifying specific samples to consider\n# during the resampling. The borderline version\n# (:class:`~imblearn.over_sampling.BorderlineSMOTE`) will select the points\n# that lie on the border between two classes. 
The SVM version\n# (:class:`~imblearn.over_sampling.SVMSMOTE`) will use the support vectors\n# found using an SVM algorithm to create new samples while the KMeans version\n# (:class:`~imblearn.over_sampling.KMeansSMOTE`) will cluster the data before\n# generating samples in each cluster independently, depending on each\n# cluster's density.\n\n# %%\nfrom sklearn.cluster import MiniBatchKMeans\n\nfrom imblearn.over_sampling import SVMSMOTE, BorderlineSMOTE, KMeansSMOTE\n\nX, y = create_dataset(n_samples=5000, weights=(0.01, 0.05, 0.94), class_sep=0.8)\n\nfig, axs = plt.subplots(5, 2, figsize=(15, 30))\n\nsamplers = [\n    SMOTE(random_state=0),\n    BorderlineSMOTE(random_state=0, kind=\"borderline-1\"),\n    BorderlineSMOTE(random_state=0, kind=\"borderline-2\"),\n    KMeansSMOTE(\n        kmeans_estimator=MiniBatchKMeans(n_clusters=10, n_init=1, random_state=0),\n        random_state=0,\n    ),\n    SVMSMOTE(random_state=0),\n]\n\nfor ax, sampler in zip(axs, samplers):\n    model = make_pipeline(sampler, clf).fit(X, y)\n    plot_decision_function(\n        X, y, clf, ax[0], title=f\"Decision function for {sampler.__class__.__name__}\"\n    )\n    plot_resampling(X, y, sampler, ax[1])\n\nfig.suptitle(\"Decision function and resampling using SMOTE variants\")\nfig.tight_layout()\n\n# %% [markdown]\n# When dealing with a mix of continuous and categorical features,\n# :class:`~imblearn.over_sampling.SMOTENC` is the only method which can handle\n# this case.\n\n# %%\nfrom collections import Counter\n\nfrom imblearn.over_sampling import SMOTENC\n\nrng = np.random.RandomState(42)\nn_samples = 50\n# Create a dataset of a mix of numerical and categorical data\nX = np.empty((n_samples, 3), dtype=object)\nX[:, 0] = rng.choice([\"A\", \"B\", \"C\"], size=n_samples).astype(object)\nX[:, 1] = rng.randn(n_samples)\nX[:, 2] = rng.randint(3, size=n_samples)\ny = np.array([0] * 20 + [1] * 30)\n\nprint(\"The original imbalanced 
dataset\")\nprint(sorted(Counter(y).items()))\nprint()\nprint(\"The first and last columns contain categorical features:\")\nprint(X[:5])\nprint()\n\nsmote_nc = SMOTENC(categorical_features=[0, 2], random_state=0)\nX_resampled, y_resampled = smote_nc.fit_resample(X, y)\nprint(\"Dataset after resampling:\")\nprint(sorted(Counter(y_resampled).items()))\nprint()\nprint(\"SMOTE-NC will generate categories for the categorical features:\")\nprint(X_resampled[-5:])\nprint()\n\n# %% [markdown]\n# However, if the dataset is composed of only categorical features then one\n# should use :class:`~imblearn.over_sampling.SMOTEN`.\n\n# %%\nfrom imblearn.over_sampling import SMOTEN\n\n# Generate only categorical data\nX = np.array([\"A\"] * 10 + [\"B\"] * 20 + [\"C\"] * 30, dtype=object).reshape(-1, 1)\ny = np.array([0] * 20 + [1] * 40, dtype=np.int32)\n\nprint(f\"Original class counts: {Counter(y)}\")\nprint()\nprint(X[:5])\nprint()\n\nsampler = SMOTEN(random_state=0)\nX_res, y_res = sampler.fit_resample(X, y)\nprint(f\"Class counts after resampling {Counter(y_res)}\")\nprint()\nprint(X_res[-5:])\nprint()\n"
  },
  {
    "path": "examples/over-sampling/plot_illustration_generation_sample.py",
    "content": "\"\"\"\n============================================\nSample generator used in SMOTE-like samplers\n============================================\n\nThis example illustrates how a new sample is generated taking into account the\nneighbourhood of this sample. A new sample is generated by selecting the\nrandomly 2 samples of the same class and interpolating a point between these\nsamples.\n\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# License: MIT\n# %%\nprint(__doc__)\n\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport seaborn as sns\n\nsns.set_context(\"poster\")\n\nrng = np.random.RandomState(18)\n\nf, ax = plt.subplots(figsize=(8, 8))\n\n# generate some data points\ny = np.array([3.65284, 3.52623, 3.51468, 3.22199, 3.21])\nz = np.array([0.43, 0.45, 0.6, 0.4, 0.211])\ny_2 = np.array([3.3, 3.6])\nz_2 = np.array([0.58, 0.34])\n\n# plot the majority and minority samples\nax.scatter(z, y, label=\"Minority class\", s=100)\nax.scatter(z_2, y_2, label=\"Majority class\", s=100)\n\nidx = rng.randint(len(y), size=2)\nannotation = [r\"$x_i$\", r\"$x_{zi}$\"]\n\nfor a, i in zip(annotation, idx):\n    ax.annotate(a, (z[i], y[i]), xytext=tuple([z[i] + 0.01, y[i] + 0.005]), fontsize=15)\n\n# draw the circle in which the new sample will generated\nradius = np.sqrt((z[idx[0]] - z[idx[1]]) ** 2 + (y[idx[0]] - y[idx[1]]) ** 2)\ncircle = plt.Circle((z[idx[0]], y[idx[0]]), radius=radius, alpha=0.2)\nax.add_artist(circle)\n\n# plot the line on which the sample will be generated\nax.plot(z[idx], y[idx], \"--\", alpha=0.5)\n\n# create and plot the new sample\nstep = rng.uniform()\ny_gen = y[idx[0]] + step * (y[idx[1]] - y[idx[0]])\nz_gen = z[idx[0]] + step * (z[idx[1]] - z[idx[0]])\n\nax.scatter(z_gen, y_gen, s=100)\nax.annotate(\n    r\"$x_{new}$\",\n    (z_gen, y_gen),\n    xytext=tuple([z_gen + 0.01, y_gen + 0.005]),\n    fontsize=15,\n)\n\n# make the plot nicer with legend and label\nsns.despine(ax=ax, offset=10)\nax.set_xlim([0.2, 
0.7])\nax.set_ylim([3.2, 3.7])\nplt.xlabel(r\"$X_1$\")\nplt.ylabel(r\"$X_2$\")\nplt.legend()\nplt.tight_layout()\nplt.show()\n"
  },
  {
    "path": "examples/over-sampling/plot_shrinkage_effect.py",
    "content": "\"\"\"\n======================================================\nEffect of the shrinkage factor in random over-sampling\n======================================================\n\nThis example shows the effect of the shrinkage factor used to generate the\nsmoothed bootstrap using the\n:class:`~imblearn.over_sampling.RandomOverSampler`.\n\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# License: MIT\n\n# %%\nprint(__doc__)\n\nimport seaborn as sns\n\nsns.set_context(\"poster\")\n\n# %%\n# First, we will generate a toy classification dataset with only few samples.\n# The ratio between the classes will be imbalanced.\nfrom collections import Counter\n\nfrom sklearn.datasets import make_classification\n\nX, y = make_classification(\n    n_samples=100,\n    n_features=2,\n    n_redundant=0,\n    weights=[0.1, 0.9],\n    random_state=0,\n)\nCounter(y)\n\n\n# %%\nimport matplotlib.pyplot as plt\n\nfig, ax = plt.subplots(figsize=(7, 7))\nscatter = plt.scatter(X[:, 0], X[:, 1], c=y, alpha=0.4)\nclass_legend = ax.legend(*scatter.legend_elements(), loc=\"lower left\", title=\"Classes\")\nax.add_artist(class_legend)\nax.set_xlabel(\"Feature #1\")\n_ = ax.set_ylabel(\"Feature #2\")\nplt.tight_layout()\n\n# %%\n# Now, we will use a :class:`~imblearn.over_sampling.RandomOverSampler` to\n# generate a bootstrap for the minority class with as many samples as in the\n# majority class.\nfrom imblearn.over_sampling import RandomOverSampler\n\nsampler = RandomOverSampler(random_state=0)\nX_res, y_res = sampler.fit_resample(X, y)\nCounter(y_res)\n\n# %%\nfig, ax = plt.subplots(figsize=(7, 7))\nscatter = plt.scatter(X_res[:, 0], X_res[:, 1], c=y_res, alpha=0.4)\nclass_legend = ax.legend(*scatter.legend_elements(), loc=\"lower left\", title=\"Classes\")\nax.add_artist(class_legend)\nax.set_xlabel(\"Feature #1\")\n_ = ax.set_ylabel(\"Feature #2\")\nplt.tight_layout()\n# %%\n# We observe that the minority samples are less transparent than the samples\n# from 
the majority class. This is because these samples\n# of the minority class are repeated during the bootstrap generation.\n#\n# We can set `shrinkage` to a floating value to add a small perturbation to the\n# samples created and therefore create a smoothed bootstrap.\nsampler = RandomOverSampler(shrinkage=1, random_state=0)\nX_res, y_res = sampler.fit_resample(X, y)\nCounter(y_res)\n\n# %%\nfig, ax = plt.subplots(figsize=(7, 7))\nscatter = plt.scatter(X_res[:, 0], X_res[:, 1], c=y_res, alpha=0.4)\nclass_legend = ax.legend(*scatter.legend_elements(), loc=\"lower left\", title=\"Classes\")\nax.add_artist(class_legend)\nax.set_xlabel(\"Feature #1\")\n_ = ax.set_ylabel(\"Feature #2\")\nplt.tight_layout()\n\n# %%\n# In this case, we see that the samples in the minority class are not\n# overlapping anymore due to the added noise.\n#\n# The parameter `shrinkage` controls the amount of perturbation. Let's\n# add more perturbation when generating the smoothed bootstrap.\nsampler = RandomOverSampler(shrinkage=3, random_state=0)\nX_res, y_res = sampler.fit_resample(X, y)\nCounter(y_res)\n\n# %%\nfig, ax = plt.subplots(figsize=(7, 7))\nscatter = plt.scatter(X_res[:, 0], X_res[:, 1], c=y_res, alpha=0.4)\nclass_legend = ax.legend(*scatter.legend_elements(), loc=\"lower left\", title=\"Classes\")\nax.add_artist(class_legend)\nax.set_xlabel(\"Feature #1\")\n_ = ax.set_ylabel(\"Feature #2\")\nplt.tight_layout()\n\n# %%\n# Increasing the value of `shrinkage` will disperse the new samples. 
Forcing\n# the shrinkage to 0 will be equivalent to generating a normal bootstrap.\nsampler = RandomOverSampler(shrinkage=0, random_state=0)\nX_res, y_res = sampler.fit_resample(X, y)\nCounter(y_res)\n\n# %%\nfig, ax = plt.subplots(figsize=(7, 7))\nscatter = plt.scatter(X_res[:, 0], X_res[:, 1], c=y_res, alpha=0.4)\nclass_legend = ax.legend(*scatter.legend_elements(), loc=\"lower left\", title=\"Classes\")\nax.add_artist(class_legend)\nax.set_xlabel(\"Feature #1\")\n_ = ax.set_ylabel(\"Feature #2\")\nplt.tight_layout()\n\n# %%\n# Therefore, the `shrinkage` is handy to manually tune the dispersion of the\n# new samples.\n"
  },
  {
    "path": "examples/pipeline/README.txt",
    "content": ".. _pipeline_examples:\n\nPipeline examples\n=================\n\nExample of how to use the a pipeline to include under-sampling with `scikit-learn` estimators.\n"
  },
  {
    "path": "examples/pipeline/plot_pipeline_classification.py",
    "content": "\"\"\"\n====================================\nUsage of pipeline embedding samplers\n====================================\n\nAn example of the :class:~imblearn.pipeline.Pipeline` object (or\n:func:`~imblearn.pipeline.make_pipeline` helper function) working with\ntransformers and resamplers.\n\"\"\"\n\n# Authors: Christos Aridas\n#          Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# License: MIT\n\n# %%\nprint(__doc__)\n\n# %% [markdown]\n# Let's first create an imbalanced dataset and split in to two sets.\n\n# %%\nfrom sklearn.datasets import make_classification\nfrom sklearn.model_selection import train_test_split\n\nX, y = make_classification(\n    n_classes=2,\n    class_sep=1.25,\n    weights=[0.3, 0.7],\n    n_informative=3,\n    n_redundant=1,\n    flip_y=0,\n    n_features=5,\n    n_clusters_per_class=1,\n    n_samples=5000,\n    random_state=10,\n)\n\nX_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)\n\n# %% [markdown]\n# Now, we will create each individual steps that we would like later to combine\n\n# %%\nfrom sklearn.decomposition import PCA\nfrom sklearn.neighbors import KNeighborsClassifier\n\nfrom imblearn.over_sampling import SMOTE\nfrom imblearn.under_sampling import EditedNearestNeighbours\n\npca = PCA(n_components=2)\nenn = EditedNearestNeighbours()\nsmote = SMOTE(random_state=0)\nknn = KNeighborsClassifier(n_neighbors=1)\n\n# %% [markdown]\n# Now, we can finally create a pipeline to specify in which order the different\n# transformers and samplers should be executed before to provide the data to\n# the final classifier.\n\n# %%\nfrom imblearn.pipeline import make_pipeline\n\nmodel = make_pipeline(pca, enn, smote, knn)\n\n# %% [markdown]\n# We can now use the pipeline created as a normal classifier where resampling\n# will happen when calling `fit` and disabled when calling `decision_function`,\n# `predict_proba`, or `predict`.\n\n# %%\nfrom sklearn.metrics import 
classification_report\n\nmodel.fit(X_train, y_train)\ny_pred = model.predict(X_test)\nprint(classification_report(y_test, y_pred))\n"
  },
  {
    "path": "examples/under-sampling/README.txt",
    "content": ".. _under_sampling_examples:\n\nExample using under-sampling class methods\n==========================================\n\nUnder-sampling refers to the process of reducing the number of samples in the majority classes.\nThe implemented methods can be categorized into 2 groups: (i) fixed under-sampling and (ii) cleaning under-sampling.\n"
  },
  {
    "path": "examples/under-sampling/plot_comparison_under_sampling.py",
    "content": "\"\"\"\n===============================\nCompare under-sampling samplers\n===============================\n\nThe following example attends to make a qualitative comparison between the\ndifferent under-sampling algorithms available in the imbalanced-learn package.\n\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# License: MIT\n\n# %%\nprint(__doc__)\n\nimport seaborn as sns\n\nsns.set_context(\"poster\")\n\n# %% [markdown]\n# The following function will be used to create toy dataset. It uses the\n# :func:`~sklearn.datasets.make_classification` from scikit-learn but fixing\n# some parameters.\n\n\n# %%\nfrom sklearn.datasets import make_classification\n\n\ndef create_dataset(\n    n_samples=1000,\n    weights=(0.01, 0.01, 0.98),\n    n_classes=3,\n    class_sep=0.8,\n    n_clusters=1,\n):\n    return make_classification(\n        n_samples=n_samples,\n        n_features=2,\n        n_informative=2,\n        n_redundant=0,\n        n_repeated=0,\n        n_classes=n_classes,\n        n_clusters_per_class=n_clusters,\n        weights=list(weights),\n        class_sep=class_sep,\n        random_state=0,\n    )\n\n\n# %% [markdown]\n# The following function will be used to plot the sample space after resampling\n# to illustrate the specificities of an algorithm.\n\n\n# %%\ndef plot_resampling(X, y, sampler, ax, title=None):\n    X_res, y_res = sampler.fit_resample(X, y)\n    ax.scatter(X_res[:, 0], X_res[:, 1], c=y_res, alpha=0.8, edgecolor=\"k\")\n    if title is None:\n        title = f\"Resampling with {sampler.__class__.__name__}\"\n    ax.set_title(title)\n    sns.despine(ax=ax, offset=10)\n\n\n# %% [markdown]\n# The following function will be used to plot the decision function of a\n# classifier given some data.\n\n\n# %%\nimport numpy as np\n\n\ndef plot_decision_function(X, y, clf, ax, title=None):\n    plot_step = 0.02\n    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1\n    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() 
+ 1\n    xx, yy = np.meshgrid(\n        np.arange(x_min, x_max, plot_step), np.arange(y_min, y_max, plot_step)\n    )\n\n    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])\n    Z = Z.reshape(xx.shape)\n    ax.contourf(xx, yy, Z, alpha=0.4)\n    ax.scatter(X[:, 0], X[:, 1], alpha=0.8, c=y, edgecolor=\"k\")\n    if title is not None:\n        ax.set_title(title)\n\n\n# %%\nfrom sklearn.linear_model import LogisticRegression\n\nclf = LogisticRegression()\n\n\n# %% [markdown]\n# Prototype generation: under-sampling by generating new samples\n# --------------------------------------------------------------\n#\n# :class:`~imblearn.under_sampling.ClusterCentroids` under-samples by replacing\n# the original samples with the centroids of the clusters found.\n\n# %%\nimport matplotlib.pyplot as plt\nfrom sklearn.cluster import MiniBatchKMeans\n\nfrom imblearn import FunctionSampler\nfrom imblearn.pipeline import make_pipeline\nfrom imblearn.under_sampling import ClusterCentroids\n\nX, y = create_dataset(n_samples=400, weights=(0.05, 0.15, 0.8), class_sep=0.8)\n\n# use a list (not a set) so that the plotting order is deterministic\nsamplers = [\n    FunctionSampler(),  # identity resampler\n    ClusterCentroids(\n        estimator=MiniBatchKMeans(n_init=1, random_state=0), random_state=0\n    ),\n]\n\nfig, axs = plt.subplots(nrows=2, ncols=2, figsize=(15, 15))\nfor ax, sampler in zip(axs, samplers):\n    model = make_pipeline(sampler, clf).fit(X, y)\n    plot_decision_function(\n        X, y, model, ax[0], title=f\"Decision function with {sampler.__class__.__name__}\"\n    )\n    plot_resampling(X, y, sampler, ax[1])\n\nfig.tight_layout()\n\n# %% [markdown]\n# Prototype selection: under-sampling by selecting existing samples\n# -----------------------------------------------------------------\n#\n# The algorithms performing prototype selection can be subdivided into two\n# groups: (i) the controlled under-sampling methods and (ii) the cleaning\n# under-sampling methods.\n#\n# With the controlled under-sampling methods, the number of samples to 
be\n# selected can be specified.\n# :class:`~imblearn.under_sampling.RandomUnderSampler` is the most naive way of\n# performing such a selection: it randomly selects a given number of samples\n# from each targeted class.\n\n# %%\nfrom imblearn.under_sampling import RandomUnderSampler\n\nX, y = create_dataset(n_samples=400, weights=(0.05, 0.15, 0.8), class_sep=0.8)\n\n# use a list (not a set) so that the plotting order is deterministic\nsamplers = [\n    FunctionSampler(),  # identity resampler\n    RandomUnderSampler(random_state=0),\n]\n\nfig, axs = plt.subplots(nrows=2, ncols=2, figsize=(15, 15))\nfor ax, sampler in zip(axs, samplers):\n    model = make_pipeline(sampler, clf).fit(X, y)\n    plot_decision_function(\n        X, y, model, ax[0], title=f\"Decision function with {sampler.__class__.__name__}\"\n    )\n    plot_resampling(X, y, sampler, ax[1])\n\nfig.tight_layout()\n\n# %% [markdown]\n# :class:`~imblearn.under_sampling.NearMiss` algorithms implement some\n# heuristic rules in order to select samples. NearMiss-1 selects samples from\n# the majority class for which the average distance to the :math:`k` nearest\n# samples of the minority class is the smallest. NearMiss-2 selects the samples\n# from the majority class for which the average distance to the farthest\n# samples of the minority class is the smallest. 
NearMiss-3 is a 2-step\n# algorithm: first, for each minority sample, its :math:`m`\n# nearest neighbors are kept; then, the majority samples selected are the\n# ones for which the average distance to the :math:`k` nearest neighbors is the\n# largest.\n\n# %%\nfrom imblearn.under_sampling import NearMiss\n\nX, y = create_dataset(n_samples=1000, weights=(0.05, 0.15, 0.8), class_sep=1.5)\n\nsamplers = [NearMiss(version=1), NearMiss(version=2), NearMiss(version=3)]\n\nfig, axs = plt.subplots(nrows=3, ncols=2, figsize=(15, 25))\nfor ax, sampler in zip(axs, samplers):\n    model = make_pipeline(sampler, clf).fit(X, y)\n    plot_decision_function(\n        X,\n        y,\n        model,\n        ax[0],\n        title=f\"Decision function for {sampler.__class__.__name__}-{sampler.version}\",\n    )\n    plot_resampling(\n        X,\n        y,\n        sampler,\n        ax[1],\n        title=f\"Resampling using {sampler.__class__.__name__}-{sampler.version}\",\n    )\nfig.tight_layout()\n\n# %% [markdown]\n# :class:`~imblearn.under_sampling.EditedNearestNeighbours` removes samples of\n# the majority class whose class differs from that of their\n# nearest neighbors. 
This sieve can be repeated, which is the principle of\n# :class:`~imblearn.under_sampling.RepeatedEditedNearestNeighbours`.\n# :class:`~imblearn.under_sampling.AllKNN` differs slightly from\n# :class:`~imblearn.under_sampling.RepeatedEditedNearestNeighbours` by changing\n# the :math:`k` parameter of the internal nearest neighbors algorithm,\n# increasing it at each iteration.\n\n# %%\nfrom imblearn.under_sampling import (\n    AllKNN,\n    EditedNearestNeighbours,\n    RepeatedEditedNearestNeighbours,\n)\n\nX, y = create_dataset(n_samples=500, weights=(0.2, 0.3, 0.5), class_sep=0.8)\n\nsamplers = [\n    EditedNearestNeighbours(),\n    RepeatedEditedNearestNeighbours(),\n    AllKNN(allow_minority=True),\n]\n\nfig, axs = plt.subplots(3, 2, figsize=(15, 25))\nfor ax, sampler in zip(axs, samplers):\n    model = make_pipeline(sampler, clf).fit(X, y)\n    plot_decision_function(\n        X, y, model, ax[0], title=f\"Decision function for \\n{sampler.__class__.__name__}\"\n    )\n    plot_resampling(\n        X, y, sampler, ax[1], title=f\"Resampling using \\n{sampler.__class__.__name__}\"\n    )\n\nfig.tight_layout()\n\n# %% [markdown]\n# :class:`~imblearn.under_sampling.CondensedNearestNeighbour` makes use of a\n# 1-NN to iteratively decide if a sample should be kept in a dataset or not.\n# The issue is that :class:`~imblearn.under_sampling.CondensedNearestNeighbour`\n# is sensitive to noise and preserves noisy samples.\n# :class:`~imblearn.under_sampling.OneSidedSelection` also uses a 1-NN and\n# uses :class:`~imblearn.under_sampling.TomekLinks` to remove the samples\n# considered noisy.\n# :class:`~imblearn.under_sampling.NeighbourhoodCleaningRule` uses\n# :class:`~imblearn.under_sampling.EditedNearestNeighbours` to remove some\n# samples. 
Additionally, it uses a 3 nearest-neighbors rule to remove samples which\n# do not agree with this rule.\n\n# %%\nfrom imblearn.under_sampling import (\n    CondensedNearestNeighbour,\n    NeighbourhoodCleaningRule,\n    OneSidedSelection,\n)\n\nX, y = create_dataset(n_samples=500, weights=(0.2, 0.3, 0.5), class_sep=0.8)\n\nfig, axs = plt.subplots(nrows=3, ncols=2, figsize=(15, 25))\n\nsamplers = [\n    CondensedNearestNeighbour(random_state=0),\n    OneSidedSelection(random_state=0),\n    NeighbourhoodCleaningRule(n_neighbors=11),\n]\n\nfor ax, sampler in zip(axs, samplers):\n    model = make_pipeline(sampler, clf).fit(X, y)\n    plot_decision_function(\n        X, y, model, ax[0], title=f\"Decision function for \\n{sampler.__class__.__name__}\"\n    )\n    plot_resampling(\n        X, y, sampler, ax[1], title=f\"Resampling using \\n{sampler.__class__.__name__}\"\n    )\nfig.tight_layout()\n\n# %% [markdown]\n# :class:`~imblearn.under_sampling.InstanceHardnessThreshold` uses the\n# predictions of a classifier to exclude samples. All samples which are\n# classified with a low probability will be removed.\n\n# %%\nfrom imblearn.under_sampling import InstanceHardnessThreshold\n\n# use a list (not a set) so that the plotting order is deterministic\nsamplers = [\n    FunctionSampler(),  # identity resampler\n    InstanceHardnessThreshold(\n        estimator=LogisticRegression(),\n        random_state=0,\n    ),\n]\n\nfig, axs = plt.subplots(nrows=2, ncols=2, figsize=(15, 15))\nfor ax, sampler in zip(axs, samplers):\n    model = make_pipeline(sampler, clf).fit(X, y)\n    plot_decision_function(\n        X,\n        y,\n        model,\n        ax[0],\n        title=f\"Decision function with \\n{sampler.__class__.__name__}\",\n    )\n    plot_resampling(\n        X, y, sampler, ax[1], title=f\"Resampling using \\n{sampler.__class__.__name__}\"\n    )\n\nfig.tight_layout()\nplt.show()\n"
  },
  {
    "path": "examples/under-sampling/plot_illustration_nearmiss.py",
    "content": "\"\"\"\n============================\nSample selection in NearMiss\n============================\n\nThis example illustrates the different way of selecting example in\n:class:`~imblearn.under_sampling.NearMiss`.\n\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# License: MIT\n\n# %%\nprint(__doc__)\n\nimport seaborn as sns\n\nsns.set_context(\"poster\")\n\n# %% [markdown]\n# We define a function allowing to make some nice decoration on the plot.\n\n# %%\n\n\ndef make_plot_despine(ax):\n    sns.despine(ax=ax, offset=10)\n    ax.set_xlim([0, 3.5])\n    ax.set_ylim([0, 3.5])\n    ax.set_xticks(np.arange(0, 3.6, 0.5))\n    ax.set_yticks(np.arange(0, 3.6, 0.5))\n    ax.set_xlabel(r\"$X_1$\")\n    ax.set_ylabel(r\"$X_2$\")\n    ax.legend(loc=\"upper left\", fontsize=16)\n\n\n# %% [markdown]\n# We can start by generating some data to later illustrate the principle of\n# each :class:`~imblearn.under_sampling.NearMiss` heuristic rules.\n\n# %%\nimport numpy as np\n\nrng = np.random.RandomState(18)\n\nX_minority = np.transpose(\n    [[1.1, 1.3, 1.15, 0.8, 0.8, 0.6, 0.55], [1.0, 1.5, 1.7, 2.5, 2.0, 1.2, 0.55]]\n)\nX_majority = np.transpose(\n    [\n        [2.1, 2.12, 2.13, 2.14, 2.2, 2.3, 2.5, 2.45],\n        [1.5, 2.1, 2.7, 0.9, 1.0, 1.4, 2.4, 2.9],\n    ]\n)\n\n# %% [mardown]\n# NearMiss-1\n# ----------\n#\n# NearMiss-1 selects samples from the majority class for which the average\n# distance to some nearest neighbours is the smallest. In the following\n# example, we use a 3-NN to compute the average distance on 2 specific samples\n# of the majority class. 
Therefore, in this case the point linked by the\n# green dashed lines will be selected since its average distance is smaller.\n\n# %%\nimport matplotlib.pyplot as plt\nfrom sklearn.neighbors import NearestNeighbors\n\nfig, ax = plt.subplots(figsize=(8, 8))\nax.scatter(\n    X_minority[:, 0],\n    X_minority[:, 1],\n    label=\"Minority class\",\n    s=200,\n    marker=\"_\",\n)\nax.scatter(\n    X_majority[:, 0],\n    X_majority[:, 1],\n    label=\"Majority class\",\n    s=200,\n    marker=\"+\",\n)\n\nnearest_neighbors = NearestNeighbors(n_neighbors=3)\nnearest_neighbors.fit(X_minority)\ndist, ind = nearest_neighbors.kneighbors(X_majority[:2, :])\ndist_avg = dist.sum(axis=1) / 3\n\nfor positive_idx, (neighbors, distance, color) in enumerate(\n    zip(ind, dist_avg, [\"g\", \"r\"])\n):\n    for make_plot, sample_idx in enumerate(neighbors):\n        ax.plot(\n            [X_majority[positive_idx, 0], X_minority[sample_idx, 0]],\n            [X_majority[positive_idx, 1], X_minority[sample_idx, 1]],\n            \"--\" + color,\n            alpha=0.3,\n            label=f\"Avg. dist.={distance:.2f}\" if make_plot == 0 else \"\",\n        )\nax.set_title(\"NearMiss-1\")\nmake_plot_despine(ax)\nplt.tight_layout()\n\n# %% [markdown]\n# NearMiss-2\n# ----------\n#\n# NearMiss-2 selects samples from the majority class for which the average\n# distance to the farthest neighbors is the smallest. 
With the same\n# configuration as previously presented, the sample linked to the green dashed\n# lines will be selected since its average distance to the 3 farthest neighbors\n# is the smallest.\n\n# %%\nfig, ax = plt.subplots(figsize=(8, 8))\nax.scatter(\n    X_minority[:, 0],\n    X_minority[:, 1],\n    label=\"Minority class\",\n    s=200,\n    marker=\"_\",\n)\nax.scatter(\n    X_majority[:, 0],\n    X_majority[:, 1],\n    label=\"Majority class\",\n    s=200,\n    marker=\"+\",\n)\n\nnearest_neighbors = NearestNeighbors(n_neighbors=X_minority.shape[0])\nnearest_neighbors.fit(X_minority)\ndist, ind = nearest_neighbors.kneighbors(X_majority[:2, :])\n# neighbors are sorted by increasing distance: keep the 3 farthest ones\ndist = dist[:, -3:]\nind = ind[:, -3:]\ndist_avg = dist.sum(axis=1) / 3\n\nfor positive_idx, (neighbors, distance, color) in enumerate(\n    zip(ind, dist_avg, [\"g\", \"r\"])\n):\n    for make_plot, sample_idx in enumerate(neighbors):\n        ax.plot(\n            [X_majority[positive_idx, 0], X_minority[sample_idx, 0]],\n            [X_majority[positive_idx, 1], X_minority[sample_idx, 1]],\n            \"--\" + color,\n            alpha=0.3,\n            label=f\"Avg. dist.={distance:.2f}\" if make_plot == 0 else \"\",\n        )\nax.set_title(\"NearMiss-2\")\nmake_plot_despine(ax)\nplt.tight_layout()\n\n# %% [markdown]\n# NearMiss-3\n# ----------\n#\n# NearMiss-3 can be divided into 2 steps. First, a nearest-neighbors search is\n# used to short-list samples from the majority class (i.e. the\n# highlighted samples in the following plot). 
Then, the sample with the largest\n# average distance to the *k* nearest-neighbors are selected.\n\n# %%\nfig, ax = plt.subplots(figsize=(8.5, 8.5))\nax.scatter(\n    X_minority[:, 0],\n    X_minority[:, 1],\n    label=\"Minority class\",\n    s=200,\n    marker=\"_\",\n)\nax.scatter(\n    X_majority[:, 0],\n    X_majority[:, 1],\n    label=\"Majority class\",\n    s=200,\n    marker=\"+\",\n)\n\nnearest_neighbors = NearestNeighbors(n_neighbors=3)\nnearest_neighbors.fit(X_majority)\n\n# select only the majority point of interest\nselected_idx = nearest_neighbors.kneighbors(X_minority, return_distance=False)\nX_majority = X_majority[np.unique(selected_idx), :]\nax.scatter(\n    X_majority[:, 0],\n    X_majority[:, 1],\n    label=\"Short-listed samples\",\n    s=200,\n    alpha=0.3,\n    color=\"g\",\n)\nnearest_neighbors = NearestNeighbors(n_neighbors=3)\nnearest_neighbors.fit(X_minority)\ndist, ind = nearest_neighbors.kneighbors(X_majority[:2, :])\ndist_avg = dist.sum(axis=1) / 3\n\nfor positive_idx, (neighbors, distance, color) in enumerate(\n    zip(ind, dist_avg, [\"r\", \"g\"])\n):\n    for make_plot, sample_idx in enumerate(neighbors):\n        ax.plot(\n            [X_majority[positive_idx, 0], X_minority[sample_idx, 0]],\n            [X_majority[positive_idx, 1], X_minority[sample_idx, 1]],\n            \"--\" + color,\n            alpha=0.3,\n            label=f\"Avg. dist.={distance:.2f}\" if make_plot == 0 else \"\",\n        )\nax.set_title(\"NearMiss-3\")\nmake_plot_despine(ax)\nplt.tight_layout()\nplt.show()\n"
  },
  {
    "path": "examples/under-sampling/plot_illustration_tomek_links.py",
    "content": "\"\"\"\n==============================================\nIllustration of the definition of a Tomek link\n==============================================\n\nThis example illustrates what is a Tomek link.\n\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# License: MIT\n\n# %%\nprint(__doc__)\n\nimport matplotlib.pyplot as plt\nimport seaborn as sns\n\nsns.set_context(\"poster\")\n\n# %% [markdown]\n# This function allows to make nice plotting\n\n# %%\n\n\ndef make_plot_despine(ax):\n    sns.despine(ax=ax, offset=10)\n    ax.set_xlim([0, 3])\n    ax.set_ylim([0, 3])\n    ax.set_xlabel(r\"$X_1$\")\n    ax.set_ylabel(r\"$X_2$\")\n    ax.legend(loc=\"lower right\")\n\n\n# %% [markdown]\n# We will generate some toy data that illustrates how\n# :class:`~imblearn.under_sampling.TomekLinks` is used to clean a dataset.\n\n# %%\nimport numpy as np\n\nrng = np.random.RandomState(18)\n\nX_minority = np.transpose(\n    [[1.1, 1.3, 1.15, 0.8, 0.55, 2.1], [1.0, 1.5, 1.7, 2.5, 0.55, 1.9]]\n)\nX_majority = np.transpose(\n    [\n        [2.1, 2.12, 2.13, 2.14, 2.2, 2.3, 2.5, 2.45],\n        [1.5, 2.1, 2.7, 0.9, 1.0, 1.4, 2.4, 2.9],\n    ]\n)\n\n# %% [markdown]\n# In the figure above, the samples highlighted in green form a Tomek link since\n# they are of different classes and are nearest neighbors of each other.\n\nfig, ax = plt.subplots(figsize=(8, 8))\nax.scatter(\n    X_minority[:, 0],\n    X_minority[:, 1],\n    label=\"Minority class\",\n    s=200,\n    marker=\"_\",\n)\nax.scatter(\n    X_majority[:, 0],\n    X_majority[:, 1],\n    label=\"Majority class\",\n    s=200,\n    marker=\"+\",\n)\n\n# highlight the samples of interest\nax.scatter(\n    [X_minority[-1, 0], X_majority[1, 0]],\n    [X_minority[-1, 1], X_majority[1, 1]],\n    label=\"Tomek link\",\n    s=200,\n    alpha=0.3,\n)\nmake_plot_despine(ax)\nfig.suptitle(\"Illustration of a Tomek link\")\nfig.tight_layout()\n\n# %% [markdown]\n# We can run the 
:class:`~imblearn.under_sampling.TomekLinks` sampling to\n# remove the corresponding samples. If `sampling_strategy='auto'` only the\n# sample from the majority class will be removed. If `sampling_strategy='all'`\n# both samples will be removed.\n\n# %%\nfrom imblearn.under_sampling import TomekLinks\n\nfig, axs = plt.subplots(nrows=1, ncols=2, figsize=(16, 8))\n\nsamplers = {\n    \"Removing only majority samples\": TomekLinks(sampling_strategy=\"auto\"),\n    \"Removing all samples\": TomekLinks(sampling_strategy=\"all\"),\n}\n\nfor ax, (title, sampler) in zip(axs, samplers.items()):\n    X_res, y_res = sampler.fit_resample(\n        np.vstack((X_minority, X_majority)),\n        np.array([0] * X_minority.shape[0] + [1] * X_majority.shape[0]),\n    )\n    ax.scatter(\n        X_res[y_res == 0][:, 0],\n        X_res[y_res == 0][:, 1],\n        label=\"Minority class\",\n        s=200,\n        marker=\"_\",\n    )\n    ax.scatter(\n        X_res[y_res == 1][:, 0],\n        X_res[y_res == 1][:, 1],\n        label=\"Majority class\",\n        s=200,\n        marker=\"+\",\n    )\n\n    # highlight the samples of interest\n    ax.scatter(\n        [X_minority[-1, 0], X_majority[1, 0]],\n        [X_minority[-1, 1], X_majority[1, 1]],\n        label=\"Tomek link\",\n        s=200,\n        alpha=0.3,\n    )\n\n    ax.set_title(title)\n    make_plot_despine(ax)\nfig.tight_layout()\n\nplt.show()\n"
  },
  {
    "path": "imblearn/VERSION.txt",
    "content": "0.15.dev0\n"
  },
  {
    "path": "imblearn/__init__.py",
    "content": "\"\"\"Toolbox for imbalanced dataset in machine learning.\n\n``imbalanced-learn`` is a set of python methods to deal with imbalanced\ndatset in machine learning and pattern recognition.\n\nSubpackages\n-----------\ncombine\n    Module which provides methods based on over-sampling and under-sampling.\nensemble\n    Module which provides methods generating an ensemble of\n    under-sampled subsets.\nexceptions\n    Module including custom warnings and error classes used across\n    imbalanced-learn.\nkeras\n    Module which provides custom generator, layers for deep learning using\n    keras.\nmetrics\n    Module which provides metrics to quantified the classification performance\n    with imbalanced dataset.\nmodel_selection\n    Module which provides methods to split the dataset into training and test sets.\nover_sampling\n    Module which provides methods to over-sample a dataset.\ntensorflow\n    Module which provides custom generator, layers for deep learning using\n    tensorflow.\nunder-sampling\n    Module which provides methods to under-sample a dataset.\nutils\n    Module including various utilities.\npipeline\n    Module which allowing to create pipeline with scikit-learn estimators.\n\"\"\"\nimport importlib\nimport sys\nimport types\n\ntry:\n    # This variable is injected in the __builtins__ by the build\n    # process. It is used to enable importing subpackages of sklearn when\n    # the binaries are not built\n    # mypy error: Cannot determine type of '__SKLEARN_SETUP__'\n    __IMBLEARN_SETUP__  # type: ignore\nexcept NameError:\n    __IMBLEARN_SETUP__ = False\n\nif __IMBLEARN_SETUP__:\n    sys.stderr.write(\"Partial import of imblearn during the build process.\\n\")\n    # We are not importing the rest of scikit-learn during the build\n    # process, as it may not be compiled yet\nelse:\n    from . 
import (\n        combine,\n        ensemble,\n        exceptions,\n        metrics,\n        model_selection,\n        over_sampling,\n        pipeline,\n        tensorflow,\n        under_sampling,\n        utils,\n    )\n    from ._version import __version__\n    from .base import FunctionSampler\n    from .utils._show_versions import show_versions  # noqa: F401\n\n    # FIXME: When we get Python 3.7 as minimal version, we will need to switch to\n    # the following solution:\n    # https://snarky.ca/lazy-importing-in-python-3-7/\n    class LazyLoader(types.ModuleType):\n        \"\"\"Lazily import a module, mainly to avoid pulling in large dependencies.\n\n        Adapted from TensorFlow:\n        https://github.com/tensorflow/tensorflow/blob/master/tensorflow/\n        python/util/lazy_loader.py\n        \"\"\"\n\n        def __init__(self, local_name, parent_module_globals, name, warning=None):\n            self._local_name = local_name\n            self._parent_module_globals = parent_module_globals\n            self._warning = warning\n\n            super().__init__(name)\n\n        def _load(self):\n            \"\"\"Load the module and insert it into the parent's globals.\"\"\"\n            # Import the target module and insert it into the parent's namespace\n            module = importlib.import_module(self.__name__)\n            self._parent_module_globals[self._local_name] = module\n\n            # Update this object's dict so that if someone keeps a reference to the\n            #   LazyLoader, lookups are efficient (__getattr__ is only called on\n            #   lookups that fail).\n            self.__dict__.update(module.__dict__)\n\n            return module\n\n        def __getattr__(self, item):\n            module = self._load()\n            return getattr(module, item)\n\n        def __dir__(self):\n            module = self._load()\n            return dir(module)\n\n    # delay the import of keras since we are going to import either 
tensorflow\n    # or keras\n    keras = LazyLoader(\"keras\", globals(), \"imblearn.keras\")\n\n    __all__ = [\n        \"combine\",\n        \"ensemble\",\n        \"exceptions\",\n        \"keras\",\n        \"metrics\",\n        \"model_selection\",\n        \"over_sampling\",\n        \"tensorflow\",\n        \"under_sampling\",\n        \"utils\",\n        \"pipeline\",\n        \"FunctionSampler\",\n        \"__version__\",\n    ]\n"
  },
  {
    "path": "imblearn/_version.py",
    "content": "\"\"\"\n``imbalanced-learn`` is a set of python methods to deal with imbalanced\ndatset in machine learning and pattern recognition.\n\"\"\"\n# Based on NiLearn package\n# License: simplified BSD\n\n# PEP0440 compatible formatted version, see:\n# https://www.python.org/dev/peps/pep-0440/\n#\n# Generic release markers:\n# X.Y\n# X.Y.Z # For bugfix releases\n#\n# Admissible pre-release markers:\n# X.YaN # Alpha release\n# X.YbN # Beta release\n# X.YrcN # Release Candidate\n# X.Y # Final release\n#\n# Dev branch marker is: 'X.Y.dev' or 'X.Y.devN' where N is an integer.\n# 'X.Y.dev0' is the canonical version of 'X.Y.dev'\n\nfrom pathlib import Path\n\nwith open(Path(__file__).parent / \"VERSION.txt\") as _fh:\n    __version__ = _fh.read().strip()\n"
  },
  {
    "path": "imblearn/base.py",
    "content": "\"\"\"Base class for sampling\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Christos Aridas\n# License: MIT\n\nfrom abc import ABCMeta, abstractmethod\n\nimport numpy as np\nfrom sklearn.base import BaseEstimator, OneToOneFeatureMixin\nfrom sklearn.preprocessing import label_binarize\nfrom sklearn.utils._metadata_requests import METHODS\nfrom sklearn.utils.multiclass import check_classification_targets\nfrom sklearn_compat.base import _fit_context\nfrom sklearn_compat.utils.validation import validate_data\n\nfrom imblearn.utils import check_sampling_strategy, check_target_type\nfrom imblearn.utils._tags import get_tags\nfrom imblearn.utils._validation import ArraysTransformer\n\nif \"fit_predict\" not in METHODS:\n    METHODS.append(\"fit_predict\")\nif \"fit_transform\" not in METHODS:\n    METHODS.append(\"fit_transform\")\nMETHODS.append(\"fit_resample\")\n\ntry:\n    from sklearn.utils._metadata_requests import SIMPLE_METHODS\n\n    SIMPLE_METHODS.append(\"fit_resample\")\nexcept ImportError:\n    # in older versions of scikit-learn, only METHODS is used\n    pass\n\n\nclass SamplerMixin(metaclass=ABCMeta):\n    \"\"\"Mixin class for samplers with abstract method.\n\n    Warning: This class should not be used directly. 
Use the derive classes\n    instead.\n    \"\"\"\n\n    _estimator_type = \"sampler\"\n\n    @_fit_context(prefer_skip_nested_validation=True)\n    def fit(self, X, y, **params):\n        \"\"\"Check inputs and statistics of the sampler.\n\n        You should use ``fit_resample`` in all cases.\n\n        Parameters\n        ----------\n        X : {array-like, dataframe, sparse matrix} of shape \\\n                (n_samples, n_features)\n            Data array.\n\n        y : array-like of shape (n_samples,)\n            Target array.\n\n        **params : dict\n            Extra parameters to use by the sampler.\n\n        Returns\n        -------\n        self : object\n            Return the instance itself.\n        \"\"\"\n        X, y, _ = self._check_X_y(X, y)\n        self.sampling_strategy_ = check_sampling_strategy(\n            self.sampling_strategy, y, self._sampling_type\n        )\n        return self\n\n    @_fit_context(prefer_skip_nested_validation=True)\n    def fit_resample(self, X, y, **params):\n        \"\"\"Resample the dataset.\n\n        Parameters\n        ----------\n        X : {array-like, dataframe, sparse matrix} of shape \\\n                (n_samples, n_features)\n            Matrix containing the data which have to be sampled.\n\n        y : array-like of shape (n_samples,)\n            Corresponding label for each sample in X.\n\n        **params : dict\n            Extra parameters to use by the sampler.\n\n        Returns\n        -------\n        X_resampled : {array-like, dataframe, sparse matrix} of shape \\\n                (n_samples_new, n_features)\n            The array containing the resampled data.\n\n        y_resampled : array-like of shape (n_samples_new,)\n            The corresponding label of `X_resampled`.\n        \"\"\"\n        check_classification_targets(y)\n        arrays_transformer = ArraysTransformer(X, y)\n        X, y, binarize_y = self._check_X_y(X, y)\n\n        self.sampling_strategy_ = 
check_sampling_strategy(\n            self.sampling_strategy, y, self._sampling_type\n        )\n\n        output = self._fit_resample(X, y, **params)\n\n        y_ = (\n            label_binarize(output[1], classes=np.unique(y)) if binarize_y else output[1]\n        )\n\n        X_, y_ = arrays_transformer.transform(output[0], y_)\n        return (X_, y_) if len(output) == 2 else (X_, y_, output[2])\n\n    @abstractmethod\n    def _fit_resample(self, X, y, **params):\n        \"\"\"Base method defined in each sampler to defined the sampling\n        strategy.\n\n        Parameters\n        ----------\n        X : {array-like, sparse matrix} of shape (n_samples, n_features)\n            Matrix containing the data which have to be sampled.\n\n        y : array-like of shape (n_samples,)\n            Corresponding label for each sample in X.\n\n        **params : dict\n            Extra parameters to use by the sampler.\n\n        Returns\n        -------\n        X_resampled : {ndarray, sparse matrix} of shape \\\n                (n_samples_new, n_features)\n            The array containing the resampled data.\n\n        y_resampled : ndarray of shape (n_samples_new,)\n            The corresponding label of `X_resampled`.\n\n        \"\"\"\n        pass\n\n\nclass BaseSampler(SamplerMixin, OneToOneFeatureMixin, BaseEstimator):\n    \"\"\"Base class for sampling algorithms.\n\n    Warning: This class should not be used directly. 
Use the derive classes\n    instead.\n    \"\"\"\n\n    def __init__(self, sampling_strategy=\"auto\"):\n        self.sampling_strategy = sampling_strategy\n\n    def _check_X_y(self, X, y, accept_sparse=None):\n        if accept_sparse is None:\n            accept_sparse = [\"csr\", \"csc\"]\n        y, binarize_y = check_target_type(y, indicate_one_vs_all=True)\n        X, y = validate_data(self, X=X, y=y, reset=True, accept_sparse=accept_sparse)\n        return X, y, binarize_y\n\n    def fit(self, X, y, **params):\n        \"\"\"Check inputs and statistics of the sampler.\n\n        You should use ``fit_resample`` in all cases.\n\n        Parameters\n        ----------\n        X : {array-like, dataframe, sparse matrix} of shape \\\n                (n_samples, n_features)\n            Data array.\n\n        y : array-like of shape (n_samples,)\n            Target array.\n\n        Returns\n        -------\n        self : object\n            Return the instance itself.\n        \"\"\"\n        return super().fit(X, y, **params)\n\n    def fit_resample(self, X, y, **params):\n        \"\"\"Resample the dataset.\n\n        Parameters\n        ----------\n        X : {array-like, dataframe, sparse matrix} of shape \\\n                (n_samples, n_features)\n            Matrix containing the data which have to be sampled.\n\n        y : array-like of shape (n_samples,)\n            Corresponding label for each sample in X.\n\n        Returns\n        -------\n        X_resampled : {array-like, dataframe, sparse matrix} of shape \\\n                (n_samples_new, n_features)\n            The array containing the resampled data.\n\n        y_resampled : array-like of shape (n_samples_new,)\n            The corresponding label of `X_resampled`.\n        \"\"\"\n        return super().fit_resample(X, y, **params)\n\n    def _more_tags(self):\n        return {\"X_types\": [\"2darray\", \"sparse\", \"dataframe\"]}\n\n    def __sklearn_tags__(self):\n        from 
sklearn_compat.utils._tags import TargetTags\n\n        from imblearn.utils._tags import InputTags, SamplerTags, Tags\n\n        tags = Tags(\n            estimator_type=\"sampler\",\n            target_tags=TargetTags(required=True),\n            transformer_tags=None,\n            regressor_tags=None,\n            classifier_tags=None,\n            sampler_tags=SamplerTags(),\n        )\n        tags.input_tags = InputTags()\n        tags.input_tags.two_d_array = True\n        tags.input_tags.sparse = True\n        tags.input_tags.dataframe = True\n        return tags\n\n\ndef _identity(X, y):\n    return X, y\n\n\ndef is_sampler(estimator):\n    \"\"\"Return True if the given estimator is a sampler, False otherwise.\n\n    Parameters\n    ----------\n    estimator : object\n        Estimator to test.\n\n    Returns\n    -------\n    is_sampler : bool\n        True if estimator is a sampler, otherwise False.\n    \"\"\"\n\n    if hasattr(estimator, \"_estimator_type\") and estimator._estimator_type == \"sampler\":\n        return True\n    tags = get_tags(estimator)\n    if hasattr(tags, \"sampler_tags\") and tags.sampler_tags is not None:\n        return True\n    return False\n\n\nclass FunctionSampler(BaseSampler):\n    \"\"\"Construct a sampler from calling an arbitrary callable.\n\n    Read more in the :ref:`User Guide <function_sampler>`.\n\n    Parameters\n    ----------\n    func : callable, default=None\n        The callable to use for the transformation. This will be passed the\n        same arguments as transform, with args and kwargs forwarded. If func is\n        None, then func will be the identity function.\n\n    accept_sparse : bool, default=True\n        Whether sparse input are supported. By default, sparse inputs are\n        supported.\n\n    kw_args : dict, default=None\n        The keyword argument expected by ``func``.\n\n    validate : bool, default=True\n        Whether or not to bypass the validation of ``X`` and ``y``. 
Turning-off\n        validation allows to use the ``FunctionSampler`` with any type of\n        data.\n\n        .. versionadded:: 0.6\n\n    Attributes\n    ----------\n    sampling_strategy_ : dict\n        Dictionary containing the information to sample the dataset. The keys\n        corresponds to the class labels from which to sample and the values\n        are the number of samples to sample.\n\n    n_features_in_ : int\n        Number of features in the input dataset.\n\n        .. versionadded:: 0.9\n\n    feature_names_in_ : ndarray of shape (`n_features_in_`,)\n        Names of features seen during `fit`. Defined only when `X` has feature\n        names that are all strings.\n\n        .. versionadded:: 0.10\n\n    See Also\n    --------\n    sklearn.preprocessing.FunctionTransfomer : Stateless transformer.\n\n    Notes\n    -----\n    See\n    :ref:`sphx_glr_auto_examples_applications_plot_outlier_rejections.py`\n\n    Examples\n    --------\n    >>> import numpy as np\n    >>> from sklearn.datasets import make_classification\n    >>> from imblearn import FunctionSampler\n    >>> X, y = make_classification(n_classes=2, class_sep=2,\n    ... weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,\n    ... n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)\n\n    We can create to select only the first ten samples for instance.\n\n    >>> def func(X, y):\n    ...   return X[:10], y[:10]\n    >>> sampler = FunctionSampler(func=func)\n    >>> X_res, y_res = sampler.fit_resample(X, y)\n    >>> np.all(X_res == X[:10])\n    True\n    >>> np.all(y_res == y[:10])\n    True\n\n    We can also create a specific function which take some arguments.\n\n    >>> from collections import Counter\n    >>> from imblearn.under_sampling import RandomUnderSampler\n    >>> def func(X, y, sampling_strategy, random_state):\n    ...   return RandomUnderSampler(\n    ...       sampling_strategy=sampling_strategy,\n    ...       
random_state=random_state).fit_resample(X, y)\n    >>> sampler = FunctionSampler(func=func,\n    ...                           kw_args={'sampling_strategy': 'auto',\n    ...                                    'random_state': 0})\n    >>> X_res, y_res = sampler.fit_resample(X, y)\n    >>> print(f'Resampled dataset shape {sorted(Counter(y_res).items())}')\n    Resampled dataset shape [(0, 100), (1, 100)]\n    \"\"\"\n\n    _sampling_type = \"bypass\"\n\n    _parameter_constraints: dict = {\n        \"func\": [callable, None],\n        \"accept_sparse\": [\"boolean\"],\n        \"kw_args\": [dict, None],\n        \"validate\": [\"boolean\"],\n    }\n\n    def __init__(self, *, func=None, accept_sparse=True, kw_args=None, validate=True):\n        super().__init__()\n        self.func = func\n        self.accept_sparse = accept_sparse\n        self.kw_args = kw_args\n        self.validate = validate\n\n    def fit(self, X, y):\n        \"\"\"Check inputs and statistics of the sampler.\n\n        You should use ``fit_resample`` in all cases.\n\n        Parameters\n        ----------\n        X : {array-like, dataframe, sparse matrix} of shape \\\n                (n_samples, n_features)\n            Data array.\n\n        y : array-like of shape (n_samples,)\n            Target array.\n\n        Returns\n        -------\n        self : object\n            Return the instance itself.\n        \"\"\"\n        self._validate_params()\n        # we need to overwrite SamplerMixin.fit to bypass the validation\n        if self.validate:\n            check_classification_targets(y)\n            X, y, _ = self._check_X_y(X, y, accept_sparse=self.accept_sparse)\n\n        self.sampling_strategy_ = check_sampling_strategy(\n            self.sampling_strategy, y, self._sampling_type\n        )\n\n        return self\n\n    def fit_resample(self, X, y):\n        \"\"\"Resample the dataset.\n\n        Parameters\n        ----------\n        X : {array-like, sparse matrix} of shape 
(n_samples, n_features)\n            Matrix containing the data which have to be sampled.\n\n        y : array-like of shape (n_samples,)\n            Corresponding label for each sample in X.\n\n        Returns\n        -------\n        X_resampled : {array-like, sparse matrix} of shape \\\n                (n_samples_new, n_features)\n            The array containing the resampled data.\n\n        y_resampled : array-like of shape (n_samples_new,)\n            The corresponding label of `X_resampled`.\n        \"\"\"\n        self._validate_params()\n        arrays_transformer = ArraysTransformer(X, y)\n\n        if self.validate:\n            check_classification_targets(y)\n            X, y, binarize_y = self._check_X_y(X, y, accept_sparse=self.accept_sparse)\n\n        self.sampling_strategy_ = check_sampling_strategy(\n            self.sampling_strategy, y, self._sampling_type\n        )\n\n        output = self._fit_resample(X, y)\n\n        if self.validate:\n            y_ = (\n                label_binarize(output[1], classes=np.unique(y))\n                if binarize_y\n                else output[1]\n            )\n            X_, y_ = arrays_transformer.transform(output[0], y_)\n            return (X_, y_) if len(output) == 2 else (X_, y_, output[2])\n\n        return output\n\n    def _fit_resample(self, X, y):\n        func = _identity if self.func is None else self.func\n        output = func(X, y, **(self.kw_args if self.kw_args else {}))\n        return output\n"
  },
  {
    "path": "imblearn/combine/__init__.py",
    "content": "\"\"\"The :mod:`imblearn.combine` provides methods which combine\nover-sampling and under-sampling.\n\"\"\"\n\nfrom imblearn.combine._smote_enn import SMOTEENN\nfrom imblearn.combine._smote_tomek import SMOTETomek\n\n__all__ = [\"SMOTEENN\", \"SMOTETomek\"]\n"
  },
  {
    "path": "imblearn/combine/_smote_enn.py",
    "content": "\"\"\"Class to perform over-sampling using SMOTE and cleaning using ENN.\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Christos Aridas\n# License: MIT\n\nimport numbers\n\nfrom sklearn.base import clone\nfrom sklearn.utils import check_X_y\n\nfrom imblearn.base import BaseSampler\nfrom imblearn.over_sampling import SMOTE\nfrom imblearn.over_sampling.base import BaseOverSampler\nfrom imblearn.under_sampling import EditedNearestNeighbours\nfrom imblearn.utils import Substitution, check_target_type\nfrom imblearn.utils._docstring import _n_jobs_docstring, _random_state_docstring\n\n\n@Substitution(\n    sampling_strategy=BaseOverSampler._sampling_strategy_docstring,\n    n_jobs=_n_jobs_docstring,\n    random_state=_random_state_docstring,\n)\nclass SMOTEENN(BaseSampler):\n    \"\"\"Over-sampling using SMOTE and cleaning using ENN.\n\n    Combine over- and under-sampling using SMOTE and Edited Nearest Neighbours.\n\n    Read more in the :ref:`User Guide <combine>`.\n\n    Parameters\n    ----------\n    {sampling_strategy}\n\n    {random_state}\n\n    smote : sampler object, default=None\n        The :class:`~imblearn.over_sampling.SMOTE` object to use. If not given,\n        a :class:`~imblearn.over_sampling.SMOTE` object with default parameters\n        will be given.\n\n    enn : sampler object, default=None\n        The :class:`~imblearn.under_sampling.EditedNearestNeighbours` object\n        to use. If not given, a\n        :class:`~imblearn.under_sampling.EditedNearestNeighbours` object with\n        sampling strategy='all' will be given.\n\n    {n_jobs}\n\n    Attributes\n    ----------\n    sampling_strategy_ : dict\n        Dictionary containing the information to sample the dataset. 
The keys\n        corresponds to the class labels from which to sample and the values\n        are the number of samples to sample.\n\n    smote_ : sampler object\n        The validated :class:`~imblearn.over_sampling.SMOTE` instance.\n\n    enn_ : sampler object\n        The validated :class:`~imblearn.under_sampling.EditedNearestNeighbours`\n        instance.\n\n    n_features_in_ : int\n        Number of features in the input dataset.\n\n        .. versionadded:: 0.9\n\n    feature_names_in_ : ndarray of shape (`n_features_in_`,)\n        Names of features seen during `fit`. Defined only when `X` has feature\n        names that are all strings.\n\n        .. versionadded:: 0.10\n\n    See Also\n    --------\n    SMOTETomek : Over-sample using SMOTE followed by under-sampling removing\n        the Tomek's links.\n\n    Notes\n    -----\n    The method is presented in [1]_.\n\n    Supports multi-class resampling. Refer to SMOTE and ENN regarding the\n    scheme which used.\n\n    References\n    ----------\n    .. [1] G. Batista, R. C. Prati, M. C. Monard. \"A study of the behavior of\n       several methods for balancing machine learning training data,\" ACM\n       Sigkdd Explorations Newsletter 6 (1), 20-29, 2004.\n\n    Examples\n    --------\n\n    >>> from collections import Counter\n    >>> from sklearn.datasets import make_classification\n    >>> from imblearn.combine import SMOTEENN\n    >>> X, y = make_classification(n_classes=2, class_sep=2,\n    ... weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,\n    ... 
n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)\n    >>> print('Original dataset shape %s' % Counter(y))\n    Original dataset shape Counter({{1: 900, 0: 100}})\n    >>> sme = SMOTEENN(random_state=42)\n    >>> X_res, y_res = sme.fit_resample(X, y)\n    >>> print('Resampled dataset shape %s' % Counter(y_res))\n    Resampled dataset shape Counter({{0: 900, 1: 881}})\n    \"\"\"\n\n    _sampling_type = \"over-sampling\"\n\n    _parameter_constraints: dict = {\n        **BaseOverSampler._parameter_constraints,\n        \"smote\": [SMOTE, None],\n        \"enn\": [EditedNearestNeighbours, None],\n        \"n_jobs\": [numbers.Integral, None],\n    }\n\n    def __init__(\n        self,\n        *,\n        sampling_strategy=\"auto\",\n        random_state=None,\n        smote=None,\n        enn=None,\n        n_jobs=None,\n    ):\n        super().__init__()\n        self.sampling_strategy = sampling_strategy\n        self.random_state = random_state\n        self.smote = smote\n        self.enn = enn\n        self.n_jobs = n_jobs\n\n    def _validate_estimator(self):\n        \"Private function to validate SMOTE and ENN objects\"\n        if self.smote is not None:\n            self.smote_ = clone(self.smote)\n        else:\n            self.smote_ = SMOTE(\n                sampling_strategy=self.sampling_strategy,\n                random_state=self.random_state,\n            )\n\n        if self.enn is not None:\n            self.enn_ = clone(self.enn)\n        else:\n            self.enn_ = EditedNearestNeighbours(\n                sampling_strategy=\"all\", n_jobs=self.n_jobs\n            )\n\n    def _fit_resample(self, X, y):\n        self._validate_estimator()\n        y = check_target_type(y)\n        X, y = check_X_y(X, y, accept_sparse=[\"csr\", \"csc\"])\n        self.sampling_strategy_ = self.sampling_strategy\n\n        X_res, y_res = self.smote_.fit_resample(X, y)\n        return self.enn_.fit_resample(X_res, y_res)\n"
  },
  {
    "path": "imblearn/combine/_smote_tomek.py",
    "content": "\"\"\"Class to perform over-sampling using SMOTE and cleaning using Tomek\nlinks.\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Christos Aridas\n# License: MIT\n\nimport numbers\n\nfrom sklearn.base import clone\nfrom sklearn.utils import check_X_y\n\nfrom imblearn.base import BaseSampler\nfrom imblearn.over_sampling import SMOTE\nfrom imblearn.over_sampling.base import BaseOverSampler\nfrom imblearn.under_sampling import TomekLinks\nfrom imblearn.utils import Substitution, check_target_type\nfrom imblearn.utils._docstring import _n_jobs_docstring, _random_state_docstring\n\n\n@Substitution(\n    sampling_strategy=BaseOverSampler._sampling_strategy_docstring,\n    n_jobs=_n_jobs_docstring,\n    random_state=_random_state_docstring,\n)\nclass SMOTETomek(BaseSampler):\n    \"\"\"Over-sampling using SMOTE and cleaning using Tomek links.\n\n    Combine over- and under-sampling using SMOTE and Tomek links.\n\n    Read more in the :ref:`User Guide <combine>`.\n\n    Parameters\n    ----------\n    {sampling_strategy}\n\n    {random_state}\n\n    smote : sampler object, default=None\n        The :class:`~imblearn.over_sampling.SMOTE` object to use. If not given,\n        a :class:`~imblearn.over_sampling.SMOTE` object with default parameters\n        will be given.\n\n    tomek : sampler object, default=None\n        The :class:`~imblearn.under_sampling.TomekLinks` object to use. If not\n        given, a :class:`~imblearn.under_sampling.TomekLinks` object with\n        sampling strategy='all' will be given.\n\n    {n_jobs}\n\n    Attributes\n    ----------\n    sampling_strategy_ : dict\n        Dictionary containing the information to sample the dataset. 
The keys\n        corresponds to the class labels from which to sample and the values\n        are the number of samples to sample.\n\n    smote_ : sampler object\n        The validated :class:`~imblearn.over_sampling.SMOTE` instance.\n\n    tomek_ : sampler object\n        The validated :class:`~imblearn.under_sampling.TomekLinks` instance.\n\n    n_features_in_ : int\n        Number of features in the input dataset.\n\n        .. versionadded:: 0.9\n\n    feature_names_in_ : ndarray of shape (`n_features_in_`,)\n        Names of features seen during `fit`. Defined only when `X` has feature\n        names that are all strings.\n\n        .. versionadded:: 0.10\n\n    See Also\n    --------\n    SMOTEENN : Over-sample using SMOTE followed by under-sampling using Edited\n        Nearest Neighbours.\n\n    Notes\n    -----\n    The method is presented in [1]_.\n\n    Supports multi-class resampling. Refer to SMOTE and TomekLinks regarding\n    the scheme which used.\n\n    References\n    ----------\n    .. [1] G. Batista, B. Bazzan, M. Monard, \"Balancing Training Data for\n       Automated Annotation of Keywords: a Case Study,\" In WOB, 10-18, 2003.\n\n    Examples\n    --------\n\n    >>> from collections import Counter\n    >>> from sklearn.datasets import make_classification\n    >>> from imblearn.combine import SMOTETomek\n    >>> X, y = make_classification(n_classes=2, class_sep=2,\n    ... weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,\n    ... 
n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)\n    >>> print('Original dataset shape %s' % Counter(y))\n    Original dataset shape Counter({{1: 900, 0: 100}})\n    >>> smt = SMOTETomek(random_state=42)\n    >>> X_res, y_res = smt.fit_resample(X, y)\n    >>> print('Resampled dataset shape %s' % Counter(y_res))\n    Resampled dataset shape Counter({{0: 900, 1: 900}})\n    \"\"\"\n\n    _sampling_type = \"over-sampling\"\n\n    _parameter_constraints: dict = {\n        **BaseOverSampler._parameter_constraints,\n        \"smote\": [SMOTE, None],\n        \"tomek\": [TomekLinks, None],\n        \"n_jobs\": [numbers.Integral, None],\n    }\n\n    def __init__(\n        self,\n        *,\n        sampling_strategy=\"auto\",\n        random_state=None,\n        smote=None,\n        tomek=None,\n        n_jobs=None,\n    ):\n        super().__init__()\n        self.sampling_strategy = sampling_strategy\n        self.random_state = random_state\n        self.smote = smote\n        self.tomek = tomek\n        self.n_jobs = n_jobs\n\n    def _validate_estimator(self):\n        \"Private function to validate SMOTE and TomekLinks objects\"\n\n        if self.smote is not None:\n            self.smote_ = clone(self.smote)\n        else:\n            self.smote_ = SMOTE(\n                sampling_strategy=self.sampling_strategy,\n                random_state=self.random_state,\n            )\n\n        if self.tomek is not None:\n            self.tomek_ = clone(self.tomek)\n        else:\n            self.tomek_ = TomekLinks(sampling_strategy=\"all\", n_jobs=self.n_jobs)\n\n    def _fit_resample(self, X, y):\n        self._validate_estimator()\n        y = check_target_type(y)\n        X, y = check_X_y(X, y, accept_sparse=[\"csr\", \"csc\"])\n        self.sampling_strategy_ = self.sampling_strategy\n\n        X_res, y_res = self.smote_.fit_resample(X, y)\n        return self.tomek_.fit_resample(X_res, y_res)\n"
  },
  {
    "path": "imblearn/combine/tests/__init__.py",
    "content": ""
  },
  {
    "path": "imblearn/combine/tests/test_smote_enn.py",
    "content": "\"\"\"Test the module SMOTE ENN.\"\"\"\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Christos Aridas\n# License: MIT\n\nimport numpy as np\nfrom sklearn.utils._testing import assert_allclose, assert_array_equal\n\nfrom imblearn.combine import SMOTEENN\nfrom imblearn.over_sampling import SMOTE\nfrom imblearn.under_sampling import EditedNearestNeighbours\n\nRND_SEED = 0\nX = np.array(\n    [\n        [0.11622591, -0.0317206],\n        [0.77481731, 0.60935141],\n        [1.25192108, -0.22367336],\n        [0.53366841, -0.30312976],\n        [1.52091956, -0.49283504],\n        [-0.28162401, -2.10400981],\n        [0.83680821, 1.72827342],\n        [0.3084254, 0.33299982],\n        [0.70472253, -0.73309052],\n        [0.28893132, -0.38761769],\n        [1.15514042, 0.0129463],\n        [0.88407872, 0.35454207],\n        [1.31301027, -0.92648734],\n        [-1.11515198, -0.93689695],\n        [-0.18410027, -0.45194484],\n        [0.9281014, 0.53085498],\n        [-0.14374509, 0.27370049],\n        [-0.41635887, -0.38299653],\n        [0.08711622, 0.93259929],\n        [1.70580611, -0.11219234],\n    ]\n)\nY = np.array([0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0])\nR_TOL = 1e-4\n\n\ndef test_sample_regular():\n    smote = SMOTEENN(random_state=RND_SEED)\n    X_resampled, y_resampled = smote.fit_resample(X, Y)\n\n    X_gt = np.array(\n        [\n            [1.52091956, -0.49283504],\n            [0.84976473, -0.15570176],\n            [0.61319159, -0.11571667],\n            [0.66052536, -0.28246518],\n            [-0.28162401, -2.10400981],\n            [0.83680821, 1.72827342],\n            [0.08711622, 0.93259929],\n        ]\n    )\n    y_gt = np.array([0, 0, 0, 0, 1, 1, 1])\n    assert_allclose(X_resampled, X_gt, rtol=R_TOL)\n    assert_array_equal(y_resampled, y_gt)\n\n\ndef test_sample_regular_pass_smote_enn():\n    smote = SMOTEENN(\n        smote=SMOTE(sampling_strategy=\"auto\", random_state=RND_SEED),\n       
 enn=EditedNearestNeighbours(sampling_strategy=\"all\"),\n        random_state=RND_SEED,\n    )\n    X_resampled, y_resampled = smote.fit_resample(X, Y)\n\n    X_gt = np.array(\n        [\n            [1.52091956, -0.49283504],\n            [0.84976473, -0.15570176],\n            [0.61319159, -0.11571667],\n            [0.66052536, -0.28246518],\n            [-0.28162401, -2.10400981],\n            [0.83680821, 1.72827342],\n            [0.08711622, 0.93259929],\n        ]\n    )\n    y_gt = np.array([0, 0, 0, 0, 1, 1, 1])\n    assert_allclose(X_resampled, X_gt, rtol=R_TOL)\n    assert_array_equal(y_resampled, y_gt)\n\n\ndef test_sample_regular_half():\n    sampling_strategy = {0: 10, 1: 12}\n    smote = SMOTEENN(sampling_strategy=sampling_strategy, random_state=RND_SEED)\n    X_resampled, y_resampled = smote.fit_resample(X, Y)\n\n    X_gt = np.array(\n        [\n            [1.52091956, -0.49283504],\n            [-0.28162401, -2.10400981],\n            [0.83680821, 1.72827342],\n            [0.08711622, 0.93259929],\n        ]\n    )\n    y_gt = np.array([0, 1, 1, 1])\n    assert_allclose(X_resampled, X_gt)\n    assert_array_equal(y_resampled, y_gt)\n\n\ndef test_validate_estimator_init():\n    smote = SMOTE(random_state=RND_SEED)\n    enn = EditedNearestNeighbours(sampling_strategy=\"all\")\n    smt = SMOTEENN(smote=smote, enn=enn, random_state=RND_SEED)\n    X_resampled, y_resampled = smt.fit_resample(X, Y)\n    X_gt = np.array(\n        [\n            [1.52091956, -0.49283504],\n            [0.84976473, -0.15570176],\n            [0.61319159, -0.11571667],\n            [0.66052536, -0.28246518],\n            [-0.28162401, -2.10400981],\n            [0.83680821, 1.72827342],\n            [0.08711622, 0.93259929],\n        ]\n    )\n    y_gt = np.array([0, 0, 0, 0, 1, 1, 1])\n    assert_allclose(X_resampled, X_gt, rtol=R_TOL)\n    assert_array_equal(y_resampled, y_gt)\n\n\ndef test_validate_estimator_default():\n    smt = SMOTEENN(random_state=RND_SEED)\n    
X_resampled, y_resampled = smt.fit_resample(X, Y)\n    X_gt = np.array(\n        [\n            [1.52091956, -0.49283504],\n            [0.84976473, -0.15570176],\n            [0.61319159, -0.11571667],\n            [0.66052536, -0.28246518],\n            [-0.28162401, -2.10400981],\n            [0.83680821, 1.72827342],\n            [0.08711622, 0.93259929],\n        ]\n    )\n    y_gt = np.array([0, 0, 0, 0, 1, 1, 1])\n    assert_allclose(X_resampled, X_gt, rtol=R_TOL)\n    assert_array_equal(y_resampled, y_gt)\n\n\ndef test_parallelisation():\n    # Check if default job count is None\n    smt = SMOTEENN(random_state=RND_SEED)\n    smt._validate_estimator()\n    assert smt.n_jobs is None\n    assert smt.enn_.n_jobs is None\n\n    # Check if job count is set\n    smt = SMOTEENN(random_state=RND_SEED, n_jobs=8)\n    smt._validate_estimator()\n    assert smt.n_jobs == 8\n    assert smt.enn_.n_jobs == 8\n"
  },
  {
    "path": "imblearn/combine/tests/test_smote_tomek.py",
    "content": "\"\"\"Test the module SMOTE ENN.\"\"\"\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Christos Aridas\n# License: MIT\n\nimport numpy as np\nfrom sklearn.utils._testing import assert_allclose, assert_array_equal\n\nfrom imblearn.combine import SMOTETomek\nfrom imblearn.over_sampling import SMOTE\nfrom imblearn.under_sampling import TomekLinks\n\nRND_SEED = 0\nX = np.array(\n    [\n        [0.20622591, 0.0582794],\n        [0.68481731, 0.51935141],\n        [1.34192108, -0.13367336],\n        [0.62366841, -0.21312976],\n        [1.61091956, -0.40283504],\n        [-0.37162401, -2.19400981],\n        [0.74680821, 1.63827342],\n        [0.2184254, 0.24299982],\n        [0.61472253, -0.82309052],\n        [0.19893132, -0.47761769],\n        [1.06514042, -0.0770537],\n        [0.97407872, 0.44454207],\n        [1.40301027, -0.83648734],\n        [-1.20515198, -1.02689695],\n        [-0.27410027, -0.54194484],\n        [0.8381014, 0.44085498],\n        [-0.23374509, 0.18370049],\n        [-0.32635887, -0.29299653],\n        [-0.00288378, 0.84259929],\n        [1.79580611, -0.02219234],\n    ]\n)\nY = np.array([0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0])\nR_TOL = 1e-4\n\n\ndef test_sample_regular():\n    smote = SMOTETomek(random_state=RND_SEED)\n    X_resampled, y_resampled = smote.fit_resample(X, Y)\n    X_gt = np.array(\n        [\n            [0.68481731, 0.51935141],\n            [1.34192108, -0.13367336],\n            [0.62366841, -0.21312976],\n            [1.61091956, -0.40283504],\n            [-0.37162401, -2.19400981],\n            [0.74680821, 1.63827342],\n            [0.61472253, -0.82309052],\n            [0.19893132, -0.47761769],\n            [1.40301027, -0.83648734],\n            [-1.20515198, -1.02689695],\n            [-0.23374509, 0.18370049],\n            [-0.00288378, 0.84259929],\n            [1.79580611, -0.02219234],\n            [0.38307743, -0.05670439],\n            [0.70319159, 
-0.02571667],\n            [0.75052536, -0.19246518],\n        ]\n    )\n    y_gt = np.array([1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0])\n    assert_allclose(X_resampled, X_gt, rtol=R_TOL)\n    assert_array_equal(y_resampled, y_gt)\n\n\ndef test_sample_regular_half():\n    sampling_strategy = {0: 9, 1: 12}\n    smote = SMOTETomek(sampling_strategy=sampling_strategy, random_state=RND_SEED)\n    X_resampled, y_resampled = smote.fit_resample(X, Y)\n    X_gt = np.array(\n        [\n            [0.68481731, 0.51935141],\n            [0.62366841, -0.21312976],\n            [1.61091956, -0.40283504],\n            [-0.37162401, -2.19400981],\n            [0.74680821, 1.63827342],\n            [0.61472253, -0.82309052],\n            [0.19893132, -0.47761769],\n            [1.40301027, -0.83648734],\n            [-1.20515198, -1.02689695],\n            [-0.23374509, 0.18370049],\n            [-0.00288378, 0.84259929],\n            [1.79580611, -0.02219234],\n            [0.45784496, -0.1053161],\n        ]\n    )\n    y_gt = np.array([1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0])\n    assert_allclose(X_resampled, X_gt, rtol=R_TOL)\n    assert_array_equal(y_resampled, y_gt)\n\n\ndef test_validate_estimator_init():\n    smote = SMOTE(random_state=RND_SEED)\n    tomek = TomekLinks(sampling_strategy=\"all\")\n    smt = SMOTETomek(smote=smote, tomek=tomek, random_state=RND_SEED)\n    X_resampled, y_resampled = smt.fit_resample(X, Y)\n    X_gt = np.array(\n        [\n            [0.68481731, 0.51935141],\n            [1.34192108, -0.13367336],\n            [0.62366841, -0.21312976],\n            [1.61091956, -0.40283504],\n            [-0.37162401, -2.19400981],\n            [0.74680821, 1.63827342],\n            [0.61472253, -0.82309052],\n            [0.19893132, -0.47761769],\n            [1.40301027, -0.83648734],\n            [-1.20515198, -1.02689695],\n            [-0.23374509, 0.18370049],\n            [-0.00288378, 0.84259929],\n            [1.79580611, -0.02219234],\n    
        [0.38307743, -0.05670439],\n            [0.70319159, -0.02571667],\n            [0.75052536, -0.19246518],\n        ]\n    )\n    y_gt = np.array([1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0])\n    assert_allclose(X_resampled, X_gt, rtol=R_TOL)\n    assert_array_equal(y_resampled, y_gt)\n\n\ndef test_validate_estimator_default():\n    smt = SMOTETomek(random_state=RND_SEED)\n    X_resampled, y_resampled = smt.fit_resample(X, Y)\n    X_gt = np.array(\n        [\n            [0.68481731, 0.51935141],\n            [1.34192108, -0.13367336],\n            [0.62366841, -0.21312976],\n            [1.61091956, -0.40283504],\n            [-0.37162401, -2.19400981],\n            [0.74680821, 1.63827342],\n            [0.61472253, -0.82309052],\n            [0.19893132, -0.47761769],\n            [1.40301027, -0.83648734],\n            [-1.20515198, -1.02689695],\n            [-0.23374509, 0.18370049],\n            [-0.00288378, 0.84259929],\n            [1.79580611, -0.02219234],\n            [0.38307743, -0.05670439],\n            [0.70319159, -0.02571667],\n            [0.75052536, -0.19246518],\n        ]\n    )\n    y_gt = np.array([1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0])\n    assert_allclose(X_resampled, X_gt, rtol=R_TOL)\n    assert_array_equal(y_resampled, y_gt)\n\n\ndef test_parallelisation():\n    # Check if default job count is None\n    smt = SMOTETomek(random_state=RND_SEED)\n    smt._validate_estimator()\n    assert smt.n_jobs is None\n    assert smt.tomek_.n_jobs is None\n\n    # Check if job count is set\n    smt = SMOTETomek(random_state=RND_SEED, n_jobs=8)\n    smt._validate_estimator()\n    assert smt.n_jobs == 8\n    assert smt.tomek_.n_jobs == 8\n"
  },
  {
    "path": "imblearn/datasets/__init__.py",
    "content": "\"\"\"\nThe :mod:`imblearn.datasets` provides methods to generate\nimbalanced data.\n\"\"\"\n\nfrom imblearn.datasets._imbalance import make_imbalance\nfrom imblearn.datasets._zenodo import fetch_datasets\n\n__all__ = [\"make_imbalance\", \"fetch_datasets\"]\n"
  },
  {
    "path": "imblearn/datasets/_imbalance.py",
    "content": "\"\"\"Transform a dataset into an imbalanced dataset.\"\"\"\n\n# Authors: Dayvid Oliveira\n#          Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Christos Aridas\n# License: MIT\n\nfrom collections import Counter\nfrom collections.abc import Mapping\n\nfrom sklearn_compat.utils._param_validation import validate_params\n\nfrom imblearn.under_sampling import RandomUnderSampler\nfrom imblearn.utils import check_sampling_strategy\n\n\n@validate_params(\n    {\n        \"X\": [\"array-like\"],\n        \"y\": [\"array-like\"],\n        \"sampling_strategy\": [Mapping, callable, None],\n        \"random_state\": [\"random_state\"],\n        \"verbose\": [\"boolean\"],\n    },\n    prefer_skip_nested_validation=True,\n)\ndef make_imbalance(\n    X, y, *, sampling_strategy=None, random_state=None, verbose=False, **kwargs\n):\n    \"\"\"Turn a dataset into an imbalanced dataset with a specific sampling strategy.\n\n    A simple toy dataset to visualize clustering and classification\n    algorithms.\n\n    Read more in the :ref:`User Guide <make_imbalanced>`.\n\n    Parameters\n    ----------\n    X : {array-like, dataframe} of shape (n_samples, n_features)\n        Matrix containing the data to be imbalanced.\n\n    y : array-like of shape (n_samples,)\n        Corresponding label for each sample in X.\n\n    sampling_strategy : dict or callable,\n        Ratio to use for resampling the data set.\n\n        - When ``dict``, the keys correspond to the targeted classes. The\n          values correspond to the desired number of samples for each targeted\n          class.\n\n        - When callable, function taking ``y`` and returns a ``dict``. The keys\n          correspond to the targeted classes. 
The values correspond to the\n          desired number of samples for each class.\n\n    random_state : int, RandomState instance or None, default=None\n        If int, random_state is the seed used by the random number generator;\n        If RandomState instance, random_state is the random number generator;\n        If None, the random number generator is the RandomState instance used\n        by np.random.\n\n    verbose : bool, default=False\n        Show information regarding the sampling.\n\n    **kwargs : dict\n        Dictionary of additional keyword arguments to pass to\n        ``sampling_strategy``.\n\n    Returns\n    -------\n    X_resampled : {ndarray, dataframe} of shape (n_samples_new, n_features)\n        The array containing the imbalanced data.\n\n    y_resampled : ndarray of shape (n_samples_new)\n        The corresponding label of `X_resampled`.\n\n    Notes\n    -----\n    See\n    :ref:`sphx_glr_auto_examples_applications_plot_multi_class_under_sampling.py`,\n    :ref:`sphx_glr_auto_examples_datasets_plot_make_imbalance.py`, and\n    :ref:`sphx_glr_auto_examples_api_plot_sampling_strategy_usage.py`.\n\n    Examples\n    --------\n    >>> from collections import Counter\n    >>> from sklearn.datasets import load_iris\n    >>> from imblearn.datasets import make_imbalance\n\n    >>> data = load_iris()\n    >>> X, y = data.data, data.target\n    >>> print(f'Distribution before imbalancing: {Counter(y)}')\n    Distribution before imbalancing: Counter({0: 50, 1: 50, 2: 50})\n    >>> X_res, y_res = make_imbalance(X, y,\n    ...                               sampling_strategy={0: 10, 1: 20, 2: 30},\n    ...                               
random_state=42)\n    >>> print(f'Distribution after imbalancing: {Counter(y_res)}')\n    Distribution after imbalancing: Counter({2: 30, 1: 20, 0: 10})\n    \"\"\"\n    target_stats = Counter(y)\n    # restrict sampling_strategy to be a dict or a callable\n    if isinstance(sampling_strategy, Mapping) or callable(sampling_strategy):\n        sampling_strategy_ = check_sampling_strategy(\n            sampling_strategy, y, \"under-sampling\", **kwargs\n        )\n    else:\n        # without this branch, `sampling_strategy_` would be unbound below\n        raise ValueError(\n            \"`sampling_strategy` should be a dict or a callable.\"\n            f\" Got {sampling_strategy!r} instead.\"\n        )\n\n    if verbose:\n        print(f\"The original target distribution in the dataset is: {target_stats}\")\n    rus = RandomUnderSampler(\n        sampling_strategy=sampling_strategy_,\n        replacement=False,\n        random_state=random_state,\n    )\n    X_resampled, y_resampled = rus.fit_resample(X, y)\n    if verbose:\n        print(f\"Make the dataset imbalanced: {Counter(y_resampled)}\")\n\n    return X_resampled, y_resampled\n"
  },
  {
    "path": "imblearn/datasets/_zenodo.py",
    "content": "\"\"\"Collection of imbalanced datasets.\n\nThis collection of datasets has been proposed in [1]_. The\ncharacteristics of the available datasets are presented in the table\nbelow.\n\n ID    Name           Repository & Target           Ratio  #S       #F\n 1     ecoli          UCI, target: imU              8.6:1  336      7\n 2     optical_digits UCI, target: 8                9.1:1  5,620    64\n 3     satimage       UCI, target: 4                9.3:1  6,435    36\n 4     pen_digits     UCI, target: 5                9.4:1  10,992   16\n 5     abalone        UCI, target: 7                9.7:1  4,177    10\n 6     sick_euthyroid UCI, target: sick euthyroid   9.8:1  3,163    42\n 7     spectrometer   UCI, target: >=44             11:1   531      93\n 8     car_eval_34    UCI, target: good, v good     12:1   1,728    21\n 9     isolet         UCI, target: A, B             12:1   7,797    617\n 10    us_crime       UCI, target: >0.65            12:1   1,994    100\n 11    yeast_ml8      LIBSVM, target: 8             13:1   2,417    103\n 12    scene          LIBSVM, target: >one label    13:1   2,407    294\n 13    libras_move    UCI, target: 1                14:1   360      90\n 14    thyroid_sick   UCI, target: sick             15:1   3,772    52\n 15    coil_2000      KDD, CoIL, target: minority   16:1   9,822    85\n 16    arrhythmia     UCI, target: 06               17:1   452      278\n 17    solar_flare_m0 UCI, target: M->0             19:1   1,389    32\n 18    oil            UCI, target: minority         22:1   937      49\n 19    car_eval_4     UCI, target: vgood            26:1   1,728    21\n 20    wine_quality   UCI, wine, target: <=4        26:1   4,898    11\n 21    letter_img     UCI, target: Z                26:1   20,000   16\n 22    yeast_me2      UCI, target: ME2              28:1   1,484    8\n 23    webpage        LIBSVM, w7a, target: minority 33:1   34,780   300\n 24    ozone_level    UCI, ozone, data              34:1   2,536    
72\n 25    mammography    UCI, target: minority         42:1   11,183   6\n 26    protein_homo   KDD CUP 2004, minority        111:1  145,751  74\n 27    abalone_19     UCI, target: 19               130:1  4,177    10\n\nReferences\n----------\n.. [1] Ding, Zejin, \"Diversified Ensemble Classifiers for Highly\n   Imbalanced Data Learning and their Application in Bioinformatics.\"\n   Dissertation, Georgia State University, (2011).\n\"\"\"\n\n# Author: Guillaume Lemaitre\n# License: BSD 3 clause\n\nimport tarfile\nfrom collections import OrderedDict\nfrom inspect import signature\nfrom io import BytesIO\nfrom os import makedirs\nfrom os.path import isfile, join\nfrom urllib.request import urlopen\n\nimport numpy as np\nfrom sklearn.datasets import get_data_home\nfrom sklearn.utils import Bunch, check_random_state\nfrom sklearn_compat.utils._param_validation import validate_params\n\nURL = \"https://zenodo.org/record/61452/files/benchmark-imbalanced-learn.tar.gz\"\nPRE_FILENAME = \"x\"\nPOST_FILENAME = \"data.npz\"\n\nMAP_NAME_ID_KEYS = [\n    \"ecoli\",\n    \"optical_digits\",\n    \"satimage\",\n    \"pen_digits\",\n    \"abalone\",\n    \"sick_euthyroid\",\n    \"spectrometer\",\n    \"car_eval_34\",\n    \"isolet\",\n    \"us_crime\",\n    \"yeast_ml8\",\n    \"scene\",\n    \"libras_move\",\n    \"thyroid_sick\",\n    \"coil_2000\",\n    \"arrhythmia\",\n    \"solar_flare_m0\",\n    \"oil\",\n    \"car_eval_4\",\n    \"wine_quality\",\n    \"letter_img\",\n    \"yeast_me2\",\n    \"webpage\",\n    \"ozone_level\",\n    \"mammography\",\n    \"protein_homo\",\n    \"abalone_19\",\n]\n\nMAP_NAME_ID = OrderedDict()\nMAP_ID_NAME = OrderedDict()\nfor v, k in enumerate(MAP_NAME_ID_KEYS):\n    MAP_NAME_ID[k] = v + 1\n    MAP_ID_NAME[v + 1] = k\n\n\n@validate_params(\n    {\n        \"data_home\": [None, str],\n        \"filter_data\": [None, tuple],\n        \"download_if_missing\": [\"boolean\"],\n        \"random_state\": [\"random_state\"],\n        \"shuffle\": 
[\"boolean\"],\n        \"verbose\": [\"boolean\"],\n    },\n    prefer_skip_nested_validation=True,\n)\ndef fetch_datasets(\n    *,\n    data_home=None,\n    filter_data=None,\n    download_if_missing=True,\n    random_state=None,\n    shuffle=False,\n    verbose=False,\n):\n    \"\"\"Load the benchmark datasets from Zenodo, downloading it if necessary.\n\n    .. versionadded:: 0.3\n\n    Parameters\n    ----------\n    data_home : str, default=None\n        Specify another download and cache folder for the datasets. By default\n        all scikit-learn data is stored in '~/scikit_learn_data' subfolders.\n\n    filter_data : tuple of str/int, default=None\n        A tuple containing the ID or the name of the datasets to be returned.\n        Refer to the above table to get the ID and name of the datasets.\n\n    download_if_missing : bool, default=True\n        If False, raise a IOError if the data is not locally available\n        instead of trying to download the data from the source site.\n\n    random_state : int, RandomState instance or None, default=None\n        Random state for shuffling the dataset.\n        If int, random_state is the seed used by the random number generator;\n        If RandomState instance, random_state is the random number generator;\n        If None, the random number generator is the RandomState instance used\n        by `np.random`.\n\n    shuffle : bool, default=False\n        Whether to shuffle dataset.\n\n    verbose : bool, default=False\n        Show information regarding the fetching.\n\n    Returns\n    -------\n    datasets : OrderedDict of Bunch object,\n        The ordered is defined by ``filter_data``. 
Each Bunch object ---\n        referred to as dataset --- has the following attributes:\n\n        dataset.data : ndarray of shape (n_samples, n_features)\n\n        dataset.target : ndarray of shape (n_samples,)\n\n        dataset.DESCR : str\n            Description of each dataset.\n\n    Notes\n    -----\n    This collection of datasets has been proposed in [1]_. The\n    characteristics of the available datasets are presented in the table\n    below.\n\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |ID|Name          | Repository & Target           | Ratio | #S      | #F  |\n    +==+==============+===============================+=======+=========+=====+\n    |1 |ecoli         | UCI, target: imU              | 8.6:1 | 336     | 7   |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |2 |optical_digits| UCI, target: 8                | 9.1:1 | 5,620   | 64  |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |3 |satimage      | UCI, target: 4                | 9.3:1 | 6,435   | 36  |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |4 |pen_digits    | UCI, target: 5                | 9.4:1 | 10,992  | 16  |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |5 |abalone       | UCI, target: 7                | 9.7:1 | 4,177   | 10  |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |6 |sick_euthyroid| UCI, target: sick euthyroid   | 9.8:1 | 3,163   | 42  |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |7 |spectrometer  | UCI, target: >=44             | 11:1  | 531     | 93  |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |8 |car_eval_34   | UCI, target: good, v good     | 12:1  | 1,728   | 21  |\n    
+--+--------------+-------------------------------+-------+---------+-----+\n    |9 |isolet        | UCI, target: A, B             | 12:1  | 7,797   | 617 |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |10|us_crime      | UCI, target: >0.65            | 12:1  | 1,994   | 100 |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |11|yeast_ml8     | LIBSVM, target: 8             | 13:1  | 2,417   | 103 |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |12|scene         | LIBSVM, target: >one label    | 13:1  | 2,407   | 294 |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |13|libras_move   | UCI, target: 1                | 14:1  | 360     | 90  |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |14|thyroid_sick  | UCI, target: sick             | 15:1  | 3,772   | 52  |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |15|coil_2000     | KDD, CoIL, target: minority   | 16:1  | 9,822   | 85  |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |16|arrhythmia    | UCI, target: 06               | 17:1  | 452     | 278 |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |17|solar_flare_m0| UCI, target: M->0             | 19:1  | 1,389   | 32  |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |18|oil           | UCI, target: minority         | 22:1  | 937     | 49  |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |19|car_eval_4    | UCI, target: vgood            | 26:1  | 1,728   | 21  |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |20|wine_quality  | UCI, wine, target: <=4        | 26:1  | 4,898   | 11  |\n    
+--+--------------+-------------------------------+-------+---------+-----+\n    |21|letter_img    | UCI, target: Z                | 26:1  | 20,000  | 16  |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |22|yeast_me2     | UCI, target: ME2              | 28:1  | 1,484   | 8   |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |23|webpage       | LIBSVM, w7a, target: minority | 33:1  | 34,780  | 300 |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |24|ozone_level   | UCI, ozone, data              | 34:1  | 2,536   | 72  |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |25|mammography   | UCI, target: minority         | 42:1  | 11,183  | 6   |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |26|protein_homo  | KDD CUP 2004, minority        | 111:1 | 145,751 | 74  |\n    +--+--------------+-------------------------------+-------+---------+-----+\n    |27|abalone_19    | UCI, target: 19               | 130:1 | 4,177   | 10  |\n    +--+--------------+-------------------------------+-------+---------+-----+\n\n    References\n    ----------\n    .. [1] Ding, Zejin, \"Diversified Ensemble Classifiers for Highly\n       Imbalanced Data Learning and their Application in Bioinformatics.\"\n       Dissertation, Georgia State University, (2011).\n    \"\"\"\n\n    data_home = get_data_home(data_home=data_home)\n    zenodo_dir = join(data_home, \"zenodo\")\n    datasets = OrderedDict()\n\n    if filter_data is None:\n        filter_data_ = MAP_NAME_ID.keys()\n    else:\n        list_data = MAP_NAME_ID.keys()\n        filter_data_ = []\n        for it in filter_data:\n            if isinstance(it, str):\n                if it not in list_data:\n                    raise ValueError(\n                        f\"{it} is not a dataset available. 
\"\n                        f\"The available datasets are {list_data}\"\n                    )\n                else:\n                    filter_data_.append(it)\n            elif isinstance(it, int):\n                if it < 1 or it > 27:\n                    raise ValueError(\n                        f\"The dataset with the ID={it} is not an \"\n                        \"available dataset. The IDs are \"\n                        f\"{range(1, 28)}\"\n                    )\n                else:\n                    # The index start at one, then we need to remove one\n                    # to not have issue with the indexing.\n                    filter_data_.append(MAP_ID_NAME[it])\n            else:\n                raise ValueError(\n                    \"The value in the tuple should be str or int.\"\n                    f\" Got {type(it)} instead.\"\n                )\n\n    # go through the list and check if the data are available\n    for it in filter_data_:\n        filename = PRE_FILENAME + str(MAP_NAME_ID[it]) + POST_FILENAME\n        filename = join(zenodo_dir, filename)\n        available = isfile(filename)\n\n        if download_if_missing and not available:\n            makedirs(zenodo_dir, exist_ok=True)\n            if verbose:\n                print(f\"Downloading {URL}\")\n            f = BytesIO(urlopen(URL).read())\n            tar = tarfile.open(fileobj=f)\n            if \"filter\" in signature(tar.extractall).parameters:\n                tar.extractall(path=zenodo_dir, filter=\"data\")\n            else:  # Python < 3.12\n                tar.extractall(path=zenodo_dir)\n        elif not download_if_missing and not available:\n            raise OSError(\"Data not found and `download_if_missing` is False\")\n\n        data = np.load(filename)\n        X, y = data[\"data\"], data[\"label\"]\n\n        if shuffle:\n            ind = np.arange(X.shape[0])\n            rng = check_random_state(random_state)\n            rng.shuffle(ind)\n         
   X = X[ind]\n            y = y[ind]\n\n        datasets[it] = Bunch(data=X, target=y, DESCR=it)\n\n    return datasets\n"
  },
  {
    "path": "imblearn/datasets/tests/__init__.py",
    "content": ""
  },
  {
    "path": "imblearn/datasets/tests/test_imbalance.py",
    "content": "\"\"\"Test the module easy ensemble.\"\"\"\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Christos Aridas\n# License: MIT\n\nfrom collections import Counter\n\nimport numpy as np\nimport pytest\nfrom sklearn.datasets import load_iris\n\nfrom imblearn.datasets import make_imbalance\n\n\n@pytest.fixture\ndef iris():\n    return load_iris(return_X_y=True)\n\n\n@pytest.mark.parametrize(\n    \"sampling_strategy, err_msg\",\n    [\n        ({0: -100, 1: 50, 2: 50}, \"in a class cannot be negative\"),\n        ({0: 10, 1: 70}, \"should be less or equal to the original\"),\n    ],\n)\ndef test_make_imbalance_error(iris, sampling_strategy, err_msg):\n    # we are reusing part of utils.check_sampling_strategy, however this is not\n    # cover in the common tests so we will repeat it here\n    X, y = iris\n    with pytest.raises(ValueError, match=err_msg):\n        make_imbalance(X, y, sampling_strategy=sampling_strategy)\n\n\ndef test_make_imbalance_error_single_class(iris):\n    X, y = iris\n    y = np.zeros_like(y)\n    with pytest.raises(ValueError, match=\"needs to have more than 1 class.\"):\n        make_imbalance(X, y, sampling_strategy={0: 10})\n\n\n@pytest.mark.parametrize(\n    \"sampling_strategy, expected_counts\",\n    [\n        ({0: 10, 1: 20, 2: 30}, {0: 10, 1: 20, 2: 30}),\n        ({0: 10, 1: 20}, {0: 10, 1: 20, 2: 50}),\n    ],\n)\ndef test_make_imbalance_dict(iris, sampling_strategy, expected_counts):\n    X, y = iris\n    _, y_ = make_imbalance(X, y, sampling_strategy=sampling_strategy)\n    assert Counter(y_) == expected_counts\n\n\n@pytest.mark.parametrize(\"as_frame\", [True, False], ids=[\"dataframe\", \"array\"])\n@pytest.mark.parametrize(\n    \"sampling_strategy, expected_counts\",\n    [\n        (\n            {\"setosa\": 10, \"versicolor\": 20, \"virginica\": 30},\n            {\"setosa\": 10, \"versicolor\": 20, \"virginica\": 30},\n        ),\n        (\n            {\"setosa\": 10, \"versicolor\": 
20},\n            {\"setosa\": 10, \"versicolor\": 20, \"virginica\": 50},\n        ),\n    ],\n)\ndef test_make_imbalanced_iris(as_frame, sampling_strategy, expected_counts):\n    pd = pytest.importorskip(\"pandas\")\n    iris = load_iris(as_frame=as_frame)\n    X, y = iris.data, iris.target\n    y = iris.target_names[iris.target]\n    if as_frame:\n        y = pd.Series(iris.target_names[iris.target], name=\"target\")\n    X_res, y_res = make_imbalance(X, y, sampling_strategy=sampling_strategy)\n    if as_frame:\n        assert hasattr(X_res, \"loc\")\n        pd.testing.assert_index_equal(X_res.index, y_res.index)\n    assert Counter(y_res) == expected_counts\n"
  },
  {
    "path": "imblearn/datasets/tests/test_zenodo.py",
    "content": "\"\"\"Test the datasets loader.\n\nSkipped if datasets is not already downloaded to data_home.\n\"\"\"\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Christos Aridas\n# License: MIT\n\nimport pytest\nfrom sklearn.utils._testing import SkipTest\n\nfrom imblearn.datasets import fetch_datasets\n\nDATASET_SHAPE = {\n    \"ecoli\": (336, 7),\n    \"optical_digits\": (5620, 64),\n    \"satimage\": (6435, 36),\n    \"pen_digits\": (10992, 16),\n    \"abalone\": (4177, 10),\n    \"sick_euthyroid\": (3163, 42),\n    \"spectrometer\": (531, 93),\n    \"car_eval_34\": (1728, 21),\n    \"isolet\": (7797, 617),\n    \"us_crime\": (1994, 100),\n    \"yeast_ml8\": (2417, 103),\n    \"scene\": (2407, 294),\n    \"libras_move\": (360, 90),\n    \"thyroid_sick\": (3772, 52),\n    \"coil_2000\": (9822, 85),\n    \"arrhythmia\": (452, 278),\n    \"solar_flare_m0\": (1389, 32),\n    \"oil\": (937, 49),\n    \"car_eval_4\": (1728, 21),\n    \"wine_quality\": (4898, 11),\n    \"letter_img\": (20000, 16),\n    \"yeast_me2\": (1484, 8),\n    \"webpage\": (34780, 300),\n    \"ozone_level\": (2536, 72),\n    \"mammography\": (11183, 6),\n    \"protein_homo\": (145751, 74),\n    \"abalone_19\": (4177, 10),\n}\n\n\ndef fetch(*args, **kwargs):\n    return fetch_datasets(*args, download_if_missing=True, **kwargs)\n\n\n@pytest.mark.xfail\ndef test_fetch():\n    try:\n        datasets1 = fetch(shuffle=True, random_state=42)\n    except OSError:\n        raise SkipTest(\"Zenodo dataset can not be loaded.\")\n\n    datasets2 = fetch(shuffle=True, random_state=37)\n\n    for k in DATASET_SHAPE.keys():\n        X1, X2 = datasets1[k].data, datasets2[k].data\n        assert DATASET_SHAPE[k] == X1.shape\n        assert X1.shape == X2.shape\n\n        y1, y2 = datasets1[k].target, datasets2[k].target\n        assert (X1.shape[0],) == y1.shape\n        assert (X1.shape[0],) == y2.shape\n\n\ndef test_fetch_filter():\n    try:\n        datasets1 = 
fetch(filter_data=tuple([1]), shuffle=True, random_state=42)\n    except OSError:\n        raise SkipTest(\"Zenodo dataset can not be loaded.\")\n\n    datasets2 = fetch(filter_data=tuple([\"ecoli\"]), shuffle=True, random_state=37)\n\n    X1, X2 = datasets1[\"ecoli\"].data, datasets2[\"ecoli\"].data\n    assert DATASET_SHAPE[\"ecoli\"] == X1.shape\n    assert X1.shape == X2.shape\n\n    assert X1.sum() == pytest.approx(X2.sum())\n\n    y1, y2 = datasets1[\"ecoli\"].target, datasets2[\"ecoli\"].target\n    assert (X1.shape[0],) == y1.shape\n    assert (X1.shape[0],) == y2.shape\n\n\n@pytest.mark.parametrize(\n    \"filter_data, err_msg\",\n    [\n        ((\"rnf\",), \"is not a dataset available\"),\n        ((-1,), \"dataset with the ID=\"),\n        ((100,), \"dataset with the ID=\"),\n        ((1.00,), \"value in the tuple\"),\n    ],\n)\ndef test_fetch_error(filter_data, err_msg):\n    with pytest.raises(ValueError, match=err_msg):\n        fetch_datasets(filter_data=filter_data)\n"
  },
  {
    "path": "imblearn/ensemble/__init__.py",
    "content": "\"\"\"\nThe :mod:`imblearn.ensemble` module include methods generating\nunder-sampled subsets combined inside an ensemble.\n\"\"\"\n\nfrom imblearn.ensemble._bagging import BalancedBaggingClassifier\nfrom imblearn.ensemble._easy_ensemble import EasyEnsembleClassifier\nfrom imblearn.ensemble._forest import BalancedRandomForestClassifier\nfrom imblearn.ensemble._weight_boosting import RUSBoostClassifier\n\n__all__ = [\n    \"BalancedBaggingClassifier\",\n    \"BalancedRandomForestClassifier\",\n    \"EasyEnsembleClassifier\",\n    \"RUSBoostClassifier\",\n]\n"
  },
  {
    "path": "imblearn/ensemble/_bagging.py",
    "content": "\"\"\"Bagging classifier trained on balanced bootstrap samples.\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Christos Aridas\n# License: MIT\n\nimport copy\nimport numbers\n\nimport numpy as np\nfrom sklearn.base import clone\nfrom sklearn.ensemble import BaggingClassifier\nfrom sklearn.tree import DecisionTreeClassifier\nfrom sklearn.utils._param_validation import HasMethods, Interval, StrOptions\nfrom sklearn_compat.base import _fit_context\n\nfrom imblearn.pipeline import Pipeline\nfrom imblearn.under_sampling import RandomUnderSampler\nfrom imblearn.under_sampling.base import BaseUnderSampler\nfrom imblearn.utils import Substitution, check_sampling_strategy, check_target_type\nfrom imblearn.utils._docstring import _n_jobs_docstring, _random_state_docstring\n\n\n@Substitution(\n    sampling_strategy=BaseUnderSampler._sampling_strategy_docstring,\n    n_jobs=_n_jobs_docstring,\n    random_state=_random_state_docstring,\n)\nclass BalancedBaggingClassifier(BaggingClassifier):\n    \"\"\"A Bagging classifier with additional balancing.\n\n    This implementation of Bagging is similar to the scikit-learn\n    implementation. It includes an additional step to balance the training set\n    at fit time using a given sampler.\n\n    This classifier can serves as a basis to implement various methods such as\n    Exactly Balanced Bagging [6]_, Roughly Balanced Bagging [7]_,\n    Over-Bagging [6]_, or SMOTE-Bagging [8]_.\n\n    Read more in the :ref:`User Guide <bagging>`.\n\n    Parameters\n    ----------\n    estimator : estimator object, default=None\n        The base estimator to fit on random subsets of the dataset.\n        If None, then the base estimator is a decision tree.\n\n        .. 
versionadded:: 0.10\n\n    n_estimators : int, default=10\n        The number of base estimators in the ensemble.\n\n    max_samples : int or float, default=1.0\n        The number of samples to draw from X to train each base estimator.\n\n        - If int, then draw ``max_samples`` samples.\n        - If float, then draw ``max_samples * X.shape[0]`` samples.\n\n    max_features : int or float, default=1.0\n        The number of features to draw from X to train each base estimator.\n\n        - If int, then draw ``max_features`` features.\n        - If float, then draw ``max_features * X.shape[1]`` features.\n\n    bootstrap : bool, default=True\n        Whether samples are drawn with replacement.\n\n        .. note::\n           Note that this bootstrap will be generated from the resampled\n           dataset.\n\n    bootstrap_features : bool, default=False\n        Whether features are drawn with replacement.\n\n    oob_score : bool, default=False\n        Whether to use out-of-bag samples to estimate\n        the generalization error.\n\n    warm_start : bool, default=False\n        When set to True, reuse the solution of the previous call to fit\n        and add more estimators to the ensemble, otherwise, just fit\n        a whole new ensemble.\n\n    {sampling_strategy}\n\n    replacement : bool, default=False\n        Whether to randomly sample with replacement when\n        `sampler is None`, corresponding to a\n        :class:`~imblearn.under_sampling.RandomUnderSampler`.\n\n    {n_jobs}\n\n    {random_state}\n\n    verbose : int, default=0\n        Controls the verbosity of the building process.\n\n    sampler : sampler object, default=None\n        The sampler used to balance the dataset before bootstrapping\n        (if `bootstrap=True`) and fitting a base estimator. By default, a\n        :class:`~imblearn.under_sampling.RandomUnderSampler` is used.\n\n        .. 
versionadded:: 0.8\n\n    Attributes\n    ----------\n    estimator_ : estimator\n        The base estimator from which the ensemble is grown.\n\n        .. versionadded:: 0.10\n\n    estimators_ : list of estimators\n        The collection of fitted base estimators.\n\n    sampler_ : sampler object\n        The validated sampler created from the `sampler` parameter.\n\n    estimators_samples_ : list of ndarray\n        The subset of drawn samples (i.e., the in-bag samples) for each base\n        estimator. Each subset is defined by a boolean mask.\n\n    estimators_features_ : list of ndarray\n        The subset of drawn features for each base estimator.\n\n    classes_ : ndarray of shape (n_classes,)\n        The class labels.\n\n    n_classes_ : int or list\n        The number of classes.\n\n    oob_score_ : float\n        Score of the training dataset obtained using an out-of-bag estimate.\n\n    oob_decision_function_ : ndarray of shape (n_samples, n_classes)\n        Decision function computed with out-of-bag estimate on the training\n        set. If n_estimators is small it might be possible that a data point\n        was never left out during the bootstrap. In this case,\n        ``oob_decision_function_`` might contain NaN.\n\n    n_features_in_ : int\n        Number of features in the input dataset.\n\n        .. versionadded:: 0.9\n\n    feature_names_in_ : ndarray of shape (`n_features_in_`,)\n        Names of features seen during `fit`. Defined only when `X` has feature\n        names that are all strings.\n\n        .. 
versionadded:: 0.9\n\n    See Also\n    --------\n    BalancedRandomForestClassifier : Random forest applying random-under\n        sampling to balance the different bootstraps.\n\n    EasyEnsembleClassifier : Ensemble of AdaBoost classifier trained on\n        balanced bootstraps.\n\n    RUSBoostClassifier : AdaBoost classifier where each bootstrap is balanced\n        using random-under sampling at each round of boosting.\n\n    Notes\n    -----\n    It is possible to turn this classifier into a balanced random forest [5]_\n    by passing a :class:`~sklearn.tree.DecisionTreeClassifier` with\n    `max_features='auto'` as a base estimator.\n\n    See\n    :ref:`sphx_glr_auto_examples_ensemble_plot_comparison_ensemble_classifier.py`.\n\n    References\n    ----------\n    .. [1] L. Breiman, \"Pasting small votes for classification in large\n           databases and on-line\", Machine Learning, 36(1), 85-103, 1999.\n\n    .. [2] L. Breiman, \"Bagging predictors\", Machine Learning, 24(2), 123-140,\n           1996.\n\n    .. [3] T. Ho, \"The random subspace method for constructing decision\n           forests\", Pattern Analysis and Machine Intelligence, 20(8), 832-844,\n           1998.\n\n    .. [4] G. Louppe and P. Geurts, \"Ensembles on Random Patches\", Machine\n           Learning and Knowledge Discovery in Databases, 346-361, 2012.\n\n    .. [5] C. Chen Chao, A. Liaw, and L. Breiman. \"Using random forest to\n           learn imbalanced data.\" University of California, Berkeley 110,\n           2004.\n\n    .. [6] R. Maclin, and D. Opitz. \"An empirical evaluation of bagging and\n           boosting.\" AAAI/IAAI 1997 (1997): 546-551.\n\n    .. [7] S. Hido, H. Kashima, and Y. Takahashi. \"Roughly balanced bagging\n           for imbalanced data.\" Statistical Analysis and Data Mining: The ASA\n           Data Science Journal 2.5‐6 (2009): 412-426.\n\n    .. [8] S. Wang, and X. Yao. 
\"Diversity analysis on imbalanced data sets by\n           using ensemble models.\" 2009 IEEE symposium on computational\n           intelligence and data mining. IEEE, 2009.\n\n    Examples\n    --------\n    >>> from collections import Counter\n    >>> from sklearn.datasets import make_classification\n    >>> from sklearn.model_selection import train_test_split\n    >>> from sklearn.metrics import confusion_matrix\n    >>> from imblearn.ensemble import BalancedBaggingClassifier\n    >>> X, y = make_classification(n_classes=2, class_sep=2,\n    ... weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,\n    ... n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)\n    >>> print('Original dataset shape %s' % Counter(y))\n    Original dataset shape Counter({{1: 900, 0: 100}})\n    >>> X_train, X_test, y_train, y_test = train_test_split(X, y,\n    ...                                                     random_state=0)\n    >>> bbc = BalancedBaggingClassifier(random_state=42)\n    >>> bbc.fit(X_train, y_train)\n    BalancedBaggingClassifier(...)\n    >>> y_pred = bbc.predict(X_test)\n    >>> print(confusion_matrix(y_test, y_pred))\n    [[ 23   0]\n     [  2 225]]\n    \"\"\"\n\n    # make a deepcopy to not modify the original dictionary\n    _parameter_constraints = copy.deepcopy(BaggingClassifier._parameter_constraints)\n    _parameter_constraints.update(\n        {\n            \"sampling_strategy\": [\n                Interval(numbers.Real, 0, 1, closed=\"right\"),\n                StrOptions({\"auto\", \"majority\", \"not minority\", \"not majority\", \"all\"}),\n                dict,\n                callable,\n            ],\n            \"replacement\": [\"boolean\"],\n            \"sampler\": [HasMethods([\"fit_resample\"]), None],\n        }\n    )\n\n    def __init__(\n        self,\n        estimator=None,\n        n_estimators=10,\n        *,\n        max_samples=1.0,\n        max_features=1.0,\n        bootstrap=True,\n        
bootstrap_features=False,\n        oob_score=False,\n        warm_start=False,\n        sampling_strategy=\"auto\",\n        replacement=False,\n        n_jobs=None,\n        random_state=None,\n        verbose=0,\n        sampler=None,\n    ):\n        super().__init__(\n            n_estimators=n_estimators,\n            max_samples=max_samples,\n            max_features=max_features,\n            bootstrap=bootstrap,\n            bootstrap_features=bootstrap_features,\n            oob_score=oob_score,\n            warm_start=warm_start,\n            n_jobs=n_jobs,\n            random_state=random_state,\n            verbose=verbose,\n        )\n        self.estimator = estimator\n        self.sampling_strategy = sampling_strategy\n        self.replacement = replacement\n        self.sampler = sampler\n\n    def _validate_y(self, y):\n        y_encoded = super()._validate_y(y)\n        if (\n            isinstance(self.sampling_strategy, dict)\n            and self.sampler_._sampling_type != \"bypass\"\n        ):\n            self._sampling_strategy = {\n                np.where(self.classes_ == key)[0][0]: value\n                for key, value in check_sampling_strategy(\n                    self.sampling_strategy,\n                    y,\n                    self.sampler_._sampling_type,\n                ).items()\n            }\n        else:\n            self._sampling_strategy = self.sampling_strategy\n        return y_encoded\n\n    def _validate_estimator(self, default=DecisionTreeClassifier()):\n        \"\"\"Check the estimator and the n_estimator attribute, set the\n        `estimator_` attribute.\"\"\"\n        if self.estimator is not None:\n            estimator = clone(self.estimator)\n        else:\n            estimator = clone(default)\n\n        if self.sampler_._sampling_type != \"bypass\":\n            self.sampler_.set_params(sampling_strategy=self._sampling_strategy)\n\n        self.estimator_ = Pipeline(\n            [(\"sampler\", 
self.sampler_), (\"classifier\", estimator)]\n        )\n\n    @_fit_context(prefer_skip_nested_validation=False)\n    def fit(self, X, y):\n        \"\"\"Build a Bagging ensemble of estimators from the training set (X, y).\n\n        Parameters\n        ----------\n        X : {array-like, sparse matrix} of shape (n_samples, n_features)\n            The training input samples. Sparse matrices are accepted only if\n            they are supported by the base estimator.\n\n        y : array-like of shape (n_samples,)\n            The target values (class labels in classification, real numbers in\n            regression).\n\n        Returns\n        -------\n        self : object\n            Fitted estimator.\n        \"\"\"\n        # overwrite the base class method by disallowing `sample_weight`\n        self._validate_params()\n        return super().fit(X, y)\n\n    def _fit(self, X, y, max_samples=None, max_depth=None, sample_weight=None):\n        check_target_type(y)\n        # the sampler needs to be validated before calling the parent `_fit`\n        # because `_validate_y` runs before `_validate_estimator` and needs to\n        # know which type of sampler we are using.\n        if self.sampler is None:\n            self.sampler_ = RandomUnderSampler(\n                replacement=self.replacement,\n            )\n        else:\n            self.sampler_ = clone(self.sampler)\n        # RandomUnderSampler does not support sample_weight. 
We need to pass\n        # None.\n        return super()._fit(X, y, self.max_samples)\n\n    @property\n    def base_estimator_(self):\n        \"\"\"Attribute for older sklearn version compatibility.\"\"\"\n        error = AttributeError(\n            f\"{self.__class__.__name__} object has no attribute 'base_estimator_'.\"\n        )\n        raise error\n\n    def _more_tags(self):\n        tags = super()._more_tags()\n        tags_key = \"_xfail_checks\"\n        failing_test = \"check_estimators_nan_inf\"\n        reason = \"Fails because the sampler removed infinity and NaN values\"\n        if tags_key in tags:\n            tags[tags_key][failing_test] = reason\n        else:\n            tags[tags_key] = {failing_test: reason}\n        return tags\n\n    def __sklearn_tags__(self):\n        tags = super().__sklearn_tags__()\n        return tags\n"
  },
  {
    "path": "imblearn/ensemble/_common.py",
    "content": "from numbers import Integral, Real\n\nfrom sklearn.tree._criterion import Criterion\nfrom sklearn.utils._param_validation import (\n    HasMethods,\n    Hidden,\n    Interval,\n    RealNotInt,\n    StrOptions,\n)\n\n\ndef _estimator_has(attr):\n    \"\"\"Check if we can delegate a method to the underlying estimator.\n    First, we check the first fitted estimator if available, otherwise we\n    check the estimator attribute.\n    \"\"\"\n\n    def check(self):\n        if hasattr(self, \"estimators_\"):\n            return hasattr(self.estimators_[0], attr)\n        else:  # self.estimator is not None\n            return hasattr(self.estimator, attr)\n\n    return check\n\n\n_bagging_parameter_constraints = {\n    \"estimator\": [HasMethods([\"fit\", \"predict\"]), None],\n    \"n_estimators\": [Interval(Integral, 1, None, closed=\"left\")],\n    \"max_samples\": [\n        Interval(Integral, 1, None, closed=\"left\"),\n        Interval(RealNotInt, 0, 1, closed=\"right\"),\n    ],\n    \"max_features\": [\n        Interval(Integral, 1, None, closed=\"left\"),\n        Interval(RealNotInt, 0, 1, closed=\"right\"),\n    ],\n    \"bootstrap\": [\"boolean\"],\n    \"bootstrap_features\": [\"boolean\"],\n    \"oob_score\": [\"boolean\"],\n    \"warm_start\": [\"boolean\"],\n    \"n_jobs\": [None, Integral],\n    \"random_state\": [\"random_state\"],\n    \"verbose\": [\"verbose\"],\n}\n\n_adaboost_classifier_parameter_constraints = {\n    \"estimator\": [HasMethods([\"fit\", \"predict\"]), None],\n    \"n_estimators\": [Interval(Integral, 1, None, closed=\"left\")],\n    \"learning_rate\": [Interval(Real, 0, None, closed=\"neither\")],\n    \"random_state\": [\"random_state\"],\n    \"base_estimator\": [HasMethods([\"fit\", \"predict\"]), StrOptions({\"deprecated\"})],\n    \"algorithm\": [StrOptions({\"SAMME\", \"SAMME.R\"})],\n}\n\n_random_forest_classifier_parameter_constraints = {\n    \"n_estimators\": [Interval(Integral, 1, None, 
closed=\"left\")],\n    \"bootstrap\": [\"boolean\"],\n    \"oob_score\": [\"boolean\"],\n    \"n_jobs\": [Integral, None],\n    \"random_state\": [\"random_state\"],\n    \"verbose\": [\"verbose\"],\n    \"warm_start\": [\"boolean\"],\n    \"criterion\": [StrOptions({\"gini\", \"entropy\", \"log_loss\"}), Hidden(Criterion)],\n    \"max_samples\": [\n        None,\n        Interval(Real, 0.0, 1.0, closed=\"right\"),\n        Interval(Integral, 1, None, closed=\"left\"),\n    ],\n    \"max_depth\": [Interval(Integral, 1, None, closed=\"left\"), None],\n    \"min_samples_split\": [\n        Interval(Integral, 2, None, closed=\"left\"),\n        Interval(RealNotInt, 0.0, 1.0, closed=\"right\"),\n    ],\n    \"min_samples_leaf\": [\n        Interval(Integral, 1, None, closed=\"left\"),\n        Interval(RealNotInt, 0.0, 1.0, closed=\"neither\"),\n    ],\n    \"min_weight_fraction_leaf\": [Interval(Real, 0.0, 0.5, closed=\"both\")],\n    \"max_features\": [\n        Interval(Integral, 1, None, closed=\"left\"),\n        Interval(RealNotInt, 0.0, 1.0, closed=\"right\"),\n        StrOptions({\"sqrt\", \"log2\"}),\n        None,\n    ],\n    \"max_leaf_nodes\": [Interval(Integral, 2, None, closed=\"left\"), None],\n    \"min_impurity_decrease\": [Interval(Real, 0.0, None, closed=\"left\")],\n    \"ccp_alpha\": [Interval(Real, 0.0, None, closed=\"left\")],\n    \"class_weight\": [\n        StrOptions({\"balanced_subsample\", \"balanced\"}),\n        dict,\n        list,\n        None,\n    ],\n    \"monotonic_cst\": [\"array-like\", None],\n}\n"
  },
  {
    "path": "imblearn/ensemble/_easy_ensemble.py",
    "content": "\"\"\"Class to perform under-sampling using easy ensemble.\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Christos Aridas\n# License: MIT\n\nimport copy\nimport numbers\n\nimport numpy as np\nfrom sklearn.base import clone\nfrom sklearn.ensemble import AdaBoostClassifier, BaggingClassifier\nfrom sklearn.utils._param_validation import Interval, StrOptions\nfrom sklearn.utils.fixes import parse_version\nfrom sklearn_compat._sklearn_compat import sklearn_version\nfrom sklearn_compat.base import _fit_context\n\nfrom imblearn.ensemble._common import _bagging_parameter_constraints\nfrom imblearn.pipeline import Pipeline\nfrom imblearn.under_sampling import RandomUnderSampler\nfrom imblearn.under_sampling.base import BaseUnderSampler\nfrom imblearn.utils import Substitution, check_sampling_strategy, check_target_type\nfrom imblearn.utils._docstring import _n_jobs_docstring, _random_state_docstring\nfrom imblearn.utils._tags import get_tags\n\nMAX_INT = np.iinfo(np.int32).max\n\n\n@Substitution(\n    sampling_strategy=BaseUnderSampler._sampling_strategy_docstring,\n    n_jobs=_n_jobs_docstring,\n    random_state=_random_state_docstring,\n)\nclass EasyEnsembleClassifier(BaggingClassifier):\n    \"\"\"Bag of balanced boosted learners also known as EasyEnsemble.\n\n    This algorithm is known as EasyEnsemble [1]_. The classifier is an\n    ensemble of AdaBoost learners trained on different balanced bootstrap\n    samples. The balancing is achieved by random under-sampling.\n\n    Read more in the :ref:`User Guide <boosting>`.\n\n    .. versionadded:: 0.4\n\n    Parameters\n    ----------\n    n_estimators : int, default=10\n        Number of AdaBoost learners in the ensemble.\n\n    estimator : estimator object, default=AdaBoostClassifier()\n        The base AdaBoost classifier used in the inner ensemble. Note that you\n        can set the number of inner learner by passing your own instance.\n\n        .. 
versionadded:: 0.10\n\n    warm_start : bool, default=False\n        When set to True, reuse the solution of the previous call to fit\n        and add more estimators to the ensemble, otherwise, just fit\n        a whole new ensemble.\n\n    {sampling_strategy}\n\n    replacement : bool, default=False\n        Whether to sample randomly with replacement.\n\n    {n_jobs}\n\n    {random_state}\n\n    verbose : int, default=0\n        Controls the verbosity of the building process.\n\n    Attributes\n    ----------\n    estimator_ : estimator\n        The base estimator from which the ensemble is grown.\n\n        .. versionadded:: 0.10\n\n    estimators_ : list of estimators\n        The collection of fitted base estimators.\n\n    estimators_samples_ : list of arrays\n        The subset of drawn samples for each base estimator.\n\n    estimators_features_ : list of arrays\n        The subset of drawn features for each base estimator.\n\n    classes_ : array, shape (n_classes,)\n        The class labels.\n\n    n_classes_ : int or list\n        The number of classes.\n\n    n_features_in_ : int\n        Number of features in the input dataset.\n\n        .. versionadded:: 0.9\n\n    feature_names_in_ : ndarray of shape (`n_features_in_`,)\n        Names of features seen during `fit`. Defined only when `X` has feature\n        names that are all strings.\n\n        .. 
versionadded:: 0.9\n\n    See Also\n    --------\n    BalancedBaggingClassifier : Bagging classifier for which each base\n        estimator is trained on a balanced bootstrap.\n\n    BalancedRandomForestClassifier : Random forest applying random\n        under-sampling to balance the different bootstraps.\n\n    RUSBoostClassifier : AdaBoost classifier where each bootstrap is balanced\n        using random under-sampling at each round of boosting.\n\n    Notes\n    -----\n    The method is described in [1]_.\n\n    Supports multi-class resampling by sampling each class independently.\n\n    References\n    ----------\n    .. [1] X. Y. Liu, J. Wu and Z. H. Zhou, \"Exploratory Undersampling for\n       Class-Imbalance Learning,\" in IEEE Transactions on Systems, Man, and\n       Cybernetics, Part B (Cybernetics), vol. 39, no. 2, pp. 539-550,\n       April 2009.\n\n    Examples\n    --------\n    >>> from collections import Counter\n    >>> from sklearn.datasets import make_classification\n    >>> from sklearn.model_selection import train_test_split\n    >>> from sklearn.metrics import confusion_matrix\n    >>> from imblearn.ensemble import EasyEnsembleClassifier\n    >>> X, y = make_classification(n_classes=2, class_sep=2,\n    ... weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,\n    ... n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)\n    >>> print('Original dataset shape %s' % Counter(y))\n    Original dataset shape Counter({{1: 900, 0: 100}})\n    >>> X_train, X_test, y_train, y_test = train_test_split(X, y,\n    ...                                                     
random_state=0)\n    >>> eec = EasyEnsembleClassifier(random_state=42)\n    >>> eec.fit(X_train, y_train)\n    EasyEnsembleClassifier(...)\n    >>> y_pred = eec.predict(X_test)\n    >>> print(confusion_matrix(y_test, y_pred))\n    [[ 23   0]\n     [  2 225]]\n    \"\"\"\n\n    # make a deepcopy to not modify the original dictionary\n    if sklearn_version >= parse_version(\"1.4\"):\n        _parameter_constraints = copy.deepcopy(BaggingClassifier._parameter_constraints)\n    else:\n        _parameter_constraints = copy.deepcopy(_bagging_parameter_constraints)\n\n    excluded_params = {\n        \"bootstrap\",\n        \"bootstrap_features\",\n        \"max_features\",\n        \"oob_score\",\n        \"max_samples\",\n    }\n    for param in excluded_params:\n        _parameter_constraints.pop(param, None)\n\n    _parameter_constraints.update(\n        {\n            \"sampling_strategy\": [\n                Interval(numbers.Real, 0, 1, closed=\"right\"),\n                StrOptions({\"auto\", \"majority\", \"not minority\", \"not majority\", \"all\"}),\n                dict,\n                callable,\n            ],\n            \"replacement\": [\"boolean\"],\n        }\n    )\n    # TODO: remove when minimum supported version of scikit-learn is 1.4\n    if \"base_estimator\" in _parameter_constraints:\n        del _parameter_constraints[\"base_estimator\"]\n\n    def __init__(\n        self,\n        n_estimators=10,\n        estimator=None,\n        *,\n        warm_start=False,\n        sampling_strategy=\"auto\",\n        replacement=False,\n        n_jobs=None,\n        random_state=None,\n        verbose=0,\n    ):\n        super().__init__(\n            n_estimators=n_estimators,\n            max_samples=1.0,\n            max_features=1.0,\n            bootstrap=False,\n            bootstrap_features=False,\n            oob_score=False,\n            warm_start=warm_start,\n            n_jobs=n_jobs,\n            random_state=random_state,\n            
verbose=verbose,\n        )\n        self.estimator = estimator\n        self.sampling_strategy = sampling_strategy\n        self.replacement = replacement\n\n    def _validate_y(self, y):\n        y_encoded = super()._validate_y(y)\n        if isinstance(self.sampling_strategy, dict):\n            self._sampling_strategy = {\n                np.where(self.classes_ == key)[0][0]: value\n                for key, value in check_sampling_strategy(\n                    self.sampling_strategy,\n                    y,\n                    \"under-sampling\",\n                ).items()\n            }\n        else:\n            self._sampling_strategy = self.sampling_strategy\n        return y_encoded\n\n    def _validate_estimator(self, default=None):\n        \"\"\"Check the estimator and the n_estimator attribute, set the\n        `estimator_` attribute.\"\"\"\n        if self.estimator is not None:\n            estimator = clone(self.estimator)\n        else:\n            if default is None:\n                default = self._get_estimator()\n            estimator = clone(default)\n\n        sampler = RandomUnderSampler(\n            sampling_strategy=self._sampling_strategy,\n            replacement=self.replacement,\n        )\n        self.estimator_ = Pipeline([(\"sampler\", sampler), (\"classifier\", estimator)])\n\n    @_fit_context(prefer_skip_nested_validation=False)\n    def fit(self, X, y):\n        \"\"\"Build a Bagging ensemble of estimators from the training set (X, y).\n\n        Parameters\n        ----------\n        X : {array-like, sparse matrix} of shape (n_samples, n_features)\n            The training input samples. 
Sparse matrices are accepted only if\n            they are supported by the base estimator.\n\n        y : array-like of shape (n_samples,)\n            The target values (class labels in classification, real numbers in\n            regression).\n\n        Returns\n        -------\n        self : object\n            Fitted estimator.\n        \"\"\"\n        self._validate_params()\n        # overwrite the base class method by disallowing `sample_weight`\n        return super().fit(X, y)\n\n    def _fit(self, X, y, max_samples=None, max_depth=None, sample_weight=None):\n        check_target_type(y)\n        # RandomUnderSampler is not supporting sample_weight. We need to pass\n        # None.\n        return super()._fit(X, y, self.max_samples)\n\n    @property\n    def base_estimator_(self):\n        \"\"\"Attribute for older sklearn version compatibility.\"\"\"\n        error = AttributeError(\n            f\"{self.__class__.__name__} object has no attribute 'base_estimator_'.\"\n        )\n        raise error\n\n    def _get_estimator(self):\n        if self.estimator is None:\n            if parse_version(\"1.4\") <= sklearn_version < parse_version(\"1.6\"):\n                return AdaBoostClassifier(algorithm=\"SAMME\")\n            else:\n                return AdaBoostClassifier()\n        return self.estimator\n\n    def _more_tags(self):\n        return {\"allow_nan\": get_tags(self._get_estimator()).input_tags.allow_nan}\n\n    def __sklearn_tags__(self):\n        tags = super().__sklearn_tags__()\n        tags.input_tags.allow_nan = get_tags(self._get_estimator()).input_tags.allow_nan\n        return tags\n"
  },
  {
    "path": "imblearn/ensemble/_forest.py",
    "content": "\"\"\"Forest classifiers trained on balanced bootstrap samples.\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# License: MIT\n\nimport numbers\nfrom copy import deepcopy\nfrom warnings import warn\n\nimport numpy as np\nfrom numpy import float32 as DTYPE\nfrom numpy import float64 as DOUBLE\nfrom scipy.sparse import issparse\nfrom sklearn.base import clone, is_classifier\nfrom sklearn.ensemble import RandomForestClassifier\nfrom sklearn.ensemble._base import _set_random_states\nfrom sklearn.ensemble._forest import (\n    _generate_unsampled_indices,\n    _get_n_samples_bootstrap,\n    _parallel_build_trees,\n)\nfrom sklearn.exceptions import DataConversionWarning\nfrom sklearn.tree import DecisionTreeClassifier\nfrom sklearn.utils import _safe_indexing, check_random_state\nfrom sklearn.utils._param_validation import Hidden, Interval, StrOptions\nfrom sklearn.utils.fixes import parse_version\nfrom sklearn.utils.multiclass import type_of_target\nfrom sklearn.utils.parallel import Parallel, delayed\nfrom sklearn.utils.validation import _check_sample_weight\nfrom sklearn_compat._sklearn_compat import sklearn_version\nfrom sklearn_compat.base import _fit_context\nfrom sklearn_compat.utils.validation import validate_data\n\nfrom imblearn.ensemble._common import _random_forest_classifier_parameter_constraints\nfrom imblearn.pipeline import make_pipeline\nfrom imblearn.under_sampling import RandomUnderSampler\nfrom imblearn.utils import Substitution\nfrom imblearn.utils._docstring import _n_jobs_docstring, _random_state_docstring\nfrom imblearn.utils._validation import check_sampling_strategy\n\nMAX_INT = np.iinfo(np.int32).max\n\n\ndef _local_parallel_build_trees(\n    sampler,\n    tree,\n    bootstrap,\n    X,\n    y,\n    sample_weight,\n    tree_idx,\n    n_trees,\n    verbose=0,\n    class_weight=None,\n    n_samples_bootstrap=None,\n    forest=None,\n    missing_values_in_feature_mask=None,\n):\n    # resample before fitting the 
tree\n    X_resampled, y_resampled = sampler.fit_resample(X, y)\n    if sample_weight is not None:\n        sample_weight = _safe_indexing(sample_weight, sampler.sample_indices_)\n    # cap the bootstrap size by the number of samples left after resampling\n    if n_samples_bootstrap is not None:\n        n_samples_bootstrap = min(n_samples_bootstrap, X_resampled.shape[0])\n\n    params_parallel_build_trees = {\n        \"tree\": tree,\n        \"X\": X_resampled,\n        \"y\": y_resampled,\n        \"sample_weight\": sample_weight,\n        \"tree_idx\": tree_idx,\n        \"n_trees\": n_trees,\n        \"verbose\": verbose,\n        \"class_weight\": class_weight,\n        \"n_samples_bootstrap\": n_samples_bootstrap,\n        \"bootstrap\": bootstrap,\n    }\n\n    params_parallel_build_trees[\"missing_values_in_feature_mask\"] = (\n        missing_values_in_feature_mask\n    )\n\n    tree = _parallel_build_trees(**params_parallel_build_trees)\n\n    return sampler, tree\n\n\n@Substitution(\n    n_jobs=_n_jobs_docstring,\n    random_state=_random_state_docstring,\n)\nclass BalancedRandomForestClassifier(RandomForestClassifier):\n    \"\"\"A balanced random forest classifier.\n\n    A balanced random forest differs from a classical random forest by the\n    fact that it will draw a bootstrap sample from the minority class and\n    sample with replacement the same number of samples from the majority\n    class.\n\n    Read more in the :ref:`User Guide <forest>`.\n\n    .. versionadded:: 0.4\n\n    Parameters\n    ----------\n    n_estimators : int, default=100\n        The number of trees in the forest.\n\n    criterion : {{\"gini\", \"entropy\"}}, default=\"gini\"\n        The function to measure the quality of a split. Supported criteria are\n        \"gini\" for the Gini impurity and \"entropy\" for the information gain.\n        Note: this parameter is tree-specific.\n\n    max_depth : int, default=None\n        The maximum depth of the tree. 
If None, then nodes are expanded until\n        all leaves are pure or until all leaves contain less than\n        min_samples_split samples.\n\n    min_samples_split : int or float, default=2\n        The minimum number of samples required to split an internal node:\n\n        - If int, then consider `min_samples_split` as the minimum number.\n        - If float, then `min_samples_split` is a percentage and\n          `ceil(min_samples_split * n_samples)` are the minimum\n          number of samples for each split.\n\n    min_samples_leaf : int or float, default=1\n        The minimum number of samples required to be at a leaf node:\n\n        - If int, then consider ``min_samples_leaf`` as the minimum number.\n        - If float, then ``min_samples_leaf`` is a fraction and\n          `ceil(min_samples_leaf * n_samples)` are the minimum\n          number of samples for each node.\n\n    min_weight_fraction_leaf : float, default=0.0\n        The minimum weighted fraction of the sum total of weights (of all\n        the input samples) required to be at a leaf node. 
Samples have\n        equal weight when sample_weight is not provided.\n\n    max_features : {{\"sqrt\", \"log2\"}}, int, float, or None, \\\n            default=\"sqrt\"\n        The number of features to consider when looking for the best split:\n\n        - If int, then consider `max_features` features at each split.\n        - If float, then `max_features` is a fraction and\n          `int(max_features * n_features)` features are considered at each\n          split.\n        - If \"sqrt\", then `max_features=sqrt(n_features)`.\n        - If \"log2\", then `max_features=log2(n_features)`.\n        - If None, then `max_features=n_features`.\n\n        Note: the search for a split does not stop until at least one\n        valid partition of the node samples is found, even if it requires\n        effectively inspecting more than ``max_features`` features.\n\n    max_leaf_nodes : int, default=None\n        Grow trees with ``max_leaf_nodes`` in best-first fashion.\n        Best nodes are defined as relative reduction in impurity.\n        If None then unlimited number of leaf nodes.\n\n    min_impurity_decrease : float, default=0.0\n        A node will be split if this split induces a decrease of the impurity\n        greater than or equal to this value.\n        The weighted impurity decrease equation is the following::\n\n            N_t / N * (impurity - N_t_R / N_t * right_impurity\n                                - N_t_L / N_t * left_impurity)\n\n        where ``N`` is the total number of samples, ``N_t`` is the number of\n        samples at the current node, ``N_t_L`` is the number of samples in the\n        left child, and ``N_t_R`` is the number of samples in the right child.\n        ``N``, ``N_t``, ``N_t_R`` and ``N_t_L`` all refer to the weighted sum,\n        if ``sample_weight`` is passed.\n\n    bootstrap : bool, default=False\n        Whether bootstrap samples 
are used when building trees.\n\n        .. versionchanged:: 0.13\n           The default of `bootstrap` changed from `True` to `False` in\n           version 0.13. Bootstrapping is already taken care of by the internal\n           sampler using `replacement=True`. This implementation follows the\n           algorithm proposed in [1]_.\n\n    oob_score : bool, default=False\n        Whether to use out-of-bag samples to estimate\n        the generalization accuracy.\n\n    sampling_strategy : float, str, dict, callable, default=\"all\"\n        Sampling information to sample the data set.\n\n        - When ``float``, it corresponds to the desired ratio of the number of\n          samples in the minority class over the number of samples in the\n          majority class after resampling. Therefore, the ratio is expressed as\n          :math:`\\\\alpha_{{us}} = N_{{m}} / N_{{rM}}` where :math:`N_{{m}}` is the\n          number of samples in the minority class and\n          :math:`N_{{rM}}` is the number of samples in the majority class\n          after resampling.\n\n          .. warning::\n             ``float`` is only available for **binary** classification. An\n             error is raised for multi-class classification.\n\n        - When ``str``, specify the class targeted by the resampling. The\n          number of samples in the different classes will be equalized.\n          Possible choices are:\n\n            ``'majority'``: resample only the majority class;\n\n            ``'not minority'``: resample all classes but the minority class;\n\n            ``'not majority'``: resample all classes but the majority class;\n\n            ``'all'``: resample all classes;\n\n            ``'auto'``: equivalent to ``'not minority'``.\n\n        - When ``dict``, the keys correspond to the targeted classes. 
The\n          values correspond to the desired number of samples for each targeted\n          class.\n\n        - When callable, a function taking ``y`` and returning a ``dict``. The\n          keys correspond to the targeted classes. The values correspond to the\n          desired number of samples for each class.\n\n        .. versionchanged:: 0.13\n           The default of `sampling_strategy` changed from `\"auto\"` to\n           `\"all\"` in version 0.13. This forces the use of a bootstrap of the\n           minority class as proposed in [1]_.\n\n    replacement : bool, default=True\n        Whether to sample randomly with replacement.\n\n        .. versionchanged:: 0.13\n           The default of `replacement` changed from `False` to `True` in\n           version 0.13. This forces the use of a bootstrap of the\n           minority class, drawn with replacement, as proposed in [1]_.\n\n    {n_jobs}\n\n    {random_state}\n\n    verbose : int, default=0\n        Controls the verbosity of the tree building process.\n\n    warm_start : bool, default=False\n        When set to ``True``, reuse the solution of the previous call to fit\n        and add more estimators to the ensemble, otherwise, just fit a whole\n        new forest.\n\n    class_weight : dict, list of dicts, {{\"balanced\", \"balanced_subsample\"}}, \\\n            default=None\n        Weights associated with classes in the form of a dictionary with the\n        key being the class_label and the value the weight.\n        If not given, all classes are supposed to have weight one. For\n        multi-output problems, a list of dicts can be provided in the same\n        order as the columns of y.\n        Note that for multioutput (including multilabel) weights should be\n        defined for each class of every column in its own dict. 
For example,\n        for four-class multilabel classification weights should be\n        [{{0: 1, 1: 1}}, {{0: 1, 1: 5}}, {{0: 1, 1: 1}}, {{0: 1, 1: 1}}]\n        instead of [{{1:1}}, {{2:5}}, {{3:1}}, {{4:1}}].\n        The \"balanced\" mode uses the values of y to automatically adjust\n        weights inversely proportional to class frequencies in the input data\n        as ``n_samples / (n_classes * np.bincount(y))``\n        The \"balanced_subsample\" mode is the same as \"balanced\" except that\n        weights are computed based on the bootstrap sample for every tree\n        grown.\n        For multi-output, the weights of each column of y will be multiplied.\n        Note that these weights will be multiplied with sample_weight (passed\n        through the fit method) if sample_weight is specified.\n\n    ccp_alpha : non-negative float, default=0.0\n        Complexity parameter used for Minimal Cost-Complexity Pruning. The\n        subtree with the largest cost complexity that is smaller than\n        ``ccp_alpha`` will be chosen. By default, no pruning is performed.\n\n        .. versionadded:: 0.6\n           Added in `scikit-learn` in 0.22\n\n    max_samples : int or float, default=None\n        If bootstrap is True, the number of samples to draw from X\n        to train each base estimator.\n            - If None (default), then draw `X.shape[0]` samples.\n            - If int, then draw `max_samples` samples.\n            - If float, then draw `max_samples * X.shape[0]` samples. Thus,\n              `max_samples` should be in the interval `(0, 1)`.\n        Be aware that the final number samples used will be the minimum between\n        the number of samples given in `max_samples` and the number of samples\n        obtained after resampling.\n\n        .. 
versionadded:: 0.6\n           Added in `scikit-learn` in 0.22\n\n    monotonic_cst : array-like of int of shape (n_features), default=None\n        Indicates the monotonicity constraint to enforce on each feature.\n          - 1: monotonic increase\n          - 0: no constraint\n          - -1: monotonic decrease\n\n        If monotonic_cst is None, no constraints are applied.\n\n        Monotonicity constraints are not supported for:\n          - multiclass classifications (i.e. when `n_classes > 2`),\n          - multioutput classifications (i.e. when `n_outputs_ > 1`),\n          - classifications trained on data with missing values.\n\n        The constraints hold over the probability of the positive class.\n\n        .. versionadded:: 0.12\n           Only supported when scikit-learn >= 1.4 is installed. Otherwise, a\n           `ValueError` is raised.\n\n    Attributes\n    ----------\n    estimator_ : :class:`~sklearn.tree.DecisionTreeClassifier` instance\n        The child estimator template used to create the collection of fitted\n        sub-estimators.\n\n        .. 
versionadded:: 0.10\n\n    estimators_ : list of :class:`~sklearn.tree.DecisionTreeClassifier`\n        The collection of fitted sub-estimators.\n\n    base_sampler_ : :class:`~imblearn.under_sampling.RandomUnderSampler`\n        The base sampler used to construct the subsequent list of samplers.\n\n    samplers_ : list of :class:`~imblearn.under_sampling.RandomUnderSampler`\n        The collection of fitted samplers.\n\n    pipelines_ : list of Pipeline.\n        The collection of fitted pipelines (samplers + trees).\n\n    classes_ : ndarray of shape (n_classes,) or a list of such arrays\n        The classes labels (single output problem), or a list of arrays of\n        class labels (multi-output problem).\n\n    n_classes_ : int or list\n        The number of classes (single output problem), or a list containing the\n        number of classes for each output (multi-output problem).\n\n    n_features_in_ : int\n        Number of features in the input dataset.\n\n        .. versionadded:: 0.9\n\n    feature_names_in_ : ndarray of shape (`n_features_in_`,)\n        Names of features seen during `fit`. Defined only when `X` has feature\n        names that are all strings.\n\n        .. versionadded:: 0.9\n\n    n_outputs_ : int\n        The number of outputs when ``fit`` is performed.\n\n    feature_importances_ : ndarray of shape (n_features,)\n        The feature importances (the higher, the more important the feature).\n\n    oob_score_ : float\n        Score of the training dataset obtained using an out-of-bag estimate.\n\n    oob_decision_function_ : ndarray of shape (n_samples, n_classes)\n        Decision function computed with out-of-bag estimate on the training\n        set. If n_estimators is small it might be possible that a data point\n        was never left out during the bootstrap. 
In this case,\n        `oob_decision_function_` might contain NaN.\n\n    See Also\n    --------\n    BalancedBaggingClassifier : Bagging classifier for which each base\n        estimator is trained on a balanced bootstrap.\n\n    EasyEnsembleClassifier : Ensemble of AdaBoost classifiers trained on\n        balanced bootstraps.\n\n    RUSBoostClassifier : AdaBoost classifier where each bootstrap is balanced\n        using random under-sampling at each round of boosting.\n\n    References\n    ----------\n    .. [1] Chen, Chao, Andy Liaw, and Leo Breiman. \"Using random forest to\n       learn imbalanced data.\" University of California, Berkeley 110 (2004):\n       1-12.\n\n    Examples\n    --------\n    >>> from imblearn.ensemble import BalancedRandomForestClassifier\n    >>> from sklearn.datasets import make_classification\n    >>>\n    >>> X, y = make_classification(n_samples=1000, n_classes=3,\n    ...                            n_informative=4, weights=[0.2, 0.3, 0.5],\n    ...                            random_state=0)\n    >>> clf = BalancedRandomForestClassifier(\n    ...     sampling_strategy=\"all\", replacement=True, max_depth=2, random_state=0,\n    ...     bootstrap=False)\n    >>> clf.fit(X, y)\n    BalancedRandomForestClassifier(...)\n    >>> print(clf.feature_importances_)\n    [...]\n    >>> print(clf.predict([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n    ...                     
0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]))\n    [1]\n    \"\"\"\n\n    # make a deepcopy to not modify the original dictionary\n    if sklearn_version >= parse_version(\"1.4\"):\n        _parameter_constraints = deepcopy(RandomForestClassifier._parameter_constraints)\n    else:\n        _parameter_constraints = deepcopy(\n            _random_forest_classifier_parameter_constraints\n        )\n\n    _parameter_constraints.update(\n        {\n            \"bootstrap\": [\"boolean\", Hidden(StrOptions({\"warn\"}))],\n            \"sampling_strategy\": [\n                Interval(numbers.Real, 0, 1, closed=\"right\"),\n                StrOptions({\"auto\", \"majority\", \"not minority\", \"not majority\", \"all\"}),\n                dict,\n                callable,\n                Hidden(StrOptions({\"warn\"})),\n            ],\n            \"replacement\": [\"boolean\", Hidden(StrOptions({\"warn\"}))],\n        }\n    )\n\n    def __init__(\n        self,\n        n_estimators=100,\n        *,\n        criterion=\"gini\",\n        max_depth=None,\n        min_samples_split=2,\n        min_samples_leaf=1,\n        min_weight_fraction_leaf=0.0,\n        max_features=\"sqrt\",\n        max_leaf_nodes=None,\n        min_impurity_decrease=0.0,\n        bootstrap=False,\n        oob_score=False,\n        sampling_strategy=\"all\",\n        replacement=True,\n        n_jobs=None,\n        random_state=None,\n        verbose=0,\n        warm_start=False,\n        class_weight=None,\n        ccp_alpha=0.0,\n        max_samples=None,\n        monotonic_cst=None,\n    ):\n        params_random_forest = {\n            \"criterion\": criterion,\n            \"max_depth\": max_depth,\n            \"n_estimators\": n_estimators,\n            \"bootstrap\": bootstrap,\n            \"oob_score\": oob_score,\n            \"n_jobs\": n_jobs,\n            \"random_state\": random_state,\n            \"verbose\": verbose,\n            \"warm_start\": warm_start,\n            \"class_weight\": 
class_weight,\n            \"min_samples_split\": min_samples_split,\n            \"min_samples_leaf\": min_samples_leaf,\n            \"min_weight_fraction_leaf\": min_weight_fraction_leaf,\n            \"max_features\": max_features,\n            \"max_leaf_nodes\": max_leaf_nodes,\n            \"min_impurity_decrease\": min_impurity_decrease,\n            \"ccp_alpha\": ccp_alpha,\n            \"max_samples\": max_samples,\n            \"monotonic_cst\": monotonic_cst,\n        }\n\n        super().__init__(**params_random_forest)\n\n        self.sampling_strategy = sampling_strategy\n        self.replacement = replacement\n\n    def _validate_estimator(self, default=DecisionTreeClassifier()):\n        \"\"\"Check the estimator and the n_estimator attribute, set the\n        `estimator_` attribute.\"\"\"\n        if self.estimator is not None:\n            self.estimator_ = clone(self.estimator)\n        else:\n            self.estimator_ = clone(default)\n\n        self.base_sampler_ = RandomUnderSampler(\n            sampling_strategy=self._sampling_strategy,\n            replacement=self.replacement,\n        )\n\n    def _make_sampler_estimator(self, random_state=None):\n        \"\"\"Make and configure a copy of the `base_estimator_` attribute.\n        Warning: This method should be used to properly instantiate new\n        sub-estimators.\n        \"\"\"\n        estimator = clone(self.estimator_)\n        estimator.set_params(**{p: getattr(self, p) for p in self.estimator_params})\n        sampler = clone(self.base_sampler_)\n\n        if random_state is not None:\n            _set_random_states(estimator, random_state)\n            _set_random_states(sampler, random_state)\n\n        return estimator, sampler\n\n    @_fit_context(prefer_skip_nested_validation=True)\n    def fit(self, X, y, sample_weight=None):\n        \"\"\"Build a forest of trees from the training set (X, y).\n\n        Parameters\n        ----------\n        X : {array-like, sparse 
matrix} of shape (n_samples, n_features)\n            The training input samples. Internally, its dtype will be converted\n            to ``dtype=np.float32``. If a sparse matrix is provided, it will be\n            converted into a sparse ``csc_matrix``.\n\n        y : array-like of shape (n_samples,) or (n_samples, n_outputs)\n            The target values (class labels in classification, real numbers in\n            regression).\n\n        sample_weight : array-like of shape (n_samples,)\n            Sample weights. If None, then samples are equally weighted. Splits\n            that would create child nodes with net zero or negative weight are\n            ignored while searching for a split in each node. In the case of\n            classification, splits are also ignored if they would result in any\n            single class carrying a negative weight in either child node.\n\n        Returns\n        -------\n        self : object\n            The fitted instance.\n        \"\"\"\n        self._validate_params()\n        # Validate or convert input data\n        if issparse(y):\n            raise ValueError(\"sparse multilabel-indicator for y is not supported.\")\n\n        X, y = validate_data(\n            self,\n            X=X,\n            y=y,\n            multi_output=True,\n            accept_sparse=\"csc\",\n            dtype=DTYPE,\n            ensure_all_finite=False,\n        )\n\n        # _compute_missing_values_in_feature_mask checks if X has missing values and\n        # will raise an error if the underlying tree base estimator can't handle\n        # missing values. 
Only the criterion is required to determine if the tree\n        # supports missing values.\n        estimator = type(self.estimator)(criterion=self.criterion)\n        missing_values_in_feature_mask = (\n            estimator._compute_missing_values_in_feature_mask(\n                X, estimator_name=self.__class__.__name__\n            )\n        )\n\n        if sample_weight is not None:\n            sample_weight = _check_sample_weight(sample_weight, X)\n\n        self._n_features = X.shape[1]\n\n        if issparse(X):\n            # Pre-sort indices to avoid that each individual tree of the\n            # ensemble sorts the indices.\n            X.sort_indices()\n\n        y = np.atleast_1d(y)\n        if y.ndim == 2 and y.shape[1] == 1:\n            warn(\n                (\n                    \"A column-vector y was passed when a 1d array was\"\n                    \" expected. Please change the shape of y to \"\n                    \"(n_samples,), for example using ravel().\"\n                ),\n                DataConversionWarning,\n                stacklevel=2,\n            )\n\n        if y.ndim == 1:\n            # reshape is necessary to preserve the data contiguity, unlike\n            # [:, np.newaxis], which does not.\n            y = np.reshape(y, (-1, 1))\n\n        self.n_outputs_ = y.shape[1]\n\n        y_encoded, expanded_class_weight = self._validate_y_class_weight(y)\n\n        if getattr(y, \"dtype\", None) != DOUBLE or not y.flags.contiguous:\n            y_encoded = np.ascontiguousarray(y_encoded, dtype=DOUBLE)\n\n        if isinstance(self.sampling_strategy, dict):\n            self._sampling_strategy = {\n                np.where(self.classes_[0] == key)[0][0]: value\n                for key, value in check_sampling_strategy(\n                    self.sampling_strategy,\n                    y,\n                    \"under-sampling\",\n                ).items()\n            }\n        else:\n            self._sampling_strategy = 
self.sampling_strategy\n\n        if expanded_class_weight is not None:\n            if sample_weight is not None:\n                sample_weight = sample_weight * expanded_class_weight\n            else:\n                sample_weight = expanded_class_weight\n\n        # Get bootstrap sample size\n        n_samples_bootstrap = _get_n_samples_bootstrap(\n            n_samples=X.shape[0], max_samples=self.max_samples\n        )\n\n        # Check parameters\n        self._validate_estimator()\n\n        if not self.bootstrap and self.oob_score:\n            raise ValueError(\"Out of bag estimation only available if bootstrap=True\")\n\n        random_state = check_random_state(self.random_state)\n\n        if not self.warm_start or not hasattr(self, \"estimators_\"):\n            # Free allocated memory, if any\n            self.estimators_ = []\n            self.samplers_ = []\n            self.pipelines_ = []\n\n        n_more_estimators = self.n_estimators - len(self.estimators_)\n\n        if n_more_estimators < 0:\n            raise ValueError(\n                \"n_estimators=%d must be larger or equal to \"\n                \"len(estimators_)=%d when warm_start==True\"\n                % (self.n_estimators, len(self.estimators_))\n            )\n\n        elif n_more_estimators == 0:\n            warn(\n                \"Warm-start fitting without increasing n_estimators does not \"\n                \"fit new trees.\"\n            )\n        else:\n            if self.warm_start and len(self.estimators_) > 0:\n                # We draw from the random state to get the random state we\n                # would have got if we hadn't used a warm_start.\n                random_state.randint(MAX_INT, size=len(self.estimators_))\n\n            trees = []\n            samplers = []\n            for _ in range(n_more_estimators):\n                tree, sampler = self._make_sampler_estimator(random_state=random_state)\n                trees.append(tree)\n                
samplers.append(sampler)\n\n            # Parallel loop: we prefer the threading backend as the Cython code\n            # for fitting the trees is internally releasing the Python GIL\n            # making threading more efficient than multiprocessing in\n            # that case. However, we respect any parallel_backend contexts set\n            # at a higher level, since correctness does not rely on using\n            # threads.\n            samplers_trees = Parallel(\n                n_jobs=self.n_jobs,\n                verbose=self.verbose,\n                prefer=\"threads\",\n            )(\n                delayed(_local_parallel_build_trees)(\n                    s,\n                    t,\n                    self.bootstrap,\n                    X,\n                    y_encoded,\n                    sample_weight,\n                    i,\n                    len(trees),\n                    verbose=self.verbose,\n                    class_weight=self.class_weight,\n                    n_samples_bootstrap=n_samples_bootstrap,\n                    forest=self,\n                    missing_values_in_feature_mask=missing_values_in_feature_mask,\n                )\n                for i, (s, t) in enumerate(zip(samplers, trees))\n            )\n            samplers, trees = zip(*samplers_trees)\n\n            # Collect newly grown trees\n            self.estimators_.extend(trees)\n            self.samplers_.extend(samplers)\n\n            # Create pipeline with the fitted samplers and trees\n            self.pipelines_.extend(\n                [\n                    make_pipeline(deepcopy(s), deepcopy(t))\n                    for s, t in zip(samplers, trees)\n                ]\n            )\n\n        if self.oob_score:\n            y_type = type_of_target(y)\n            if y_type in (\"multiclass-multioutput\", \"unknown\"):\n                # FIXME: we could consider to support multiclass-multioutput if\n                # we introduce or reuse a constructor 
parameter (e.g.\n                # oob_score) allowing our user to pass a callable defining the\n                # scoring strategy on OOB sample.\n                raise ValueError(\n                    \"The type of target cannot be used to compute OOB \"\n                    f\"estimates. Got {y_type} while only the following are \"\n                    \"supported: continuous, continuous-multioutput, binary, \"\n                    \"multiclass, multilabel-indicator.\"\n                )\n            self._set_oob_score_and_attributes(X, y_encoded)\n\n        # Decapsulate classes_ attributes\n        if hasattr(self, \"classes_\") and self.n_outputs_ == 1:\n            self.n_classes_ = self.n_classes_[0]\n            self.classes_ = self.classes_[0]\n\n        return self\n\n    def _set_oob_score_and_attributes(self, X, y):\n        \"\"\"Compute and set the OOB score and attributes.\n\n        Parameters\n        ----------\n        X : array-like of shape (n_samples, n_features)\n            The data matrix.\n        y : ndarray of shape (n_samples, n_outputs)\n            The target matrix.\n        \"\"\"\n        self.oob_decision_function_ = self._compute_oob_predictions(X, y)\n        if self.oob_decision_function_.shape[-1] == 1:\n            # drop the n_outputs axis if there is a single output\n            self.oob_decision_function_ = self.oob_decision_function_.squeeze(axis=-1)\n        from sklearn.metrics import accuracy_score\n\n        self.oob_score_ = accuracy_score(\n            y, np.argmax(self.oob_decision_function_, axis=1)\n        )\n\n    def _compute_oob_predictions(self, X, y):\n        \"\"\"Compute and set the OOB score.\n\n        Parameters\n        ----------\n        X : array-like of shape (n_samples, n_features)\n            The data matrix.\n        y : ndarray of shape (n_samples, n_outputs)\n            The target matrix.\n\n        Returns\n        -------\n        oob_pred : ndarray of shape (n_samples, n_classes, 
n_outputs) or \\\n                (n_samples, 1, n_outputs)\n            The OOB predictions.\n        \"\"\"\n        # Prediction requires X to be in CSR format\n        if issparse(X):\n            X = X.tocsr()\n\n        n_samples = y.shape[0]\n        n_outputs = self.n_outputs_\n\n        if is_classifier(self) and hasattr(self, \"n_classes_\"):\n            # n_classes_ is a ndarray at this stage\n            # all the supported type of target will have the same number of\n            # classes in all outputs\n            oob_pred_shape = (n_samples, self.n_classes_[0], n_outputs)\n        else:\n            # for regression, n_classes_ does not exist and we create an empty\n            # axis to be consistent with the classification case and make\n            # the array operations compatible with the 2 settings\n            oob_pred_shape = (n_samples, 1, n_outputs)\n\n        oob_pred = np.zeros(shape=oob_pred_shape, dtype=np.float64)\n        n_oob_pred = np.zeros((n_samples, n_outputs), dtype=np.int64)\n\n        for sampler, estimator in zip(self.samplers_, self.estimators_):\n            X_resample = X[sampler.sample_indices_]\n            y_resample = y[sampler.sample_indices_]\n\n            n_sample_subset = y_resample.shape[0]\n            n_samples_bootstrap = _get_n_samples_bootstrap(\n                n_sample_subset, self.max_samples\n            )\n\n            unsampled_indices = _generate_unsampled_indices(\n                estimator.random_state, n_sample_subset, n_samples_bootstrap\n            )\n\n            y_pred = self._get_oob_predictions(\n                estimator, X_resample[unsampled_indices, :]\n            )\n\n            indices = sampler.sample_indices_[unsampled_indices]\n            oob_pred[indices, ...] 
+= y_pred\n            n_oob_pred[indices, :] += 1\n\n        for k in range(n_outputs):\n            if (n_oob_pred == 0).any():\n                warn(\n                    (\n                        \"Some inputs do not have OOB scores. This probably means \"\n                        \"too few trees were used to compute any reliable OOB \"\n                        \"estimates.\"\n                    ),\n                    UserWarning,\n                )\n                n_oob_pred[n_oob_pred == 0] = 1\n            oob_pred[..., k] /= n_oob_pred[..., [k]]\n\n        return oob_pred\n\n    def _more_tags(self):\n        return {\"multioutput\": False, \"multilabel\": False}\n\n    def __sklearn_tags__(self):\n        tags = super().__sklearn_tags__()\n        tags.target_tags.multi_output = False\n        tags.classifier_tags.multi_label = False\n        tags.input_tags.allow_nan = sklearn_version >= parse_version(\"1.4\")\n        return tags\n"
  },
  {
    "path": "imblearn/ensemble/_weight_boosting.py",
    "content": "import copy\nimport numbers\nimport warnings\nfrom copy import deepcopy\n\nimport numpy as np\nfrom sklearn.base import clone\nfrom sklearn.ensemble import AdaBoostClassifier\nfrom sklearn.ensemble._base import _set_random_states\nfrom sklearn.tree import DecisionTreeClassifier\nfrom sklearn.utils import _safe_indexing\nfrom sklearn.utils._param_validation import Hidden, Interval, StrOptions\nfrom sklearn.utils.fixes import parse_version\nfrom sklearn.utils.validation import has_fit_parameter\nfrom sklearn_compat._sklearn_compat import sklearn_version\nfrom sklearn_compat.base import _fit_context\n\nfrom imblearn.ensemble._common import _adaboost_classifier_parameter_constraints\nfrom imblearn.pipeline import make_pipeline\nfrom imblearn.under_sampling import RandomUnderSampler\nfrom imblearn.under_sampling.base import BaseUnderSampler\nfrom imblearn.utils import Substitution, check_target_type\nfrom imblearn.utils._docstring import _random_state_docstring\n\n\n@Substitution(\n    sampling_strategy=BaseUnderSampler._sampling_strategy_docstring,\n    random_state=_random_state_docstring,\n)\nclass RUSBoostClassifier(AdaBoostClassifier):\n    \"\"\"Random under-sampling integrated in the learning of AdaBoost.\n\n    During learning, the problem of class balancing is alleviated by random\n    under-sampling the sample at each iteration of the boosting algorithm.\n\n    Read more in the :ref:`User Guide <boosting>`.\n\n    .. versionadded:: 0.4\n\n    Parameters\n    ----------\n    estimator : estimator object, default=None\n        The base estimator from which the boosted ensemble is built.\n        Support for sample weighting is required, as well as proper\n        ``classes_`` and ``n_classes_`` attributes. If ``None``, then\n        the base estimator is ``DecisionTreeClassifier(max_depth=1)``.\n\n        .. 
versionadded:: 0.12\n\n    n_estimators : int, default=50\n        The maximum number of estimators at which boosting is terminated.\n        In case of perfect fit, the learning procedure is stopped early.\n\n    learning_rate : float, default=1.0\n        Learning rate shrinks the contribution of each classifier by\n        ``learning_rate``. There is a trade-off between ``learning_rate`` and\n        ``n_estimators``.\n\n    algorithm : {{'SAMME', 'SAMME.R'}}, default='SAMME.R'\n        If 'SAMME.R' then use the SAMME.R real boosting algorithm.\n        ``estimator`` must support calculation of class probabilities.\n        If 'SAMME' then use the SAMME discrete boosting algorithm.\n        The SAMME.R algorithm typically converges faster than SAMME,\n        achieving a lower test error with fewer boosting iterations.\n\n        .. deprecated:: 0.12\n            `\"SAMME.R\"` is deprecated and will be removed in version 0.14.\n            `\"SAMME\"` will become the default.\n\n    {sampling_strategy}\n\n    replacement : bool, default=False\n        Whether to sample randomly with replacement.\n\n    {random_state}\n\n    Attributes\n    ----------\n    estimator_ : estimator\n        The base estimator from which the ensemble is grown.\n\n        .. 
versionadded:: 0.10\n\n    estimators_ : list of classifiers\n        The collection of fitted sub-estimators.\n\n    base_sampler_ : :class:`~imblearn.under_sampling.RandomUnderSampler`\n        The base sampler used to generate the subsequent samplers.\n\n    samplers_ : list of :class:`~imblearn.under_sampling.RandomUnderSampler`\n        The collection of fitted samplers.\n\n    pipelines_ : list of Pipeline\n        The collection of fitted pipelines (samplers + trees).\n\n    classes_ : ndarray of shape (n_classes,)\n        The classes labels.\n\n    n_classes_ : int\n        The number of classes.\n\n    estimator_weights_ : ndarray of shape (n_estimator,)\n        Weights for each estimator in the boosted ensemble.\n\n    estimator_errors_ : ndarray of shape (n_estimator,)\n        Classification error for each estimator in the boosted\n        ensemble.\n\n    feature_importances_ : ndarray of shape (n_features,)\n        The feature importances if supported by the ``base_estimator``.\n\n    n_features_in_ : int\n        Number of features in the input dataset.\n\n        .. versionadded:: 0.9\n\n    feature_names_in_ : ndarray of shape (`n_features_in_`,)\n        Names of features seen during `fit`. Defined only when `X` has feature\n        names that are all strings.\n\n        .. versionadded:: 0.9\n\n    See Also\n    --------\n    BalancedBaggingClassifier : Bagging classifier for which each base\n        estimator is trained on a balanced bootstrap.\n\n    BalancedRandomForestClassifier : Random forest applying random-under\n        sampling to balance the different bootstraps.\n\n    EasyEnsembleClassifier : Ensemble of AdaBoost classifier trained on\n        balanced bootstraps.\n\n    References\n    ----------\n    .. [1] Seiffert, C., Khoshgoftaar, T. 
M., Van Hulse, J., & Napolitano, A.\n       \"RUSBoost: A hybrid approach to alleviating class imbalance.\" IEEE\n       Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans\n       40.1 (2010): 185-197.\n\n    Examples\n    --------\n    >>> from imblearn.ensemble import RUSBoostClassifier\n    >>> from sklearn.datasets import make_classification\n    >>>\n    >>> X, y = make_classification(n_samples=1000, n_classes=3,\n    ...                            n_informative=4, weights=[0.2, 0.3, 0.5],\n    ...                            random_state=0)\n    >>> clf = RUSBoostClassifier(random_state=0)\n    >>> clf.fit(X, y)\n    RUSBoostClassifier(...)\n    >>> clf.predict(X)\n    array([...])\n    \"\"\"\n\n    # make a deepcopy to not modify the original dictionary\n    if sklearn_version >= parse_version(\"1.4\"):\n        _parameter_constraints = copy.deepcopy(\n            AdaBoostClassifier._parameter_constraints\n        )\n    else:\n        _parameter_constraints = copy.deepcopy(\n            _adaboost_classifier_parameter_constraints\n        )\n\n    _parameter_constraints.update(\n        {\n            \"algorithm\": [\n                StrOptions({\"SAMME\", \"SAMME.R\"}),\n                Hidden(StrOptions({\"deprecated\"})),\n            ],\n            \"sampling_strategy\": [\n                Interval(numbers.Real, 0, 1, closed=\"right\"),\n                StrOptions({\"auto\", \"majority\", \"not minority\", \"not majority\", \"all\"}),\n                dict,\n                callable,\n            ],\n            \"replacement\": [\"boolean\"],\n        }\n    )\n    # TODO: remove when minimum supported version of scikit-learn is 1.4\n    if \"base_estimator\" in _parameter_constraints:\n        del _parameter_constraints[\"base_estimator\"]\n\n    def __init__(\n        self,\n        estimator=None,\n        *,\n        n_estimators=50,\n        learning_rate=1.0,\n        algorithm=\"deprecated\",\n        
sampling_strategy=\"auto\",\n        replacement=False,\n        random_state=None,\n    ):\n        super().__init__(\n            n_estimators=n_estimators,\n            learning_rate=learning_rate,\n            random_state=random_state,\n        )\n        self.algorithm = algorithm\n        self.estimator = estimator\n        self.sampling_strategy = sampling_strategy\n        self.replacement = replacement\n\n    @_fit_context(prefer_skip_nested_validation=False)\n    def fit(self, X, y, sample_weight=None):\n        \"\"\"Build a boosted classifier from the training set (X, y).\n\n        Parameters\n        ----------\n        X : {array-like, sparse matrix} of shape (n_samples, n_features)\n            The training input samples. Sparse matrix can be CSC, CSR, COO,\n            DOK, or LIL. DOK and LIL are converted to CSR.\n\n        y : array-like of shape (n_samples,)\n            The target values (class labels).\n\n        sample_weight : array-like of shape (n_samples,), default=None\n            Sample weights. 
If None, the sample weights are initialized to\n            ``1 / n_samples``.\n\n        Returns\n        -------\n        self : object\n            Returns self.\n        \"\"\"\n        self._validate_params()\n        check_target_type(y)\n        self.samplers_ = []\n        self.pipelines_ = []\n        super().fit(X, y, sample_weight)\n        return self\n\n    def _validate_estimator(self):\n        \"\"\"Check the estimator and the n_estimator attribute.\n\n        Sets the `estimator_` attributes.\n        \"\"\"\n        default = DecisionTreeClassifier(max_depth=1)\n        if self.estimator is not None:\n            self.estimator_ = clone(self.estimator)\n        else:\n            self.estimator_ = clone(default)\n\n        #  SAMME-R requires predict_proba-enabled estimators\n        if self.algorithm == \"SAMME.R\":\n            if not hasattr(self.estimator_, \"predict_proba\"):\n                raise TypeError(\n                    \"AdaBoostClassifier with algorithm='SAMME.R' requires \"\n                    \"that the weak learner supports the calculation of class \"\n                    \"probabilities with a predict_proba method.\\n\"\n                    \"Please change the base estimator or set \"\n                    \"algorithm='SAMME' instead.\"\n                )\n        if not has_fit_parameter(self.estimator_, \"sample_weight\"):\n            raise ValueError(\n                f\"{self.estimator_.__class__.__name__} doesn't support sample_weight.\"\n            )\n\n        self.base_sampler_ = RandomUnderSampler(\n            sampling_strategy=self.sampling_strategy,\n            replacement=self.replacement,\n        )\n\n    def _make_sampler_estimator(self, append=True, random_state=None):\n        \"\"\"Make and configure a copy of the `base_estimator_` attribute.\n        Warning: This method should be used to properly instantiate new\n        sub-estimators.\n        \"\"\"\n        estimator = clone(self.estimator_)\n       
 estimator.set_params(**{p: getattr(self, p) for p in self.estimator_params})\n        sampler = clone(self.base_sampler_)\n\n        if random_state is not None:\n            _set_random_states(estimator, random_state)\n            _set_random_states(sampler, random_state)\n\n        if append:\n            self.estimators_.append(estimator)\n            self.samplers_.append(sampler)\n            self.pipelines_.append(\n                make_pipeline(deepcopy(sampler), deepcopy(estimator))\n            )\n\n        return estimator, sampler\n\n    def _boost_real(self, iboost, X, y, sample_weight, random_state):\n        \"\"\"Implement a single boost using the SAMME.R real algorithm.\"\"\"\n        estimator, sampler = self._make_sampler_estimator(random_state=random_state)\n\n        X_res, y_res = sampler.fit_resample(X, y)\n        sample_weight_res = _safe_indexing(sample_weight, sampler.sample_indices_)\n        estimator.fit(X_res, y_res, sample_weight=sample_weight_res)\n\n        y_predict_proba = estimator.predict_proba(X)\n\n        if iboost == 0:\n            self.classes_ = getattr(estimator, \"classes_\", None)\n            self.n_classes_ = len(self.classes_)\n\n        y_predict = self.classes_.take(np.argmax(y_predict_proba, axis=1), axis=0)\n\n        # Instances incorrectly classified\n        incorrect = y_predict != y\n\n        # Error fraction\n        estimator_error = np.mean(np.average(incorrect, weights=sample_weight, axis=0))\n\n        # Stop if classification is perfect\n        if estimator_error <= 0:\n            return sample_weight, 1.0, 0.0\n\n        # Construct y coding as described in Zhu et al [2]:\n        #\n        #    y_k = 1 if c == k else -1 / (K - 1)\n        #\n        # where K == n_classes_ and c, k in [0, K) are indices along the second\n        # axis of the y coding with c being the index corresponding to the true\n        # class label.\n        n_classes = self.n_classes_\n        classes = self.classes_\n  
      y_codes = np.array([-1.0 / (n_classes - 1), 1.0])\n        y_coding = y_codes.take(classes == y[:, np.newaxis])\n\n        # Displace zero probabilities so the log is defined.\n        # Also fix negative elements which may occur with\n        # negative sample weights.\n        proba = y_predict_proba  # alias for readability\n        np.clip(proba, np.finfo(proba.dtype).eps, None, out=proba)\n\n        # Boost weight using multi-class AdaBoost SAMME.R alg\n        estimator_weight = (\n            -1.0\n            * self.learning_rate\n            * ((n_classes - 1.0) / n_classes)\n            * (y_coding * np.log(y_predict_proba)).sum(axis=1)\n        )\n\n        # Only boost the weights if it will fit again\n        if not iboost == self.n_estimators - 1:\n            # Only boost positive weights\n            sample_weight *= np.exp(\n                estimator_weight * ((sample_weight > 0) | (estimator_weight < 0))\n            )\n\n        return sample_weight, 1.0, estimator_error\n\n    def _boost_discrete(self, iboost, X, y, sample_weight, random_state):\n        \"\"\"Implement a single boost using the SAMME discrete algorithm.\"\"\"\n        estimator, sampler = self._make_sampler_estimator(random_state=random_state)\n\n        X_res, y_res = sampler.fit_resample(X, y)\n        sample_weight_res = _safe_indexing(sample_weight, sampler.sample_indices_)\n        estimator.fit(X_res, y_res, sample_weight=sample_weight_res)\n\n        y_predict = estimator.predict(X)\n\n        if iboost == 0:\n            self.classes_ = getattr(estimator, \"classes_\", None)\n            self.n_classes_ = len(self.classes_)\n\n        # Instances incorrectly classified\n        incorrect = y_predict != y\n\n        # Error fraction\n        estimator_error = np.mean(np.average(incorrect, weights=sample_weight, axis=0))\n\n        # Stop if classification is perfect\n        if estimator_error <= 0:\n            return sample_weight, 1.0, 0.0\n\n        n_classes = 
self.n_classes_\n\n        # Stop if the error is at least as bad as random guessing\n        if estimator_error >= 1.0 - (1.0 / n_classes):\n            self.estimators_.pop(-1)\n            self.samplers_.pop(-1)\n            self.pipelines_.pop(-1)\n            if len(self.estimators_) == 0:\n                raise ValueError(\n                    \"BaseClassifier in AdaBoostClassifier \"\n                    \"ensemble is worse than random, ensemble \"\n                    \"can not be fit.\"\n                )\n            return None, None, None\n\n        # Boost weight using multi-class AdaBoost SAMME alg\n        estimator_weight = self.learning_rate * (\n            np.log((1.0 - estimator_error) / estimator_error) + np.log(n_classes - 1.0)\n        )\n\n        # Only boost the weights if I will fit again\n        if not iboost == self.n_estimators - 1:\n            # Only boost positive weights\n            sample_weight *= np.exp(estimator_weight * incorrect * (sample_weight > 0))\n\n        return sample_weight, estimator_weight, estimator_error\n\n    # TODO(0.14): remove this method because algorithm is deprecated.\n    def _boost(self, iboost, X, y, sample_weight, random_state):\n        if self.algorithm != \"deprecated\":\n            warnings.warn(\n                (\n                    \"`algorithm` parameter is deprecated in 0.12 and will be removed in\"\n                    \" 0.14. In the future, the SAMME algorithm will always be used.\"\n                ),\n                FutureWarning,\n            )\n        if self.algorithm == \"SAMME.R\":\n            return self._boost_real(iboost, X, y, sample_weight, random_state)\n\n        else:  # elif self.algorithm == \"SAMME\":\n            return self._boost_discrete(iboost, X, y, sample_weight, random_state)\n"
  },
  {
    "path": "imblearn/ensemble/tests/__init__.py",
    "content": ""
  },
  {
    "path": "imblearn/ensemble/tests/test_bagging.py",
    "content": "\"\"\"Test the module ensemble classifiers.\"\"\"\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Christos Aridas\n# License: MIT\n\nfrom collections import Counter\n\nimport numpy as np\nimport pytest\nfrom sklearn.cluster import KMeans\nfrom sklearn.datasets import load_iris, make_classification, make_hastie_10_2\nfrom sklearn.dummy import DummyClassifier\nfrom sklearn.feature_selection import SelectKBest\nfrom sklearn.linear_model import LogisticRegression, Perceptron\nfrom sklearn.model_selection import GridSearchCV, ParameterGrid, train_test_split\nfrom sklearn.neighbors import KNeighborsClassifier\nfrom sklearn.svm import SVC\nfrom sklearn.tree import DecisionTreeClassifier\nfrom sklearn.utils._testing import (\n    assert_allclose,\n    assert_array_almost_equal,\n    assert_array_equal,\n)\n\nfrom imblearn import FunctionSampler\nfrom imblearn.datasets import make_imbalance\nfrom imblearn.ensemble import BalancedBaggingClassifier\nfrom imblearn.over_sampling import SMOTE, RandomOverSampler\nfrom imblearn.pipeline import make_pipeline\nfrom imblearn.under_sampling import ClusterCentroids, RandomUnderSampler\n\niris = load_iris()\n\n\n@pytest.mark.parametrize(\n    \"estimator\",\n    [\n        None,\n        DummyClassifier(strategy=\"prior\"),\n        Perceptron(max_iter=1000, tol=1e-3),\n        DecisionTreeClassifier(),\n        KNeighborsClassifier(),\n        SVC(gamma=\"scale\"),\n    ],\n)\n@pytest.mark.parametrize(\n    \"params\",\n    ParameterGrid(\n        {\n            \"max_samples\": [0.5, 1.0],\n            \"max_features\": [1, 2, 4],\n            \"bootstrap\": [True, False],\n            \"bootstrap_features\": [True, False],\n        }\n    ),\n)\ndef test_balanced_bagging_classifier(estimator, params):\n    # Check classification for various parameter settings.\n    X, y = make_imbalance(\n        iris.data,\n        iris.target,\n        sampling_strategy={0: 20, 1: 25, 2: 50},\n        
random_state=0,\n    )\n    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)\n\n    bag = BalancedBaggingClassifier(estimator=estimator, random_state=0, **params).fit(\n        X_train, y_train\n    )\n    bag.predict(X_test)\n    bag.predict_proba(X_test)\n    bag.score(X_test, y_test)\n    if hasattr(estimator, \"decision_function\"):\n        bag.decision_function(X_test)\n\n\ndef test_bootstrap_samples():\n    # Test that bootstrapping samples generate non-perfect base estimators.\n    X, y = make_imbalance(\n        iris.data,\n        iris.target,\n        sampling_strategy={0: 20, 1: 25, 2: 50},\n        random_state=0,\n    )\n    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)\n\n    estimator = DecisionTreeClassifier().fit(X_train, y_train)\n\n    # without bootstrap, all trees are perfect on the training set\n    # disable the resampling by passing an empty dictionary.\n    ensemble = BalancedBaggingClassifier(\n        estimator=DecisionTreeClassifier(),\n        max_samples=1.0,\n        bootstrap=False,\n        n_estimators=10,\n        sampling_strategy={},\n        random_state=0,\n    ).fit(X_train, y_train)\n\n    assert ensemble.score(X_train, y_train) == estimator.score(X_train, y_train)\n\n    # with bootstrap, trees are no longer perfect on the training set\n    ensemble = BalancedBaggingClassifier(\n        estimator=DecisionTreeClassifier(),\n        max_samples=1.0,\n        bootstrap=True,\n        random_state=0,\n    ).fit(X_train, y_train)\n\n    assert ensemble.score(X_train, y_train) < estimator.score(X_train, y_train)\n\n\ndef test_bootstrap_features():\n    # Test that bootstrapping features may generate duplicate features.\n    X, y = make_imbalance(\n        iris.data,\n        iris.target,\n        sampling_strategy={0: 20, 1: 25, 2: 50},\n        random_state=0,\n    )\n    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)\n\n    ensemble = 
BalancedBaggingClassifier(\n        estimator=DecisionTreeClassifier(),\n        max_features=1.0,\n        bootstrap_features=False,\n        random_state=0,\n    ).fit(X_train, y_train)\n\n    for features in ensemble.estimators_features_:\n        assert np.unique(features).shape[0] == X.shape[1]\n\n    ensemble = BalancedBaggingClassifier(\n        estimator=DecisionTreeClassifier(),\n        max_features=1.0,\n        bootstrap_features=True,\n        random_state=0,\n    ).fit(X_train, y_train)\n\n    unique_features = [\n        np.unique(features).shape[0] for features in ensemble.estimators_features_\n    ]\n    assert np.median(unique_features) < X.shape[1]\n\n\ndef test_probability():\n    # Predict probabilities.\n    X, y = make_imbalance(\n        iris.data,\n        iris.target,\n        sampling_strategy={0: 20, 1: 25, 2: 50},\n        random_state=0,\n    )\n    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)\n\n    with np.errstate(divide=\"ignore\", invalid=\"ignore\"):\n        # Normal case\n        ensemble = BalancedBaggingClassifier(\n            estimator=DecisionTreeClassifier(), random_state=0\n        ).fit(X_train, y_train)\n\n        assert_array_almost_equal(\n            np.sum(ensemble.predict_proba(X_test), axis=1),\n            np.ones(len(X_test)),\n        )\n\n        assert_array_almost_equal(\n            ensemble.predict_proba(X_test),\n            np.exp(ensemble.predict_log_proba(X_test)),\n        )\n\n        # Degenerate case, where some classes are missing\n        ensemble = BalancedBaggingClassifier(\n            estimator=LogisticRegression(solver=\"lbfgs\"),\n            random_state=0,\n            max_samples=5,\n        )\n        ensemble.fit(X_train, y_train)\n\n        assert_array_almost_equal(\n            np.sum(ensemble.predict_proba(X_test), axis=1),\n            np.ones(len(X_test)),\n        )\n\n        assert_array_almost_equal(\n            ensemble.predict_proba(X_test),\n 
           np.exp(ensemble.predict_log_proba(X_test)),\n        )\n\n\ndef test_oob_score_classification():\n    # Check that oob prediction is a good estimation of the generalization\n    # error.\n    X, y = make_imbalance(\n        iris.data,\n        iris.target,\n        sampling_strategy={0: 20, 1: 25, 2: 50},\n        random_state=0,\n    )\n    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)\n\n    for estimator in [DecisionTreeClassifier(), SVC(gamma=\"scale\")]:\n        clf = BalancedBaggingClassifier(\n            estimator=estimator,\n            n_estimators=100,\n            bootstrap=True,\n            oob_score=True,\n            random_state=0,\n        ).fit(X_train, y_train)\n\n        test_score = clf.score(X_test, y_test)\n\n        assert abs(test_score - clf.oob_score_) < 0.1\n\n        # Test with few estimators\n        with pytest.warns(UserWarning):\n            BalancedBaggingClassifier(\n                estimator=estimator,\n                n_estimators=1,\n                bootstrap=True,\n                oob_score=True,\n                random_state=0,\n            ).fit(X_train, y_train)\n\n\ndef test_single_estimator():\n    # Check singleton ensembles.\n    X, y = make_imbalance(\n        iris.data,\n        iris.target,\n        sampling_strategy={0: 20, 1: 25, 2: 50},\n        random_state=0,\n    )\n    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)\n\n    clf1 = BalancedBaggingClassifier(\n        estimator=KNeighborsClassifier(),\n        n_estimators=1,\n        bootstrap=False,\n        bootstrap_features=False,\n        random_state=0,\n    ).fit(X_train, y_train)\n\n    clf2 = make_pipeline(\n        RandomUnderSampler(random_state=clf1.estimators_[0].steps[0][1].random_state),\n        KNeighborsClassifier(),\n    ).fit(X_train, y_train)\n\n    assert_array_equal(clf1.predict(X_test), clf2.predict(X_test))\n\n\ndef test_gridsearch():\n    # Check that bagging ensembles 
can be grid-searched.\n    # Transform iris into a binary classification task\n    X, y = iris.data, iris.target.copy()\n    y[y == 2] = 1\n\n    # Grid search with scoring based on decision_function\n    parameters = {\"n_estimators\": (1, 2), \"estimator__C\": (1, 2)}\n\n    GridSearchCV(\n        BalancedBaggingClassifier(SVC(gamma=\"scale\")),\n        parameters,\n        cv=3,\n        scoring=\"roc_auc\",\n    ).fit(X, y)\n\n\ndef test_estimator():\n    # Check estimator and its default values.\n    X, y = make_imbalance(\n        iris.data,\n        iris.target,\n        sampling_strategy={0: 20, 1: 25, 2: 50},\n        random_state=0,\n    )\n    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)\n\n    ensemble = BalancedBaggingClassifier(None, n_jobs=3, random_state=0).fit(\n        X_train, y_train\n    )\n\n    assert isinstance(ensemble.estimator_.steps[-1][1], DecisionTreeClassifier)\n\n    ensemble = BalancedBaggingClassifier(\n        DecisionTreeClassifier(), n_jobs=3, random_state=0\n    ).fit(X_train, y_train)\n\n    assert isinstance(ensemble.estimator_.steps[-1][1], DecisionTreeClassifier)\n\n    ensemble = BalancedBaggingClassifier(\n        Perceptron(max_iter=1000, tol=1e-3), n_jobs=3, random_state=0\n    ).fit(X_train, y_train)\n\n    assert isinstance(ensemble.estimator_.steps[-1][1], Perceptron)\n\n\ndef test_bagging_with_pipeline():\n    X, y = make_imbalance(\n        iris.data,\n        iris.target,\n        sampling_strategy={0: 20, 1: 25, 2: 50},\n        random_state=0,\n    )\n    estimator = BalancedBaggingClassifier(\n        make_pipeline(SelectKBest(k=1), DecisionTreeClassifier()),\n        max_features=2,\n    )\n    estimator.fit(X, y).predict(X)\n\n\ndef test_warm_start(random_state=42):\n    # Test if fitting incrementally with warm start gives a forest of the\n    # right size and the same results as a normal fit.\n    X, y = make_hastie_10_2(n_samples=20, random_state=1)\n\n    clf_ws = None\n    
for n_estimators in [5, 10]:\n        if clf_ws is None:\n            clf_ws = BalancedBaggingClassifier(\n                n_estimators=n_estimators,\n                random_state=random_state,\n                warm_start=True,\n            )\n        else:\n            clf_ws.set_params(n_estimators=n_estimators)\n        clf_ws.fit(X, y)\n        assert len(clf_ws) == n_estimators\n\n    clf_no_ws = BalancedBaggingClassifier(\n        n_estimators=10, random_state=random_state, warm_start=False\n    )\n    clf_no_ws.fit(X, y)\n\n    assert {pipe.steps[-1][1].random_state for pipe in clf_ws} == {\n        pipe.steps[-1][1].random_state for pipe in clf_no_ws\n    }\n\n\ndef test_warm_start_smaller_n_estimators():\n    # Test if warm start'ed second fit with smaller n_estimators raises error.\n    X, y = make_hastie_10_2(n_samples=20, random_state=1)\n    clf = BalancedBaggingClassifier(n_estimators=5, warm_start=True)\n    clf.fit(X, y)\n    clf.set_params(n_estimators=4)\n    with pytest.raises(ValueError):\n        clf.fit(X, y)\n\n\ndef test_warm_start_equal_n_estimators():\n    # Test that nothing happens when fitting without increasing n_estimators\n    X, y = make_hastie_10_2(n_samples=20, random_state=1)\n    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=43)\n\n    clf = BalancedBaggingClassifier(n_estimators=5, warm_start=True, random_state=83)\n    clf.fit(X_train, y_train)\n\n    y_pred = clf.predict(X_test)\n    # modify X to nonsense values, this should not change anything\n    X_train += 1.0\n\n    warn_msg = \"Warm-start fitting without increasing n_estimators does not\"\n    with pytest.warns(UserWarning, match=warn_msg):\n        clf.fit(X_train, y_train)\n    assert_array_equal(y_pred, clf.predict(X_test))\n\n\ndef test_warm_start_equivalence():\n    # warm started classifier with 5+5 estimators should be equivalent to\n    # one classifier with 10 estimators\n    X, y = make_hastie_10_2(n_samples=20, random_state=1)\n    
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=43)\n\n    clf_ws = BalancedBaggingClassifier(\n        n_estimators=5, warm_start=True, random_state=3141\n    )\n    clf_ws.fit(X_train, y_train)\n    clf_ws.set_params(n_estimators=10)\n    clf_ws.fit(X_train, y_train)\n    y1 = clf_ws.predict(X_test)\n\n    clf = BalancedBaggingClassifier(\n        n_estimators=10, warm_start=False, random_state=3141\n    )\n    clf.fit(X_train, y_train)\n    y2 = clf.predict(X_test)\n\n    assert_array_almost_equal(y1, y2)\n\n\ndef test_warm_start_with_oob_score_fails():\n    # Check using oob_score and warm_start simultaneously fails\n    X, y = make_hastie_10_2(n_samples=20, random_state=1)\n    clf = BalancedBaggingClassifier(n_estimators=5, warm_start=True, oob_score=True)\n    with pytest.raises(ValueError):\n        clf.fit(X, y)\n\n\ndef test_oob_score_removed_on_warm_start():\n    X, y = make_hastie_10_2(n_samples=2000, random_state=1)\n\n    clf = BalancedBaggingClassifier(n_estimators=50, oob_score=True)\n    clf.fit(X, y)\n\n    clf.set_params(warm_start=True, oob_score=False, n_estimators=100)\n    clf.fit(X, y)\n\n    with pytest.raises(AttributeError):\n        getattr(clf, \"oob_score_\")\n\n\ndef test_oob_score_consistency():\n    # Make sure OOB scores are identical when random_state, estimator, and\n    # training data are fixed and fitting is done twice\n    X, y = make_hastie_10_2(n_samples=200, random_state=1)\n    bagging = BalancedBaggingClassifier(\n        KNeighborsClassifier(),\n        max_samples=0.5,\n        max_features=0.5,\n        oob_score=True,\n        random_state=1,\n    )\n    assert bagging.fit(X, y).oob_score_ == bagging.fit(X, y).oob_score_\n\n\ndef test_estimators_samples():\n    # Check that format of estimators_samples_ is correct and that results\n    # generated at fit time can be identically reproduced at a later time\n    # using data saved in object attributes.\n    X, y = make_hastie_10_2(n_samples=200, 
random_state=1)\n\n    # remap the y outside of the BalancedBaggingclassifier\n    # _, y = np.unique(y, return_inverse=True)\n    bagging = BalancedBaggingClassifier(\n        LogisticRegression(),\n        max_samples=0.5,\n        max_features=0.5,\n        random_state=1,\n        bootstrap=False,\n    )\n    bagging.fit(X, y)\n\n    # Get relevant attributes\n    estimators_samples = bagging.estimators_samples_\n    estimators_features = bagging.estimators_features_\n    estimators = bagging.estimators_\n\n    # Test for correct formatting\n    assert len(estimators_samples) == len(estimators)\n    assert len(estimators_samples[0]) == len(X) // 2\n    assert estimators_samples[0].dtype.kind == \"i\"\n\n    # Re-fit single estimator to test for consistent sampling\n    estimator_index = 0\n    estimator_samples = estimators_samples[estimator_index]\n    estimator_features = estimators_features[estimator_index]\n    estimator = estimators[estimator_index]\n\n    X_train = (X[estimator_samples])[:, estimator_features]\n    y_train = y[estimator_samples]\n\n    orig_coefs = estimator.steps[-1][1].coef_\n    estimator.fit(X_train, y_train)\n    new_coefs = estimator.steps[-1][1].coef_\n\n    assert_allclose(orig_coefs, new_coefs)\n\n\ndef test_max_samples_consistency():\n    # Make sure validated max_samples and original max_samples are identical\n    # when valid integer max_samples supplied by user\n    max_samples = 100\n    X, y = make_hastie_10_2(n_samples=2 * max_samples, random_state=1)\n    bagging = BalancedBaggingClassifier(\n        KNeighborsClassifier(),\n        max_samples=max_samples,\n        max_features=0.5,\n        random_state=1,\n    )\n    bagging.fit(X, y)\n    assert bagging._max_samples == max_samples\n\n\nclass CountDecisionTreeClassifier(DecisionTreeClassifier):\n    \"\"\"DecisionTreeClassifier that will memorize the number of samples seen\n    at fit.\"\"\"\n\n    def fit(self, X, y, sample_weight=None):\n        self.class_counts_ = 
Counter(y)\n        return super().fit(X, y, sample_weight=sample_weight)\n\n\n@pytest.mark.filterwarnings(\"ignore:Number of distinct clusters\")\n@pytest.mark.parametrize(\n    \"sampler, n_samples_bootstrap\",\n    [\n        (None, 15),\n        (RandomUnderSampler(), 15),  # under-sampling with sample_indices_\n        (\n            ClusterCentroids(estimator=KMeans(n_init=1)),\n            15,\n        ),  # under-sampling without sample_indices_\n        (RandomOverSampler(), 40),  # over-sampling with sample_indices_\n        (SMOTE(), 40),  # over-sampling without sample_indices_\n    ],\n)\ndef test_balanced_bagging_classifier_samplers(sampler, n_samples_bootstrap):\n    # check that we can pass any kind of sampler to a bagging classifier\n    X, y = make_imbalance(\n        iris.data,\n        iris.target,\n        sampling_strategy={0: 20, 1: 25, 2: 50},\n        random_state=0,\n    )\n    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)\n    clf = BalancedBaggingClassifier(\n        estimator=CountDecisionTreeClassifier(),\n        n_estimators=2,\n        sampler=sampler,\n        random_state=0,\n    )\n    clf.fit(X_train, y_train)\n    clf.predict(X_test)\n\n    # check that we have balanced class with the right counts of class\n    # sample depending on the sampling strategy\n    assert_array_equal(\n        list(clf.estimators_[0][-1].class_counts_.values()), n_samples_bootstrap\n    )\n\n\n@pytest.mark.parametrize(\"replace\", [True, False])\ndef test_balanced_bagging_classifier_with_function_sampler(replace):\n    # check that we can provide a FunctionSampler in BalancedBaggingClassifier\n    X, y = make_classification(\n        n_samples=1_000,\n        n_features=10,\n        n_classes=2,\n        weights=[0.3, 0.7],\n        random_state=0,\n    )\n\n    def roughly_balanced_bagging(X, y, replace=False):\n        \"\"\"Implementation of Roughly Balanced Bagging for binary problem.\"\"\"\n        # find the minority 
and majority classes\n        class_counts = Counter(y)\n        majority_class = max(class_counts, key=class_counts.get)\n        minority_class = min(class_counts, key=class_counts.get)\n\n        # compute the number of samples to draw from the majority class using\n        # a negative binomial distribution\n        n_minority_class = class_counts[minority_class]\n        n_majority_resampled = np.random.negative_binomial(n=n_minority_class, p=0.5)\n\n        # draw randomly with or without replacement\n        majority_indices = np.random.choice(\n            np.flatnonzero(y == majority_class),\n            size=n_majority_resampled,\n            replace=replace,\n        )\n        minority_indices = np.random.choice(\n            np.flatnonzero(y == minority_class),\n            size=n_minority_class,\n            replace=replace,\n        )\n        indices = np.hstack([majority_indices, minority_indices])\n\n        return X[indices], y[indices]\n\n    # Roughly Balanced Bagging\n    rbb = BalancedBaggingClassifier(\n        estimator=CountDecisionTreeClassifier(random_state=0),\n        n_estimators=2,\n        sampler=FunctionSampler(\n            func=roughly_balanced_bagging, kw_args={\"replace\": replace}\n        ),\n        random_state=0,\n    )\n    rbb.fit(X, y)\n\n    for estimator in rbb.estimators_:\n        class_counts = estimator[-1].class_counts_\n        assert (class_counts[0] / class_counts[1]) > 0.78\n"
  },
  {
    "path": "imblearn/ensemble/tests/test_easy_ensemble.py",
    "content": "\"\"\"Test the module easy ensemble.\"\"\"\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Christos Aridas\n# License: MIT\n\nimport numpy as np\nimport pytest\nfrom sklearn.datasets import load_iris, make_hastie_10_2\nfrom sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier\nfrom sklearn.feature_selection import SelectKBest\nfrom sklearn.model_selection import GridSearchCV, train_test_split\nfrom sklearn.utils._testing import assert_allclose, assert_array_equal\n\nfrom imblearn.datasets import make_imbalance\nfrom imblearn.ensemble import EasyEnsembleClassifier\nfrom imblearn.pipeline import make_pipeline\nfrom imblearn.under_sampling import RandomUnderSampler\n\niris = load_iris()\n\n# Generate a global dataset to use\nRND_SEED = 0\nX = np.array(\n    [\n        [0.5220963, 0.11349303],\n        [0.59091459, 0.40692742],\n        [1.10915364, 0.05718352],\n        [0.22039505, 0.26469445],\n        [1.35269503, 0.44812421],\n        [0.85117925, 1.0185556],\n        [-2.10724436, 0.70263997],\n        [-0.23627356, 0.30254174],\n        [-1.23195149, 0.15427291],\n        [-0.58539673, 0.62515052],\n    ]\n)\nY = np.array([1, 2, 2, 2, 1, 0, 1, 1, 1, 0])\n\n\n@pytest.mark.parametrize(\"n_estimators\", [10, 20])\n@pytest.mark.parametrize(\n    \"estimator\",\n    [\n        GradientBoostingClassifier(n_estimators=5),\n        GradientBoostingClassifier(n_estimators=10),\n    ],\n)\ndef test_easy_ensemble_classifier(n_estimators, estimator):\n    # Check classification for various parameter settings.\n    X, y = make_imbalance(\n        iris.data,\n        iris.target,\n        sampling_strategy={0: 20, 1: 25, 2: 50},\n        random_state=0,\n    )\n    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)\n\n    eec = EasyEnsembleClassifier(\n        n_estimators=n_estimators,\n        estimator=estimator,\n        n_jobs=-1,\n        random_state=RND_SEED,\n    )\n    eec.fit(X_train, 
y_train).score(X_test, y_test)\n    assert len(eec.estimators_) == n_estimators\n    for est in eec.estimators_:\n        assert len(est.named_steps[\"classifier\"]) == estimator.n_estimators\n    # test the different prediction function\n    eec.predict(X_test)\n    eec.predict_proba(X_test)\n    eec.predict_log_proba(X_test)\n    eec.decision_function(X_test)\n\n\ndef test_estimator():\n    # Check estimator and its default values.\n    X, y = make_imbalance(\n        iris.data,\n        iris.target,\n        sampling_strategy={0: 20, 1: 25, 2: 50},\n        random_state=0,\n    )\n    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)\n\n    ensemble = EasyEnsembleClassifier(2, None, n_jobs=-1, random_state=0).fit(\n        X_train, y_train\n    )\n\n    assert isinstance(ensemble.estimator_.steps[-1][1], AdaBoostClassifier)\n\n    ensemble = EasyEnsembleClassifier(\n        2, GradientBoostingClassifier(), n_jobs=-1, random_state=0\n    ).fit(X_train, y_train)\n\n    assert isinstance(ensemble.estimator_.steps[-1][1], GradientBoostingClassifier)\n\n\ndef test_bagging_with_pipeline():\n    X, y = make_imbalance(\n        iris.data,\n        iris.target,\n        sampling_strategy={0: 20, 1: 25, 2: 50},\n        random_state=0,\n    )\n    estimator = EasyEnsembleClassifier(\n        n_estimators=2,\n        estimator=make_pipeline(SelectKBest(k=1), GradientBoostingClassifier()),\n    )\n    estimator.fit(X, y).predict(X)\n\n\ndef test_warm_start(random_state=42):\n    # Test if fitting incrementally with warm start gives a forest of the\n    # right size and the same results as a normal fit.\n    X, y = make_hastie_10_2(n_samples=20, random_state=1)\n\n    clf_ws = None\n    for n_estimators in [5, 10]:\n        if clf_ws is None:\n            clf_ws = EasyEnsembleClassifier(\n                n_estimators=n_estimators,\n                random_state=random_state,\n                warm_start=True,\n            )\n        else:\n            
clf_ws.set_params(n_estimators=n_estimators)\n        clf_ws.fit(X, y)\n        assert len(clf_ws) == n_estimators\n\n    clf_no_ws = EasyEnsembleClassifier(\n        n_estimators=10, random_state=random_state, warm_start=False\n    )\n    clf_no_ws.fit(X, y)\n\n    assert {pipe.steps[-1][1].random_state for pipe in clf_ws} == {\n        pipe.steps[-1][1].random_state for pipe in clf_no_ws\n    }\n\n\ndef test_warm_start_smaller_n_estimators():\n    # Test if warm start'ed second fit with smaller n_estimators raises error.\n    X, y = make_hastie_10_2(n_samples=20, random_state=1)\n    clf = EasyEnsembleClassifier(n_estimators=5, warm_start=True)\n    clf.fit(X, y)\n    clf.set_params(n_estimators=4)\n    with pytest.raises(ValueError):\n        clf.fit(X, y)\n\n\ndef test_warm_start_equal_n_estimators():\n    # Test that nothing happens when fitting without increasing n_estimators\n    X, y = make_hastie_10_2(n_samples=20, random_state=1)\n    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=43)\n\n    clf = EasyEnsembleClassifier(n_estimators=5, warm_start=True, random_state=83)\n    clf.fit(X_train, y_train)\n\n    y_pred = clf.predict(X_test)\n    # modify X to nonsense values, this should not change anything\n    X_train += 1.0\n\n    warn_msg = \"Warm-start fitting without increasing n_estimators\"\n    with pytest.warns(UserWarning, match=warn_msg):\n        clf.fit(X_train, y_train)\n    assert_array_equal(y_pred, clf.predict(X_test))\n\n\ndef test_warm_start_equivalence():\n    # warm started classifier with 5+5 estimators should be equivalent to\n    # one classifier with 10 estimators\n    X, y = make_hastie_10_2(n_samples=20, random_state=1)\n    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=43)\n\n    clf_ws = EasyEnsembleClassifier(n_estimators=5, warm_start=True, random_state=3141)\n    clf_ws.fit(X_train, y_train)\n    clf_ws.set_params(n_estimators=10)\n    clf_ws.fit(X_train, y_train)\n    y1 = 
clf_ws.predict(X_test)\n\n    clf = EasyEnsembleClassifier(n_estimators=10, warm_start=False, random_state=3141)\n    clf.fit(X_train, y_train)\n    y2 = clf.predict(X_test)\n\n    assert_allclose(y1, y2)\n\n\ndef test_easy_ensemble_classifier_single_estimator():\n    X, y = make_imbalance(\n        iris.data,\n        iris.target,\n        sampling_strategy={0: 20, 1: 25, 2: 50},\n        random_state=0,\n    )\n    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)\n\n    clf1 = EasyEnsembleClassifier(n_estimators=1, random_state=0).fit(X_train, y_train)\n    clf2 = make_pipeline(\n        RandomUnderSampler(random_state=0),\n        GradientBoostingClassifier(random_state=0),\n    ).fit(X_train, y_train)\n\n    assert_array_equal(clf1.predict(X_test), clf2.predict(X_test))\n\n\ndef test_easy_ensemble_classifier_grid_search():\n    X, y = make_imbalance(\n        iris.data,\n        iris.target,\n        sampling_strategy={0: 20, 1: 25, 2: 50},\n        random_state=0,\n    )\n\n    parameters = {\n        \"n_estimators\": [1, 2],\n        \"estimator__n_estimators\": [3, 4],\n    }\n    grid_search = GridSearchCV(\n        EasyEnsembleClassifier(estimator=GradientBoostingClassifier()),\n        parameters,\n        cv=5,\n    )\n    grid_search.fit(X, y)\n"
  },
  {
    "path": "imblearn/ensemble/tests/test_forest.py",
    "content": "import numpy as np\nimport pytest\nfrom sklearn.datasets import make_classification\nfrom sklearn.model_selection import GridSearchCV, train_test_split\nfrom sklearn.utils._testing import assert_allclose, assert_array_equal\nfrom sklearn.utils.fixes import parse_version\nfrom sklearn_compat._sklearn_compat import sklearn_version\n\nfrom imblearn.ensemble import BalancedRandomForestClassifier\n\n\n@pytest.fixture\ndef imbalanced_dataset():\n    return make_classification(\n        n_samples=10000,\n        n_features=2,\n        n_informative=2,\n        n_redundant=0,\n        n_repeated=0,\n        n_classes=3,\n        n_clusters_per_class=1,\n        weights=[0.01, 0.05, 0.94],\n        class_sep=0.8,\n        random_state=0,\n    )\n\n\ndef test_balanced_random_forest_error_warning_warm_start(imbalanced_dataset):\n    brf = BalancedRandomForestClassifier(\n        n_estimators=5, sampling_strategy=\"all\", replacement=True, bootstrap=False\n    )\n    brf.fit(*imbalanced_dataset)\n\n    with pytest.raises(ValueError, match=\"must be larger or equal to\"):\n        brf.set_params(warm_start=True, n_estimators=2)\n        brf.fit(*imbalanced_dataset)\n\n    brf.set_params(n_estimators=10)\n    brf.fit(*imbalanced_dataset)\n\n    with pytest.warns(UserWarning, match=\"Warm-start fitting without\"):\n        brf.fit(*imbalanced_dataset)\n\n\ndef test_balanced_random_forest(imbalanced_dataset):\n    n_estimators = 10\n    brf = BalancedRandomForestClassifier(\n        n_estimators=n_estimators,\n        random_state=0,\n        sampling_strategy=\"all\",\n        replacement=True,\n        bootstrap=False,\n    )\n    brf.fit(*imbalanced_dataset)\n\n    assert len(brf.samplers_) == n_estimators\n    assert len(brf.estimators_) == n_estimators\n    assert len(brf.pipelines_) == n_estimators\n    assert len(brf.feature_importances_) == imbalanced_dataset[0].shape[1]\n\n\ndef test_balanced_random_forest_attributes(imbalanced_dataset):\n    X, y = 
imbalanced_dataset\n    n_estimators = 10\n    brf = BalancedRandomForestClassifier(\n        n_estimators=n_estimators,\n        random_state=0,\n        sampling_strategy=\"all\",\n        replacement=True,\n        bootstrap=False,\n    )\n    brf.fit(X, y)\n\n    for idx in range(n_estimators):\n        X_res, y_res = brf.samplers_[idx].fit_resample(X, y)\n        X_res_2, y_res_2 = (\n            brf.pipelines_[idx].named_steps[\"randomundersampler\"].fit_resample(X, y)\n        )\n        assert_allclose(X_res, X_res_2)\n        assert_array_equal(y_res, y_res_2)\n\n        y_pred = brf.estimators_[idx].fit(X_res, y_res).predict(X)\n        y_pred_2 = brf.pipelines_[idx].fit(X, y).predict(X)\n        assert_array_equal(y_pred, y_pred_2)\n\n        y_pred = brf.estimators_[idx].fit(X_res, y_res).predict_proba(X)\n        y_pred_2 = brf.pipelines_[idx].fit(X, y).predict_proba(X)\n        assert_array_equal(y_pred, y_pred_2)\n\n\ndef test_balanced_random_forest_sample_weight(imbalanced_dataset):\n    rng = np.random.RandomState(42)\n    X, y = imbalanced_dataset\n    sample_weight = rng.rand(y.shape[0])\n    brf = BalancedRandomForestClassifier(\n        n_estimators=5,\n        random_state=0,\n        sampling_strategy=\"all\",\n        replacement=True,\n        bootstrap=False,\n    )\n    brf.fit(X, y, sample_weight)\n\n\n@pytest.mark.filterwarnings(\"ignore:Some inputs do not have OOB scores\")\ndef test_balanced_random_forest_oob(imbalanced_dataset):\n    X, y = imbalanced_dataset\n    X_train, X_test, y_train, y_test = train_test_split(\n        X, y, random_state=42, stratify=y\n    )\n    est = BalancedRandomForestClassifier(\n        oob_score=True,\n        random_state=0,\n        n_estimators=1000,\n        min_samples_leaf=2,\n        sampling_strategy=\"all\",\n        replacement=True,\n        bootstrap=True,\n    )\n\n    est.fit(X_train, y_train)\n    test_score = est.score(X_test, y_test)\n\n    assert abs(test_score - est.oob_score_) < 
0.1\n\n    # Check warning if not enough estimators\n    est = BalancedRandomForestClassifier(\n        oob_score=True,\n        random_state=0,\n        n_estimators=1,\n        bootstrap=True,\n        sampling_strategy=\"all\",\n        replacement=True,\n    )\n    with pytest.warns(UserWarning), np.errstate(divide=\"ignore\", invalid=\"ignore\"):\n        est.fit(X, y)\n\n\ndef test_balanced_random_forest_grid_search(imbalanced_dataset):\n    brf = BalancedRandomForestClassifier(\n        sampling_strategy=\"all\", replacement=True, bootstrap=False\n    )\n    grid = GridSearchCV(brf, {\"n_estimators\": (1, 2), \"max_depth\": (1, 2)}, cv=3)\n    grid.fit(*imbalanced_dataset)\n\n\ndef test_little_tree_with_small_max_samples():\n    rng = np.random.RandomState(1)\n\n    X = rng.randn(10000, 2)\n    y = rng.randn(10000) > 0\n\n    # First fit with no restriction on max samples\n    est1 = BalancedRandomForestClassifier(\n        n_estimators=1,\n        random_state=rng,\n        max_samples=None,\n        sampling_strategy=\"all\",\n        replacement=True,\n        bootstrap=True,\n    )\n\n    # Second fit with max samples restricted to just 2\n    est2 = BalancedRandomForestClassifier(\n        n_estimators=1,\n        random_state=rng,\n        max_samples=2,\n        sampling_strategy=\"all\",\n        replacement=True,\n        bootstrap=True,\n    )\n\n    est1.fit(X, y)\n    est2.fit(X, y)\n\n    tree1 = est1.estimators_[0].tree_\n    tree2 = est2.estimators_[0].tree_\n\n    msg = \"Tree without `max_samples` restriction should have more nodes\"\n    assert tree1.node_count > tree2.node_count, msg\n\n\ndef test_balanced_random_forest_pruning(imbalanced_dataset):\n    brf = BalancedRandomForestClassifier(\n        sampling_strategy=\"all\", replacement=True, bootstrap=False\n    )\n    brf.fit(*imbalanced_dataset)\n    n_nodes_no_pruning = brf.estimators_[0].tree_.node_count\n\n    brf_pruned = BalancedRandomForestClassifier(\n        ccp_alpha=0.015, 
sampling_strategy=\"all\", replacement=True, bootstrap=False\n    )\n    brf_pruned.fit(*imbalanced_dataset)\n    n_nodes_pruning = brf_pruned.estimators_[0].tree_.node_count\n\n    assert n_nodes_no_pruning > n_nodes_pruning\n\n\n@pytest.mark.parametrize(\"ratio\", [0.5, 0.1])\n@pytest.mark.filterwarnings(\"ignore:Some inputs do not have OOB scores\")\ndef test_balanced_random_forest_oob_binomial(ratio):\n    # Regression test for #655: check that the oob score is close to 0.5\n    # for a binomial experiment.\n    rng = np.random.RandomState(42)\n    n_samples = 1000\n    X = np.arange(n_samples).reshape(-1, 1)\n    y = rng.binomial(1, ratio, size=n_samples)\n\n    erf = BalancedRandomForestClassifier(\n        oob_score=True,\n        random_state=42,\n        sampling_strategy=\"not minority\",\n        replacement=False,\n        bootstrap=True,\n    )\n    erf.fit(X, y)\n    assert np.abs(erf.oob_score_ - 0.5) < 0.1\n\n\n@pytest.mark.skipif(\n    parse_version(sklearn_version.base_version) < parse_version(\"1.4\"),\n    reason=\"scikit-learn should be >= 1.4\",\n)\ndef test_missing_values_is_resilient():\n    \"\"\"Check that forest can deal with missing values and has decent performance.\"\"\"\n\n    rng = np.random.RandomState(0)\n    n_samples, n_features = 1000, 10\n    X, y = make_classification(\n        n_samples=n_samples, n_features=n_features, random_state=rng\n    )\n\n    # Create dataset with missing values\n    X_missing = X.copy()\n    X_missing[rng.choice([False, True], size=X.shape, p=[0.95, 0.05])] = np.nan\n    assert np.isnan(X_missing).any()\n\n    X_missing_train, X_missing_test, y_train, y_test = train_test_split(\n        X_missing, y, random_state=0\n    )\n\n    # Train forest with missing values\n    forest_with_missing = BalancedRandomForestClassifier(\n        sampling_strategy=\"all\",\n        replacement=True,\n        bootstrap=False,\n        random_state=rng,\n        n_estimators=50,\n    )\n    
forest_with_missing.fit(X_missing_train, y_train)\n    score_with_missing = forest_with_missing.score(X_missing_test, y_test)\n\n    # Train forest without missing values\n    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)\n    forest = BalancedRandomForestClassifier(\n        sampling_strategy=\"all\",\n        replacement=True,\n        bootstrap=False,\n        random_state=rng,\n        n_estimators=50,\n    )\n    forest.fit(X_train, y_train)\n    score_without_missing = forest.score(X_test, y_test)\n\n    # Score is still 80 percent of the forest's score that had no missing values\n    assert score_with_missing >= 0.80 * score_without_missing\n\n\n@pytest.mark.skipif(\n    parse_version(sklearn_version.base_version) < parse_version(\"1.4\"),\n    reason=\"scikit-learn should be >= 1.4\",\n)\ndef test_missing_value_is_predictive():\n    \"\"\"Check that the forest learns when missing values are only present for\n    a predictive feature.\"\"\"\n    rng = np.random.RandomState(0)\n    n_samples = 300\n\n    X_non_predictive = rng.standard_normal(size=(n_samples, 10))\n    y = rng.randint(0, high=2, size=n_samples)\n\n    # Create a predictive feature using `y` and with some noise\n    X_random_mask = rng.choice([False, True], size=n_samples, p=[0.95, 0.05])\n    y_mask = y.astype(bool)\n    y_mask[X_random_mask] = ~y_mask[X_random_mask]\n\n    predictive_feature = rng.standard_normal(size=n_samples)\n    predictive_feature[y_mask] = np.nan\n    assert np.isnan(predictive_feature).any()\n\n    X_predictive = X_non_predictive.copy()\n    X_predictive[:, 5] = predictive_feature\n\n    (\n        X_predictive_train,\n        X_predictive_test,\n        X_non_predictive_train,\n        X_non_predictive_test,\n        y_train,\n        y_test,\n    ) = train_test_split(X_predictive, X_non_predictive, y, random_state=0)\n    forest_predictive = BalancedRandomForestClassifier(\n        sampling_strategy=\"all\", replacement=True, 
bootstrap=False, random_state=0\n    ).fit(X_predictive_train, y_train)\n    forest_non_predictive = BalancedRandomForestClassifier(\n        sampling_strategy=\"all\", replacement=True, bootstrap=False, random_state=0\n    ).fit(X_non_predictive_train, y_train)\n\n    predictive_test_score = forest_predictive.score(X_predictive_test, y_test)\n\n    assert predictive_test_score >= 0.75\n    assert predictive_test_score >= forest_non_predictive.score(\n        X_non_predictive_test, y_test\n    )\n"
  },
  {
    "path": "imblearn/ensemble/tests/test_weight_boosting.py",
    "content": "import numpy as np\nimport pytest\nfrom sklearn.datasets import make_classification\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.utils._testing import assert_array_equal\n\nfrom imblearn.ensemble import RUSBoostClassifier\n\n\n@pytest.fixture\ndef imbalanced_dataset():\n    return make_classification(\n        n_samples=10000,\n        n_features=3,\n        n_informative=2,\n        n_redundant=0,\n        n_repeated=0,\n        n_classes=3,\n        n_clusters_per_class=1,\n        weights=[0.01, 0.05, 0.94],\n        class_sep=0.8,\n        random_state=0,\n    )\n\n\ndef test_rusboost(imbalanced_dataset):\n    X, y = imbalanced_dataset\n    X_train, X_test, y_train, y_test = train_test_split(\n        X, y, stratify=y, random_state=1\n    )\n    classes = np.unique(y)\n\n    n_estimators = 500\n    rusboost = RUSBoostClassifier(n_estimators=n_estimators, random_state=0)\n    rusboost.fit(X_train, y_train)\n    assert_array_equal(classes, rusboost.classes_)\n\n    # check that we have an ensemble of samplers and estimators with a\n    # consistent size\n    assert len(rusboost.estimators_) > 1\n    assert len(rusboost.estimators_) == len(rusboost.samplers_)\n    assert len(rusboost.pipelines_) == len(rusboost.samplers_)\n\n    # each sampler in the ensemble should have different random state\n    assert len({sampler.random_state for sampler in rusboost.samplers_}) == len(\n        rusboost.samplers_\n    )\n    # each estimator in the ensemble should have different random state\n    assert len({est.random_state for est in rusboost.estimators_}) == len(\n        rusboost.estimators_\n    )\n\n    # check the consistency of the feature importances\n    assert len(rusboost.feature_importances_) == imbalanced_dataset[0].shape[1]\n\n    # check the consistency of the prediction outpus\n    y_pred = rusboost.predict_proba(X_test)\n    assert y_pred.shape[1] == len(classes)\n    assert rusboost.decision_function(X_test).shape[1] 
== len(classes)\n\n    score = rusboost.score(X_test, y_test)\n    assert score > 0.6, f\"Failed with score {score}\"\n\n    y_pred = rusboost.predict(X_test)\n    assert y_pred.shape == y_test.shape\n\n\ndef test_rusboost_sample_weight(imbalanced_dataset):\n    X, y = imbalanced_dataset\n    sample_weight = np.ones_like(y)\n    rusboost = RUSBoostClassifier(random_state=0)\n\n    # Predictions should be the same when sample_weight are all ones\n    y_pred_sample_weight = rusboost.fit(X, y, sample_weight).predict(X)\n    y_pred_no_sample_weight = rusboost.fit(X, y).predict(X)\n\n    assert_array_equal(y_pred_sample_weight, y_pred_no_sample_weight)\n\n    rng = np.random.RandomState(42)\n    sample_weight = rng.rand(y.shape[0])\n    y_pred_sample_weight = rusboost.fit(X, y, sample_weight).predict(X)\n\n    with pytest.raises(AssertionError):\n        assert_array_equal(y_pred_no_sample_weight, y_pred_sample_weight)\n\n\n@pytest.mark.parametrize(\"algorithm\", [\"SAMME\", \"SAMME.R\"])\ndef test_rusboost_algorithm(imbalanced_dataset, algorithm):\n    X, y = imbalanced_dataset\n\n    rusboost = RUSBoostClassifier(algorithm=algorithm)\n    warn_msg = \"`algorithm` parameter is deprecated in 0.12 and will be removed\"\n    with pytest.warns(FutureWarning, match=warn_msg):\n        rusboost.fit(X, y)\n"
  },
  {
    "path": "imblearn/exceptions.py",
    "content": "\"\"\"\nThe :mod:`imblearn.exceptions` module includes all custom warnings and error\nclasses and functions used across imbalanced-learn.\n\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# License: MIT\n\n\ndef raise_isinstance_error(variable_name, possible_type, variable):\n    \"\"\"Raise consistent error message for isinstance() function.\n\n    Parameters\n    ----------\n    variable_name : str\n        The name of the variable.\n\n    possible_type : type\n        The possible type of the variable.\n\n    variable : object\n        The variable to check.\n\n    Raises\n    ------\n    ValueError\n        If the instance is not of the possible type.\n    \"\"\"\n    raise ValueError(\n        f\"{variable_name} has to be one of {possible_type}. \"\n        f\"Got {type(variable)} instead.\"\n    )\n"
  },
  {
    "path": "imblearn/keras/__init__.py",
    "content": "\"\"\"The :mod:`imblearn.keras` provides utilities to deal with imbalanced dataset\nin keras.\"\"\"\n\nfrom imblearn.keras._generator import BalancedBatchGenerator, balanced_batch_generator\n\n__all__ = [\"BalancedBatchGenerator\", \"balanced_batch_generator\"]\n"
  },
  {
    "path": "imblearn/keras/_generator.py",
    "content": "\"\"\"Implement generators for ``keras`` which will balance the data.\"\"\"\n\n\n# This is a trick to avoid an error during tests collection with pytest. We\n# avoid the error when importing the package raise the error at the moment of\n# creating the instance.\n# This is a trick to avoid an error during tests collection with pytest. We\n# avoid the error when importing the package raise the error at the moment of\n# creating the instance.\ndef import_keras():\n    \"\"\"Try to import keras from keras and tensorflow.\n\n    This is possible to import the sequence from keras or tensorflow.\n    \"\"\"\n\n    def import_from_keras():\n        try:\n            import keras  # noqa\n\n            if hasattr(keras.utils, \"Sequence\"):\n                return (keras.utils.Sequence,), True\n            else:\n                return (keras.utils.PyDataset,), True\n        except ImportError:\n            return tuple(), False\n\n    def import_from_tensforflow():\n        try:\n            from tensorflow import keras\n\n            if hasattr(keras.utils, \"Sequence\"):\n                return (keras.utils.Sequence,), True\n            else:\n                return (keras.utils.PyDataset,), True\n        except ImportError:\n            return tuple(), False\n\n    ParentClassKeras, has_keras_k = import_from_keras()\n    ParentClassTensorflow, has_keras_tf = import_from_tensforflow()\n    has_keras = has_keras_k or has_keras_tf\n    if has_keras:\n        if has_keras_k:\n            ParentClass = ParentClassKeras\n        else:\n            ParentClass = ParentClassTensorflow\n    else:\n        ParentClass = (object,)\n    return ParentClass, has_keras\n\n\nParentClass, HAS_KERAS = import_keras()\n\nfrom scipy.sparse import issparse  # noqa\nfrom sklearn.base import clone  # noqa\nfrom sklearn.utils import _safe_indexing  # noqa\nfrom sklearn.utils import check_random_state  # noqa\n\nfrom imblearn.tensorflow import balanced_batch_generator as tf_bbg  
# noqa\nfrom imblearn.under_sampling import RandomUnderSampler  # noqa\nfrom imblearn.utils import Substitution  # noqa\nfrom imblearn.utils._docstring import _random_state_docstring  # noqa\n\n\nclass BalancedBatchGenerator(*ParentClass):  # type: ignore\n    \"\"\"Create balanced batches when training a keras model.\n\n    Create a keras ``Sequence`` which is given to ``fit``. The\n    sampler defines the sampling strategy used to balance the dataset ahead of\n    creating the batch. The sampler should have an attribute\n    ``sample_indices_``.\n\n    .. versionadded:: 0.4\n\n    Parameters\n    ----------\n    X : ndarray of shape (n_samples, n_features)\n        Original imbalanced dataset.\n\n    y : ndarray of shape (n_samples,) or (n_samples, n_classes)\n        Associated targets.\n\n    sample_weight : ndarray of shape (n_samples,), default=None\n        Sample weight.\n\n    sampler : sampler object, default=None\n        A sampler instance which has an attribute ``sample_indices_``.\n        By default, the sampler used is a\n        :class:`~imblearn.under_sampling.RandomUnderSampler`.\n\n    batch_size : int, default=32\n        Number of samples per gradient update.\n\n    keep_sparse : bool, default=False\n        Whether or not to preserve the sparsity of the input (i.e. ``X``,\n        ``y``, ``sample_weight``). 
By default, the returned batches will be\n        dense.\n\n    random_state : int, RandomState instance or None, default=None\n        Control the randomization of the algorithm:\n\n        - If int, ``random_state`` is the seed used by the random number\n          generator;\n        - If ``RandomState`` instance, random_state is the random number\n          generator;\n        - If ``None``, the random number generator is the ``RandomState``\n          instance used by ``np.random``.\n\n    Attributes\n    ----------\n    sampler_ : sampler object\n        The sampler used to balance the dataset.\n\n    indices_ : ndarray of shape (n_samples,)\n        The indices of the samples selected during sampling.\n\n    Examples\n    --------\n    >>> from sklearn.datasets import load_iris\n    >>> iris = load_iris()\n    >>> from imblearn.datasets import make_imbalance\n    >>> class_dict = dict()\n    >>> class_dict[0] = 30; class_dict[1] = 50; class_dict[2] = 40\n    >>> X, y = make_imbalance(iris.data, iris.target, sampling_strategy=class_dict)\n    >>> import tensorflow\n    >>> y = tensorflow.keras.utils.to_categorical(y, 3)\n    >>> model = tensorflow.keras.models.Sequential()\n    >>> model.add(\n    ...     tensorflow.keras.layers.Dense(\n    ...         y.shape[1], input_dim=X.shape[1], activation='softmax'\n    ...     )\n    ... )\n    >>> model.compile(optimizer='sgd', loss='categorical_crossentropy',\n    ...               metrics=['accuracy'])\n    >>> from imblearn.keras import BalancedBatchGenerator\n    >>> from imblearn.under_sampling import NearMiss\n    >>> training_generator = BalancedBatchGenerator(\n    ...     
X, y, sampler=NearMiss(), batch_size=10, random_state=42)\n    >>> callback_history = model.fit(training_generator, epochs=10, verbose=0)\n    \"\"\"\n\n    # flag for keras sequence duck-typing\n    use_sequence_api = True\n\n    def __init__(\n        self,\n        X,\n        y,\n        *,\n        sample_weight=None,\n        sampler=None,\n        batch_size=32,\n        keep_sparse=False,\n        random_state=None,\n    ):\n        if not HAS_KERAS:\n            raise ImportError(\"No module named 'keras'\")\n        self.X = X\n        self.y = y\n        self.sample_weight = sample_weight\n        self.sampler = sampler\n        self.batch_size = batch_size\n        self.keep_sparse = keep_sparse\n        self.random_state = random_state\n        self._sample()\n\n    def _sample(self):\n        random_state = check_random_state(self.random_state)\n        if self.sampler is None:\n            self.sampler_ = RandomUnderSampler(random_state=random_state)\n        else:\n            self.sampler_ = clone(self.sampler)\n        self.sampler_.fit_resample(self.X, self.y)\n        if not hasattr(self.sampler_, \"sample_indices_\"):\n            raise ValueError(\"'sampler' needs to have an attribute 'sample_indices_'.\")\n        self.indices_ = self.sampler_.sample_indices_\n        # shuffle the indices since the sampler packs them by class\n        random_state.shuffle(self.indices_)\n\n    def __len__(self):\n        return int(self.indices_.size // self.batch_size)\n\n    def __getitem__(self, index):\n        X_resampled = _safe_indexing(\n            self.X,\n            self.indices_[index * self.batch_size : (index + 1) * self.batch_size],\n        )\n        y_resampled = _safe_indexing(\n            self.y,\n            self.indices_[index * self.batch_size : (index + 1) * self.batch_size],\n        )\n        if issparse(X_resampled) and not self.keep_sparse:\n            X_resampled = X_resampled.toarray()\n        if self.sample_weight 
is not None:\n            sample_weight_resampled = _safe_indexing(\n                self.sample_weight,\n                self.indices_[index * self.batch_size : (index + 1) * self.batch_size],\n            )\n\n        if self.sample_weight is None:\n            return X_resampled, y_resampled\n        else:\n            return X_resampled, y_resampled, sample_weight_resampled\n\n\n@Substitution(random_state=_random_state_docstring)\ndef balanced_batch_generator(\n    X,\n    y,\n    *,\n    sample_weight=None,\n    sampler=None,\n    batch_size=32,\n    keep_sparse=False,\n    random_state=None,\n):\n    \"\"\"Create a balanced batch generator to train a keras model.\n\n    Returns a generator --- as well as the number of steps per epoch --- which\n    is given to ``fit``. The sampler defines the sampling strategy\n    used to balance the dataset ahead of creating the batch. The sampler should\n    have an attribute ``sample_indices_``.\n\n    Parameters\n    ----------\n    X : ndarray of shape (n_samples, n_features)\n        Original imbalanced dataset.\n\n    y : ndarray of shape (n_samples,) or (n_samples, n_classes)\n        Associated targets.\n\n    sample_weight : ndarray of shape (n_samples,), default=None\n        Sample weight.\n\n    sampler : sampler object, default=None\n        A sampler instance which has an attribute ``sample_indices_``.\n        By default, the sampler used is a\n        :class:`~imblearn.under_sampling.RandomUnderSampler`.\n\n    batch_size : int, default=32\n        Number of samples per gradient update.\n\n    keep_sparse : bool, default=False\n        Whether or not to preserve the sparsity of the input (i.e. ``X``,\n        ``y``, ``sample_weight``). By default, the returned batches will be\n        dense.\n\n    {random_state}\n\n    Returns\n    -------\n    generator : generator of tuple\n        Generate batch of data. 
The generated tuples are either (X_batch,\n        y_batch) or (X_batch, y_batch, sample_weight_batch).\n\n    steps_per_epoch : int\n        The number of steps (batches) per epoch. Required by ``fit_generator`` in\n        keras.\n\n    Examples\n    --------\n    >>> from sklearn.datasets import load_iris\n    >>> X, y = load_iris(return_X_y=True)\n    >>> from imblearn.datasets import make_imbalance\n    >>> class_dict = dict()\n    >>> class_dict[0] = 30; class_dict[1] = 50; class_dict[2] = 40\n    >>> X, y = make_imbalance(X, y, sampling_strategy=class_dict)\n    >>> import tensorflow\n    >>> y = tensorflow.keras.utils.to_categorical(y, 3)\n    >>> model = tensorflow.keras.models.Sequential()\n    >>> model.add(\n    ...     tensorflow.keras.layers.Dense(\n    ...         y.shape[1], input_dim=X.shape[1], activation='softmax'\n    ...     )\n    ... )\n    >>> model.compile(optimizer='sgd', loss='categorical_crossentropy',\n    ...               metrics=['accuracy'])\n    >>> from imblearn.keras import balanced_batch_generator\n    >>> from imblearn.under_sampling import NearMiss\n    >>> training_generator, steps_per_epoch = balanced_batch_generator(\n    ...     X, y, sampler=NearMiss(), batch_size=10, random_state=42)\n    >>> callback_history = model.fit(training_generator,\n    ...                              steps_per_epoch=steps_per_epoch,\n    ...                              epochs=10, verbose=0)\n    \"\"\"\n\n    return tf_bbg(\n        X=X,\n        y=y,\n        sample_weight=sample_weight,\n        sampler=sampler,\n        batch_size=batch_size,\n        keep_sparse=keep_sparse,\n        random_state=random_state,\n    )\n"
  },
  {
    "path": "imblearn/keras/tests/__init__.py",
    "content": ""
  },
  {
    "path": "imblearn/keras/tests/test_generator.py",
    "content": "import numpy as np\nimport pytest\nfrom scipy import sparse\nfrom sklearn.cluster import KMeans\nfrom sklearn.datasets import load_iris\nfrom sklearn.preprocessing import LabelBinarizer\n\nkeras = pytest.importorskip(\"keras\")\nfrom keras.layers import Dense  # noqa: E402\nfrom keras.models import Sequential  # noqa: E402\n\nfrom imblearn.datasets import make_imbalance  # noqa: E402\nfrom imblearn.keras import (  # noqa: E402\n    BalancedBatchGenerator,\n    balanced_batch_generator,\n)\nfrom imblearn.over_sampling import RandomOverSampler  # noqa: E402\nfrom imblearn.under_sampling import ClusterCentroids, NearMiss  # noqa: E402\n\n\n@pytest.fixture\ndef data():\n    iris = load_iris()\n    X, y = make_imbalance(\n        iris.data, iris.target, sampling_strategy={0: 30, 1: 50, 2: 40}\n    )\n    X = X.astype(np.float32)\n    y = LabelBinarizer().fit_transform(y).astype(np.int32)\n    return X, y\n\n\ndef _build_keras_model(n_classes, n_features):\n    model = Sequential()\n    model.add(Dense(n_classes, input_dim=n_features, activation=\"softmax\"))\n    model.compile(\n        optimizer=\"sgd\", loss=\"categorical_crossentropy\", metrics=[\"accuracy\"]\n    )\n    return model\n\n\ndef test_balanced_batch_generator_class_no_return_indices(data):\n    with pytest.raises(ValueError, match=\"needs to have an attribute\"):\n        BalancedBatchGenerator(\n            *data, sampler=ClusterCentroids(estimator=KMeans(n_init=1)), batch_size=10\n        )\n\n\n@pytest.mark.filterwarnings(\"ignore:`wait_time` is not used\")  # keras 2.2.4\n@pytest.mark.parametrize(\n    \"sampler, sample_weight\",\n    [\n        (None, None),\n        (RandomOverSampler(), None),\n        (NearMiss(), None),\n        (None, np.random.uniform(size=120)),\n    ],\n)\ndef test_balanced_batch_generator_class(data, sampler, sample_weight):\n    X, y = data\n    model = _build_keras_model(y.shape[1], X.shape[1])\n    training_generator = BalancedBatchGenerator(\n        
X,\n        y,\n        sample_weight=sample_weight,\n        sampler=sampler,\n        batch_size=10,\n        random_state=42,\n    )\n    model.fit(training_generator, epochs=10)\n\n\n@pytest.mark.parametrize(\"keep_sparse\", [True, False])\ndef test_balanced_batch_generator_class_sparse(data, keep_sparse):\n    X, y = data\n    training_generator = BalancedBatchGenerator(\n        sparse.csr_matrix(X),\n        y,\n        batch_size=10,\n        keep_sparse=keep_sparse,\n        random_state=42,\n    )\n    for idx in range(len(training_generator)):\n        X_batch, _ = training_generator.__getitem__(idx)\n        if keep_sparse:\n            assert sparse.issparse(X_batch)\n        else:\n            assert not sparse.issparse(X_batch)\n\n\ndef test_balanced_batch_generator_function_no_return_indices(data):\n    with pytest.raises(ValueError, match=\"needs to have an attribute\"):\n        balanced_batch_generator(\n            *data,\n            sampler=ClusterCentroids(estimator=KMeans(n_init=10)),\n            batch_size=10,\n            random_state=42,\n        )\n\n\n@pytest.mark.filterwarnings(\"ignore:`wait_time` is not used\")  # keras 2.2.4\n@pytest.mark.parametrize(\n    \"sampler, sample_weight\",\n    [\n        (None, None),\n        (RandomOverSampler(), None),\n        (NearMiss(), None),\n        (None, np.random.uniform(size=120).astype(np.float32)),\n    ],\n)\ndef test_balanced_batch_generator_function(data, sampler, sample_weight):\n    X, y = data\n    model = _build_keras_model(y.shape[1], X.shape[1])\n    training_generator, steps_per_epoch = balanced_batch_generator(\n        X,\n        y,\n        sample_weight=sample_weight,\n        sampler=sampler,\n        batch_size=10,\n        random_state=42,\n    )\n    print(next(training_generator))\n    model.fit(\n        training_generator,\n        steps_per_epoch=steps_per_epoch,\n        epochs=10,\n    )\n\n\n@pytest.mark.parametrize(\"keep_sparse\", [True, False])\ndef 
test_balanced_batch_generator_function_sparse(data, keep_sparse):\n    X, y = data\n    training_generator, steps_per_epoch = balanced_batch_generator(\n        sparse.csr_matrix(X),\n        y,\n        keep_sparse=keep_sparse,\n        batch_size=10,\n        random_state=42,\n    )\n    for _ in range(steps_per_epoch):\n        X_batch, _ = next(training_generator)\n        if keep_sparse:\n            assert sparse.issparse(X_batch)\n        else:\n            assert not sparse.issparse(X_batch)\n"
  },
  {
    "path": "imblearn/metrics/__init__.py",
    "content": "\"\"\"\nThe :mod:`imblearn.metrics` module includes score functions, performance\nmetrics and pairwise metrics and distance computations.\n\"\"\"\n\nfrom imblearn.metrics._classification import (\n    classification_report_imbalanced,\n    geometric_mean_score,\n    macro_averaged_mean_absolute_error,\n    make_index_balanced_accuracy,\n    sensitivity_score,\n    sensitivity_specificity_support,\n    specificity_score,\n)\n\n__all__ = [\n    \"sensitivity_specificity_support\",\n    \"sensitivity_score\",\n    \"specificity_score\",\n    \"geometric_mean_score\",\n    \"make_index_balanced_accuracy\",\n    \"classification_report_imbalanced\",\n    \"macro_averaged_mean_absolute_error\",\n]\n"
  },
  {
    "path": "imblearn/metrics/_classification.py",
    "content": "\"\"\"Metrics to assess performance on a classification task given class\npredictions. The available metrics are complementary from the metrics available\nin scikit-learn.\n\nFunctions named as ``*_score`` return a scalar value to maximize: the higher\nthe better\n\nFunction named as ``*_error`` or ``*_loss`` return a scalar value to minimize:\nthe lower the better\n\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Dariusz Brzezinski\n# License: MIT\n\nimport functools\nimport numbers\nimport warnings\nfrom inspect import signature\n\nimport numpy as np\nimport scipy as sp\nfrom sklearn.metrics import mean_absolute_error, precision_recall_fscore_support\nfrom sklearn.metrics._classification import _prf_divide\nfrom sklearn.preprocessing import LabelEncoder\nfrom sklearn.utils._param_validation import Interval, StrOptions\nfrom sklearn.utils.multiclass import unique_labels\nfrom sklearn.utils.validation import check_consistent_length, column_or_1d\nfrom sklearn_compat.metrics._classification import _check_targets\nfrom sklearn_compat.utils._param_validation import validate_params\n\n\n@validate_params(\n    {\n        \"y_true\": [\"array-like\"],\n        \"y_pred\": [\"array-like\"],\n        \"labels\": [\"array-like\", None],\n        \"pos_label\": [str, numbers.Integral, None],\n        \"average\": [\n            None,\n            StrOptions({\"binary\", \"micro\", \"macro\", \"weighted\", \"samples\"}),\n        ],\n        \"warn_for\": [\"array-like\"],\n        \"sample_weight\": [\"array-like\", None],\n    },\n    prefer_skip_nested_validation=True,\n)\ndef sensitivity_specificity_support(\n    y_true,\n    y_pred,\n    *,\n    labels=None,\n    pos_label=1,\n    average=None,\n    warn_for=(\"sensitivity\", \"specificity\"),\n    sample_weight=None,\n):\n    \"\"\"Compute sensitivity, specificity, and support for each class.\n\n    The sensitivity is the ratio ``tp / (tp + fn)`` where ``tp`` is the number\n   
 of true positives and ``fn`` the number of false negatives. The sensitivity\n    quantifies the ability to avoid false negatives [1]_.\n\n    The specificity is the ratio ``tn / (tn + fp)`` where ``tn`` is the number\n    of true negatives and ``fp`` the number of false positives. The specificity\n    quantifies the ability to avoid false positives [1]_.\n\n    The support is the number of occurrences of each class in ``y_true``.\n\n    If ``pos_label is None`` and in binary classification, this function\n    returns the average sensitivity and specificity if ``average``\n    is ``'weighted'``.\n\n    Read more in the :ref:`User Guide <sensitivity_specificity>`.\n\n    Parameters\n    ----------\n    y_true : array-like of shape (n_samples,)\n        Ground truth (correct) target values.\n\n    y_pred : array-like of shape (n_samples,)\n        Estimated targets as returned by a classifier.\n\n    labels : array-like, default=None\n        The set of labels to include when ``average != 'binary'``, and their\n        order if ``average is None``. Labels present in the data can be\n        excluded, for example to calculate a multiclass average ignoring a\n        majority negative class, while labels not present in the data will\n        result in 0 components in a macro average. For multilabel targets,\n        labels are column indices. 
By default, all labels in ``y_true`` and\n        ``y_pred`` are used in sorted order.\n\n    pos_label : str, int or None, default=1\n        The class to report if ``average='binary'`` and the data is binary.\n        If ``pos_label is None`` and in binary classification, this function\n        returns the average sensitivity and specificity if ``average``\n        is ``'weighted'``.\n        If the data are multiclass, this will be ignored;\n        setting ``labels=[pos_label]`` and ``average != 'binary'`` will report\n        scores for that label only.\n\n    average : str, default=None\n        If ``None``, the scores for each class are returned. Otherwise, this\n        determines the type of averaging performed on the data:\n\n        ``'binary'``:\n            Only report results for the class specified by ``pos_label``.\n            This is applicable only if targets (``y_{true,pred}``) are binary.\n        ``'micro'``:\n            Calculate metrics globally by counting the total true positives,\n            false negatives and false positives.\n        ``'macro'``:\n            Calculate metrics for each label, and find their unweighted\n            mean.  This does not take label imbalance into account.\n        ``'weighted'``:\n            Calculate metrics for each label, and find their average, weighted\n            by support (the number of true instances for each label). 
This\n            alters 'macro' to account for label imbalance; it can result in an\n            F-score that is not between precision and recall.\n        ``'samples'``:\n            Calculate metrics for each instance, and find their average (only\n            meaningful for multilabel classification where this differs from\n            :func:`accuracy_score`).\n\n    warn_for : tuple or set of {{\"sensitivity\", \"specificity\"}}, for internal use\n        This determines which warnings will be made in the case that this\n        function is being used to return only one of its metrics.\n\n    sample_weight : array-like of shape (n_samples,), default=None\n        Sample weights.\n\n    Returns\n    -------\n    sensitivity : float (if `average is not None`) or ndarray of \\\n            shape (n_unique_labels,)\n        The sensitivity metric.\n\n    specificity : float (if `average is not None`) or ndarray of \\\n            shape (n_unique_labels,)\n        The specificity metric.\n\n    support : None (if `average is not None`) or ndarray of \\\n            shape (n_unique_labels,)\n        The number of occurrences of each label in ``y_true``.\n\n    References\n    ----------\n    .. 
[1] `Wikipedia entry for the Sensitivity and specificity\n           <https://en.wikipedia.org/wiki/Sensitivity_and_specificity>`_\n\n    Examples\n    --------\n    >>> import numpy as np\n    >>> from imblearn.metrics import sensitivity_specificity_support\n    >>> y_true = np.array(['cat', 'dog', 'pig', 'cat', 'dog', 'pig'])\n    >>> y_pred = np.array(['cat', 'pig', 'dog', 'cat', 'cat', 'dog'])\n    >>> sensitivity_specificity_support(y_true, y_pred, average='macro')\n    (0.33..., 0.66..., None)\n    >>> sensitivity_specificity_support(y_true, y_pred, average='micro')\n    (0.33..., 0.66..., None)\n    >>> sensitivity_specificity_support(y_true, y_pred, average='weighted')\n    (0.33..., 0.66..., None)\n    \"\"\"\n    average_options = (None, \"micro\", \"macro\", \"weighted\", \"samples\")\n    if average not in average_options and average != \"binary\":\n        raise ValueError(\"average has to be one of \" + str(average_options))\n\n    y_type, y_true, y_pred, sample_weight = _check_targets(\n        y_true, y_pred, sample_weight=sample_weight\n    )\n    present_labels = unique_labels(y_true, y_pred)\n\n    if average == \"binary\":\n        if y_type == \"binary\":\n            if pos_label not in present_labels:\n                if len(present_labels) < 2:\n                    # Only negative labels\n                    return (0.0, 0.0, 0)\n                else:\n                    raise ValueError(\n                        f\"pos_label={pos_label!r} is not a valid label:\"\n                        f\" {present_labels!r}\"\n                    )\n            labels = [pos_label]\n        else:\n            raise ValueError(\n                f\"Target is {y_type} but average='binary'. 
Please \"\n                \"choose another average setting.\"\n            )\n    elif pos_label not in (None, 1):\n        warnings.warn(\n            (\n                f\"Note that pos_label (set to {pos_label!r}) is ignored when \"\n                f\"average != 'binary' (got {average!r}). You may use \"\n                \"labels=[pos_label] to specify a single positive class.\"\n            ),\n            UserWarning,\n        )\n\n    if labels is None:\n        labels = present_labels\n        n_labels = None\n    else:\n        n_labels = len(labels)\n        labels = np.hstack(\n            [labels, np.setdiff1d(present_labels, labels, assume_unique=True)]\n        )\n\n    # Calculate tp_sum, pred_sum, true_sum ###\n\n    if y_type.startswith(\"multilabel\"):\n        raise ValueError(\"imblearn does not support multilabel\")\n    elif average == \"samples\":\n        raise ValueError(\n            \"Sample-based precision, recall, fscore is \"\n            \"not meaningful outside multilabel \"\n            \"classification. 
See the accuracy_score instead.\"\n        )\n    else:\n        le = LabelEncoder()\n        le.fit(labels)\n        y_true = le.transform(y_true)\n        y_pred = le.transform(y_pred)\n        sorted_labels = le.classes_\n\n        # labels are now from 0 to len(labels) - 1 -> use bincount\n        tp = y_true == y_pred\n        tp_bins = y_true[tp]\n        if sample_weight is not None:\n            tp_bins_weights = np.asarray(sample_weight)[tp]\n        else:\n            tp_bins_weights = None\n\n        if len(tp_bins):\n            tp_sum = np.bincount(\n                tp_bins, weights=tp_bins_weights, minlength=len(labels)\n            )\n        else:\n            # Pathological case\n            true_sum = pred_sum = tp_sum = np.zeros(len(labels))\n        if len(y_pred):\n            pred_sum = np.bincount(y_pred, weights=sample_weight, minlength=len(labels))\n        if len(y_true):\n            true_sum = np.bincount(y_true, weights=sample_weight, minlength=len(labels))\n\n        # Compute the true negative\n        tn_sum = y_true.size - (pred_sum + true_sum - tp_sum)\n\n        # Retain only selected labels\n        indices = np.searchsorted(sorted_labels, labels[:n_labels])\n        tp_sum = tp_sum[indices]\n        true_sum = true_sum[indices]\n        pred_sum = pred_sum[indices]\n        tn_sum = tn_sum[indices]\n\n    if average == \"micro\":\n        tp_sum = np.array([tp_sum.sum()])\n        pred_sum = np.array([pred_sum.sum()])\n        true_sum = np.array([true_sum.sum()])\n        tn_sum = np.array([tn_sum.sum()])\n\n    # Finally, we have all our sufficient statistics. Divide! 
#\n\n    with np.errstate(divide=\"ignore\", invalid=\"ignore\"):\n        # Divide, and on zero-division, set scores to 0 and warn:\n\n        # Oddly, we may get an \"invalid\" rather than a \"divide\" error\n        # here.\n        specificity = _prf_divide(\n            tn_sum,\n            tn_sum + pred_sum - tp_sum,\n            \"specificity\",\n            \"predicted\",\n            average,\n            warn_for,\n        )\n        sensitivity = _prf_divide(\n            tp_sum, true_sum, \"sensitivity\", \"true\", average, warn_for\n        )\n\n    # Average the results\n\n    if average == \"weighted\":\n        weights = true_sum\n        if weights.sum() == 0:\n            return 0, 0, None\n    elif average == \"samples\":\n        weights = sample_weight\n    else:\n        weights = None\n\n    if average is not None:\n        assert average != \"binary\" or len(specificity) == 1\n        specificity = np.average(specificity, weights=weights)\n        sensitivity = np.average(sensitivity, weights=weights)\n        true_sum = None  # return no support\n\n    return sensitivity, specificity, true_sum\n\n\n@validate_params(\n    {\n        \"y_true\": [\"array-like\"],\n        \"y_pred\": [\"array-like\"],\n        \"labels\": [\"array-like\", None],\n        \"pos_label\": [str, numbers.Integral, None],\n        \"average\": [\n            None,\n            StrOptions({\"binary\", \"micro\", \"macro\", \"weighted\", \"samples\"}),\n        ],\n        \"sample_weight\": [\"array-like\", None],\n    },\n    prefer_skip_nested_validation=True,\n)\ndef sensitivity_score(\n    y_true,\n    y_pred,\n    *,\n    labels=None,\n    pos_label=1,\n    average=\"binary\",\n    sample_weight=None,\n):\n    \"\"\"Compute the sensitivity.\n\n    The sensitivity is the ratio ``tp / (tp + fn)`` where ``tp`` is the number\n    of true positives and ``fn`` the number of false negatives. 
The sensitivity\n    quantifies the ability to avoid false negatives.\n\n    The best value is 1 and the worst value is 0.\n\n    Read more in the :ref:`User Guide <sensitivity_specificity>`.\n\n    Parameters\n    ----------\n    y_true : array-like of shape (n_samples,)\n        Ground truth (correct) target values.\n\n    y_pred : array-like of shape (n_samples,)\n        Estimated targets as returned by a classifier.\n\n    labels : array-like, default=None\n        The set of labels to include when ``average != 'binary'``, and their\n        order if ``average is None``. Labels present in the data can be\n        excluded, for example to calculate a multiclass average ignoring a\n        majority negative class, while labels not present in the data will\n        result in 0 components in a macro average.\n\n    pos_label : str, int or None, default=1\n        The class to report if ``average='binary'`` and the data is binary.\n        If ``pos_label is None`` and in binary classification, this function\n        returns the average sensitivity if ``average`` is one of ``'weighted'``.\n        If the data are multiclass, this will be ignored;\n        setting ``labels=[pos_label]`` and ``average != 'binary'`` will report\n        scores for that label only.\n\n    average : str, default=None\n        If ``None``, the scores for each class are returned. Otherwise, this\n        determines the type of averaging performed on the data:\n\n        ``'binary'``:\n            Only report results for the class specified by ``pos_label``.\n            This is applicable only if targets (``y_{true,pred}``) are binary.\n        ``'micro'``:\n            Calculate metrics globally by counting the total true positives,\n            false negatives and false positives.\n        ``'macro'``:\n            Calculate metrics for each label, and find their unweighted\n            mean.  
This does not take label imbalance into account.\n        ``'weighted'``:\n            Calculate metrics for each label, and find their average, weighted\n            by support (the number of true instances for each label). This\n            alters 'macro' to account for label imbalance; it can result in an\n            F-score that is not between precision and recall.\n        ``'samples'``:\n            Calculate metrics for each instance, and find their average (only\n            meaningful for multilabel classification where this differs from\n            :func:`accuracy_score`).\n\n    sample_weight : array-like of shape (n_samples,), default=None\n        Sample weights.\n\n    Returns\n    -------\n    sensitivity : float (if `average is not None`) or ndarray of \\\n            shape (n_unique_labels,)\n        The sensitivity metric.\n\n    Examples\n    --------\n    >>> import numpy as np\n    >>> from imblearn.metrics import sensitivity_score\n    >>> y_true = [0, 1, 2, 0, 1, 2]\n    >>> y_pred = [0, 2, 1, 0, 0, 1]\n    >>> sensitivity_score(y_true, y_pred, average='macro')\n    0.33...\n    >>> sensitivity_score(y_true, y_pred, average='micro')\n    0.33...\n    >>> sensitivity_score(y_true, y_pred, average='weighted')\n    0.33...\n    >>> sensitivity_score(y_true, y_pred, average=None)\n    array([1., 0., 0.])\n    \"\"\"\n    s, _, _ = sensitivity_specificity_support(\n        y_true,\n        y_pred,\n        labels=labels,\n        pos_label=pos_label,\n        average=average,\n        warn_for=(\"sensitivity\",),\n        sample_weight=sample_weight,\n    )\n\n    return s\n\n\n@validate_params(\n    {\n        \"y_true\": [\"array-like\"],\n        \"y_pred\": [\"array-like\"],\n        \"labels\": [\"array-like\", None],\n        \"pos_label\": [str, numbers.Integral, None],\n        \"average\": [\n            None,\n            StrOptions({\"binary\", \"micro\", \"macro\", \"weighted\", \"samples\"}),\n        ],\n        \"sample_weight\": 
[\"array-like\", None],\n    },\n    prefer_skip_nested_validation=True,\n)\ndef specificity_score(\n    y_true,\n    y_pred,\n    *,\n    labels=None,\n    pos_label=1,\n    average=\"binary\",\n    sample_weight=None,\n):\n    \"\"\"Compute the specificity.\n\n    The specificity is the ratio ``tn / (tn + fp)`` where ``tn`` is the number\n    of true negatives and ``fp`` the number of false positives. The specificity\n    quantifies the ability to avoid false positives.\n\n    The best value is 1 and the worst value is 0.\n\n    Read more in the :ref:`User Guide <sensitivity_specificity>`.\n\n    Parameters\n    ----------\n    y_true : array-like of shape (n_samples,)\n        Ground truth (correct) target values.\n\n    y_pred : array-like of shape (n_samples,)\n        Estimated targets as returned by a classifier.\n\n    labels : array-like, default=None\n        The set of labels to include when ``average != 'binary'``, and their\n        order if ``average is None``. Labels present in the data can be\n        excluded, for example to calculate a multiclass average ignoring a\n        majority negative class, while labels not present in the data will\n        result in 0 components in a macro average.\n\n    pos_label : str, int or None, default=1\n        The class to report if ``average='binary'`` and the data is binary.\n        If ``pos_label is None`` and in binary classification, this function\n        returns the average specificity if ``average`` is one of ``'weighted'``.\n        If the data are multiclass, this will be ignored;\n        setting ``labels=[pos_label]`` and ``average != 'binary'`` will report\n        scores for that label only.\n\n    average : str, default=None\n        If ``None``, the scores for each class are returned. 
Otherwise, this\n        determines the type of averaging performed on the data:\n\n        ``'binary'``:\n            Only report results for the class specified by ``pos_label``.\n            This is applicable only if targets (``y_{true,pred}``) are binary.\n        ``'micro'``:\n            Calculate metrics globally by counting the total true positives,\n            false negatives and false positives.\n        ``'macro'``:\n            Calculate metrics for each label, and find their unweighted\n            mean.  This does not take label imbalance into account.\n        ``'weighted'``:\n            Calculate metrics for each label, and find their average, weighted\n            by support (the number of true instances for each label). This\n            alters 'macro' to account for label imbalance; it can result in an\n            F-score that is not between precision and recall.\n        ``'samples'``:\n            Calculate metrics for each instance, and find their average (only\n            meaningful for multilabel classification where this differs from\n            :func:`accuracy_score`).\n\n    sample_weight : array-like of shape (n_samples,), default=None\n        Sample weights.\n\n    Returns\n    -------\n    specificity : float (if `average is None`) or ndarray of \\\n            shape (n_unique_labels,)\n        The specificity metric.\n\n    Examples\n    --------\n    >>> import numpy as np\n    >>> from imblearn.metrics import specificity_score\n    >>> y_true = [0, 1, 2, 0, 1, 2]\n    >>> y_pred = [0, 2, 1, 0, 0, 1]\n    >>> specificity_score(y_true, y_pred, average='macro')\n    0.66...\n    >>> specificity_score(y_true, y_pred, average='micro')\n    0.66...\n    >>> specificity_score(y_true, y_pred, average='weighted')\n    0.66...\n    >>> specificity_score(y_true, y_pred, average=None)\n    array([0.75, 0.5 , 0.75])\n    \"\"\"\n    _, s, _ = sensitivity_specificity_support(\n        y_true,\n        y_pred,\n        labels=labels,\n      
  pos_label=pos_label,\n        average=average,\n        warn_for=(\"specificity\",),\n        sample_weight=sample_weight,\n    )\n\n    return s\n\n\n@validate_params(\n    {\n        \"y_true\": [\"array-like\"],\n        \"y_pred\": [\"array-like\"],\n        \"labels\": [\"array-like\", None],\n        \"pos_label\": [str, numbers.Integral, None],\n        \"average\": [\n            None,\n            StrOptions(\n                {\"binary\", \"micro\", \"macro\", \"weighted\", \"samples\", \"multiclass\"}\n            ),\n        ],\n        \"sample_weight\": [\"array-like\", None],\n        \"correction\": [Interval(numbers.Real, 0, None, closed=\"left\")],\n    },\n    prefer_skip_nested_validation=True,\n)\ndef geometric_mean_score(\n    y_true,\n    y_pred,\n    *,\n    labels=None,\n    pos_label=1,\n    average=\"multiclass\",\n    sample_weight=None,\n    correction=0.0,\n):\n    \"\"\"Compute the geometric mean.\n\n    The geometric mean (G-mean) is the root of the product of class-wise\n    sensitivity. This measure tries to maximize the accuracy on each of the\n    classes while keeping these accuracies balanced. For binary classification\n    G-mean is the square root of the product of the sensitivity\n    and specificity. For multi-class problems it is a higher root of the\n    product of sensitivity for each class.\n\n    For compatibility with other imbalance performance measures, G-mean can be\n    calculated for each class separately on a one-vs-rest basis when\n    ``average != 'multiclass'``.\n\n    The best value is 1 and the worst value is 0. Traditionally if at least one\n    class is unrecognized by the classifier, G-mean resolves to zero. To\n    alleviate this property, for highly multi-class problems the sensitivity of\n    unrecognized classes can be \"corrected\" to be a user specified value\n    (instead of zero). 
This option works only if ``average == 'multiclass'``.\n\n    Read more in the :ref:`User Guide <imbalanced_metrics>`.\n\n    Parameters\n    ----------\n    y_true : array-like of shape (n_samples,)\n        Ground truth (correct) target values.\n\n    y_pred : array-like of shape (n_samples,)\n        Estimated targets as returned by a classifier.\n\n    labels : array-like, default=None\n        The set of labels to include when ``average != 'binary'``, and their\n        order if ``average is None``. Labels present in the data can be\n        excluded, for example to calculate a multiclass average ignoring a\n        majority negative class, while labels not present in the data will\n        result in 0 components in a macro average.\n\n    pos_label : str, int or None, default=1\n        The class to report if ``average='binary'`` and the data is binary.\n        If ``pos_label is None`` and in binary classification, this function\n        returns the average geometric mean if ``average`` is one of\n        ``'weighted'``.\n        If the data are multiclass, this will be ignored;\n        setting ``labels=[pos_label]`` and ``average != 'binary'`` will report\n        scores for that label only.\n\n    average : str or None, default='multiclass'\n        If ``None``, the scores for each class are returned. Otherwise, this\n        determines the type of averaging performed on the data:\n\n        ``'binary'``:\n            Only report results for the class specified by ``pos_label``.\n            This is applicable only if targets (``y_{true,pred}``) are binary.\n        ``'micro'``:\n            Calculate metrics globally by counting the total true positives,\n            false negatives and false positives.\n        ``'macro'``:\n            Calculate metrics for each label, and find their unweighted\n            mean.  
This does not take label imbalance into account.\n        ``'multiclass'``:\n            No average is taken.\n        ``'weighted'``:\n            Calculate metrics for each label, and find their average, weighted\n            by support (the number of true instances for each label). This\n            alters 'macro' to account for label imbalance; it can result in an\n            F-score that is not between precision and recall.\n        ``'samples'``:\n            Calculate metrics for each instance, and find their average (only\n            meaningful for multilabel classification where this differs from\n            :func:`accuracy_score`).\n\n    sample_weight : array-like of shape (n_samples,), default=None\n        Sample weights.\n\n    correction : float, default=0.0\n        Substitutes sensitivity of unrecognized classes from zero to a given\n        value.\n\n    Returns\n    -------\n    geometric_mean : float\n        Returns the geometric mean.\n\n    Notes\n    -----\n    See :ref:`sphx_glr_auto_examples_evaluation_plot_metrics.py`.\n\n    References\n    ----------\n    .. [1] Kubat, M. and Matwin, S. \"Addressing the curse of\n       imbalanced training sets: one-sided selection\" ICML (1997)\n\n    .. [2] Barandela, R., Sánchez, J. S., Garcıa, V., & Rangel, E. 
\"Strategies\n       for learning in class imbalance problems\", Pattern Recognition,\n       36(3), (2003), pp 849-851.\n\n    Examples\n    --------\n    >>> from imblearn.metrics import geometric_mean_score\n    >>> y_true = [0, 1, 2, 0, 1, 2]\n    >>> y_pred = [0, 2, 1, 0, 0, 1]\n    >>> geometric_mean_score(y_true, y_pred)\n    0.0\n    >>> geometric_mean_score(y_true, y_pred, correction=0.001)\n    0.010...\n    >>> geometric_mean_score(y_true, y_pred, average='macro')\n    0.471...\n    >>> geometric_mean_score(y_true, y_pred, average='micro')\n    0.471...\n    >>> geometric_mean_score(y_true, y_pred, average='weighted')\n    0.471...\n    >>> geometric_mean_score(y_true, y_pred, average=None)\n    array([0.866...,  0.       ,  0.       ])\n    \"\"\"\n    if average is None or average != \"multiclass\":\n        sen, spe, _ = sensitivity_specificity_support(\n            y_true,\n            y_pred,\n            labels=labels,\n            pos_label=pos_label,\n            average=average,\n            warn_for=(\"specificity\", \"specificity\"),\n            sample_weight=sample_weight,\n        )\n\n        return np.sqrt(sen * spe)\n    else:\n        present_labels = unique_labels(y_true, y_pred)\n\n        if labels is None:\n            labels = present_labels\n            n_labels = None\n        else:\n            n_labels = len(labels)\n            labels = np.hstack(\n                [labels, np.setdiff1d(present_labels, labels, assume_unique=True)]\n            )\n\n        le = LabelEncoder()\n        le.fit(labels)\n        y_true = le.transform(y_true)\n        y_pred = le.transform(y_pred)\n        sorted_labels = le.classes_\n\n        # labels are now from 0 to len(labels) - 1 -> use bincount\n        tp = y_true == y_pred\n        tp_bins = y_true[tp]\n\n        if sample_weight is not None:\n            tp_bins_weights = np.asarray(sample_weight)[tp]\n        else:\n            tp_bins_weights = None\n\n        if len(tp_bins):\n         
   tp_sum = np.bincount(\n                tp_bins, weights=tp_bins_weights, minlength=len(labels)\n            )\n        else:\n            # Pathological case\n            true_sum = tp_sum = np.zeros(len(labels))\n        if len(y_true):\n            true_sum = np.bincount(y_true, weights=sample_weight, minlength=len(labels))\n\n        # Retain only selected labels\n        indices = np.searchsorted(sorted_labels, labels[:n_labels])\n        tp_sum = tp_sum[indices]\n        true_sum = true_sum[indices]\n\n        with np.errstate(divide=\"ignore\", invalid=\"ignore\"):\n            recall = _prf_divide(tp_sum, true_sum, \"recall\", \"true\", None, \"recall\")\n        recall[recall == 0] = correction\n\n        with np.errstate(divide=\"ignore\", invalid=\"ignore\"):\n            gmean = sp.stats.gmean(recall)\n        # old version of scipy return MaskedConstant instead of 0.0\n        if isinstance(gmean, np.ma.core.MaskedConstant):\n            return 0.0\n        return gmean\n\n\n@validate_params(\n    {\"alpha\": [numbers.Real], \"squared\": [\"boolean\"]},\n    prefer_skip_nested_validation=True,\n)\ndef make_index_balanced_accuracy(*, alpha=0.1, squared=True):\n    \"\"\"Balance any scoring function using the index balanced accuracy.\n\n    This factory function wraps scoring function to express it as the\n    index balanced accuracy (IBA). You need to use this function to\n    decorate any scoring function.\n\n    Only metrics requiring ``y_pred`` can be corrected with the index\n    balanced accuracy. 
``y_score`` cannot be used since the dominance\n    cannot be computed.\n\n    Read more in the :ref:`User Guide <imbalanced_metrics>`.\n\n    Parameters\n    ----------\n    alpha : float, default=0.1\n        Weighting factor.\n\n    squared : bool, default=True\n        If ``squared`` is True, then the metric computed will be squared\n        before being weighted.\n\n    Returns\n    -------\n    iba_scoring_func : callable\n        The decorated scoring metric, which automatically computes\n        the index balanced accuracy.\n\n    Notes\n    -----\n    See :ref:`sphx_glr_auto_examples_evaluation_plot_metrics.py`.\n\n    References\n    ----------\n    .. [1] García, Vicente, Javier Salvador Sánchez, and Ramón Alberto\n       Mollineda. \"On the effectiveness of preprocessing methods when dealing\n       with different levels of class imbalance.\" Knowledge-Based Systems 25.1\n       (2012): 13-21.\n\n    Examples\n    --------\n    >>> from imblearn.metrics import geometric_mean_score as gmean\n    >>> from imblearn.metrics import make_index_balanced_accuracy as iba\n    >>> gmean = iba(alpha=0.1, squared=True)(gmean)\n    >>> y_true = [1, 0, 0, 1, 0, 1]\n    >>> y_pred = [0, 0, 1, 1, 0, 1]\n    >>> print(gmean(y_true, y_pred, average=None))\n    [0.44...  0.44...]\n    \"\"\"\n\n    def decorate(scoring_func):\n        @functools.wraps(scoring_func)\n        def compute_score(*args, **kwargs):\n            signature_scoring_func = signature(scoring_func)\n            params_scoring_func = set(signature_scoring_func.parameters.keys())\n\n            # check that the scoring function does not need a score\n            # and only a prediction\n            prohibited_y_pred = set([\"y_score\", \"y_prob\", \"y2\"])\n            if prohibited_y_pred.intersection(params_scoring_func):\n                raise AttributeError(\n                    f\"The function {scoring_func.__name__} has an unsupported\"\n                    \" attribute. 
Metrics with `y_pred` are the\"\n                    \" only supported metrics.\"\n                )\n\n            args_scoring_func = signature_scoring_func.bind(*args, **kwargs)\n            args_scoring_func.apply_defaults()\n            _score = scoring_func(*args_scoring_func.args, **args_scoring_func.kwargs)\n            if squared:\n                _score = np.power(_score, 2)\n\n            signature_sens_spec = signature(sensitivity_specificity_support)\n            params_sens_spec = set(signature_sens_spec.parameters.keys())\n            common_params = params_sens_spec.intersection(\n                set(args_scoring_func.arguments.keys())\n            )\n\n            args_sens_spec = {k: args_scoring_func.arguments[k] for k in common_params}\n\n            if scoring_func.__name__ == \"geometric_mean_score\":\n                if \"average\" in args_sens_spec:\n                    if args_sens_spec[\"average\"] == \"multiclass\":\n                        args_sens_spec[\"average\"] = \"macro\"\n            elif (\n                scoring_func.__name__ == \"accuracy_score\"\n                or scoring_func.__name__ == \"jaccard_score\"\n            ):\n                # We do not support multilabel so the only average supported\n                # is binary\n                args_sens_spec[\"average\"] = \"binary\"\n\n            sensitivity, specificity, _ = sensitivity_specificity_support(\n                **args_sens_spec\n            )\n\n            dominance = sensitivity - specificity\n            return (1.0 + alpha * dominance) * _score\n\n        return compute_score\n\n    return decorate\n\n\n@validate_params(\n    {\n        \"y_true\": [\"array-like\"],\n        \"y_pred\": [\"array-like\"],\n        \"labels\": [\"array-like\", None],\n        \"target_names\": [\"array-like\", None],\n        \"sample_weight\": [\"array-like\", None],\n        \"digits\": [Interval(numbers.Integral, 0, None, 
closed=\"left\")],\n        \"alpha\": [numbers.Real],\n        \"output_dict\": [\"boolean\"],\n        \"zero_division\": [\n            StrOptions({\"warn\"}),\n            Interval(numbers.Integral, 0, 1, closed=\"both\"),\n        ],\n    },\n    prefer_skip_nested_validation=True,\n)\ndef classification_report_imbalanced(\n    y_true,\n    y_pred,\n    *,\n    labels=None,\n    target_names=None,\n    sample_weight=None,\n    digits=2,\n    alpha=0.1,\n    output_dict=False,\n    zero_division=\"warn\",\n):\n    \"\"\"Build a classification report based on metrics used with imbalanced dataset.\n\n    Specific metrics have been proposed to evaluate the classification\n    performed on imbalanced dataset. This report compiles the\n    state-of-the-art metrics: precision/recall/specificity, geometric\n    mean, and index balanced accuracy of the\n    geometric mean.\n\n    Read more in the :ref:`User Guide <classification_report>`.\n\n    Parameters\n    ----------\n    y_true : 1d array-like, or label indicator array / sparse matrix\n        Ground truth (correct) target values.\n\n    y_pred : 1d array-like, or label indicator array / sparse matrix\n        Estimated targets as returned by a classifier.\n\n    labels : array-like of shape (n_labels,), default=None\n        Optional list of label indices to include in the report.\n\n    target_names : list of str of shape (n_labels,), default=None\n        Optional display names matching the labels (same order).\n\n    sample_weight : array-like of shape (n_samples,), default=None\n        Sample weights.\n\n    digits : int, default=2\n        Number of digits for formatting output floating point values.\n        When ``output_dict`` is ``True``, this will be ignored and the\n        returned values will not be rounded.\n\n    alpha : float, default=0.1\n        Weighting factor.\n\n    output_dict : bool, default=False\n        If True, return output as dict.\n\n        .. 
versionadded:: 0.8\n\n    zero_division : \"warn\" or {0, 1}, default=\"warn\"\n        Sets the value to return when there is a zero division. If set to\n        \"warn\", this acts as 0, but warnings are also raised.\n\n        .. versionadded:: 0.8\n\n    Returns\n    -------\n    report : string / dict\n        Text summary of the precision, recall, specificity, geometric mean,\n        and index balanced accuracy.\n        Dictionary returned if output_dict is True. Dictionary has the\n        following structure::\n\n            {'label 1': {'pre':0.5,\n                         'rec':1.0,\n                         ...\n                        },\n             'label 2': { ... },\n              ...\n            }\n\n    Examples\n    --------\n    >>> import numpy as np\n    >>> from imblearn.metrics import classification_report_imbalanced\n    >>> y_true = [0, 1, 2, 2, 2]\n    >>> y_pred = [0, 0, 2, 2, 1]\n    >>> target_names = ['class 0', 'class 1', 'class 2']\n    >>> print(classification_report_imbalanced(y_true, y_pred, \\\n    target_names=target_names))\n                       pre       rec       spe        f1       geo       iba\\\n       sup\n    <BLANKLINE>\n        class 0       0.50      1.00      0.75      0.67      0.87      0.77\\\n         1\n        class 1       0.00      0.00      0.75      0.00      0.00      0.00\\\n         1\n        class 2       1.00      0.67      1.00      0.80      0.82      0.64\\\n         3\n    <BLANKLINE>\n    avg / total       0.70      0.60      0.90      0.61      0.66      0.54\\\n         5\n    <BLANKLINE>\n    \"\"\"\n\n    if labels is None:\n        labels = unique_labels(y_true, y_pred)\n    else:\n        labels = np.asarray(labels)\n\n    last_line_heading = \"avg / total\"\n\n    if target_names is None:\n        target_names = [f\"{label}\" for label in labels]\n    name_width = max(len(cn) for cn in target_names)\n    width = max(name_width, len(last_line_heading), digits)\n\n    headers = 
[\"pre\", \"rec\", \"spe\", \"f1\", \"geo\", \"iba\", \"sup\"]\n    fmt = \"%% %ds\" % width  # first column: class name\n    fmt += \"  \"\n    fmt += \" \".join([\"% 9s\" for _ in headers])\n    fmt += \"\\n\"\n\n    headers = [\"\"] + headers\n    report = fmt % tuple(headers)\n    report += \"\\n\"\n\n    # Compute the different metrics\n    # Precision/recall/f1\n    precision, recall, f1, support = precision_recall_fscore_support(\n        y_true,\n        y_pred,\n        labels=labels,\n        average=None,\n        sample_weight=sample_weight,\n        zero_division=zero_division,\n    )\n    # Specificity\n    specificity = specificity_score(\n        y_true,\n        y_pred,\n        labels=labels,\n        average=None,\n        sample_weight=sample_weight,\n    )\n    # Geometric mean\n    geo_mean = geometric_mean_score(\n        y_true,\n        y_pred,\n        labels=labels,\n        average=None,\n        sample_weight=sample_weight,\n    )\n    # Index balanced accuracy\n    iba_gmean = make_index_balanced_accuracy(alpha=alpha, squared=True)(\n        geometric_mean_score\n    )\n    iba = iba_gmean(\n        y_true,\n        y_pred,\n        labels=labels,\n        average=None,\n        sample_weight=sample_weight,\n    )\n\n    report_dict = {}\n    for i, label in enumerate(labels):\n        report_dict_label = {}\n        values = [target_names[i]]\n        for score_name, score_value in zip(\n            headers[1:-1],\n            [\n                precision[i],\n                recall[i],\n                specificity[i],\n                f1[i],\n                geo_mean[i],\n                iba[i],\n            ],\n        ):\n            values += [\"{0:0.{1}f}\".format(score_value, digits)]\n            report_dict_label[score_name] = score_value\n        values += [f\"{support[i]}\"]\n        report_dict_label[headers[-1]] = support[i]\n        report += fmt % tuple(values)\n\n        report_dict[target_names[i]] = 
report_dict_label\n\n    report += \"\\n\"\n\n    # compute averages\n    values = [last_line_heading]\n    for score_name, score_value in zip(\n        headers[1:-1],\n        [\n            np.average(precision, weights=support),\n            np.average(recall, weights=support),\n            np.average(specificity, weights=support),\n            np.average(f1, weights=support),\n            np.average(geo_mean, weights=support),\n            np.average(iba, weights=support),\n        ],\n    ):\n        values += [\"{0:0.{1}f}\".format(score_value, digits)]\n        report_dict[f\"avg_{score_name}\"] = score_value\n    values += [f\"{np.sum(support)}\"]\n    report += fmt % tuple(values)\n    report_dict[\"total_support\"] = np.sum(support)\n\n    if output_dict:\n        return report_dict\n    return report\n\n\n@validate_params(\n    {\n        \"y_true\": [\"array-like\"],\n        \"y_pred\": [\"array-like\"],\n        \"sample_weight\": [\"array-like\", None],\n    },\n    prefer_skip_nested_validation=True,\n)\ndef macro_averaged_mean_absolute_error(y_true, y_pred, *, sample_weight=None):\n    \"\"\"Compute Macro-Averaged MAE for imbalanced ordinal classification.\n\n    This function computes each MAE for each class and average them,\n    giving an equal weight to each class.\n\n    Read more in the :ref:`User Guide <macro_averaged_mean_absolute_error>`.\n\n    .. 
versionadded:: 0.8\n\n    Parameters\n    ----------\n    y_true : array-like of shape (n_samples,) or (n_samples, n_outputs)\n        Ground truth (correct) target values.\n\n    y_pred : array-like of shape (n_samples,) or (n_samples, n_outputs)\n        Estimated targets as returned by a classifier.\n\n    sample_weight : array-like of shape (n_samples,), default=None\n        Sample weights.\n\n    Returns\n    -------\n    loss : float or ndarray of floats\n        Macro-Averaged MAE output is non-negative floating point.\n        The best value is 0.0.\n\n    Examples\n    --------\n    >>> import numpy as np\n    >>> from sklearn.metrics import mean_absolute_error\n    >>> from imblearn.metrics import macro_averaged_mean_absolute_error\n    >>> y_true_balanced = [1, 1, 2, 2]\n    >>> y_true_imbalanced = [1, 2, 2, 2]\n    >>> y_pred = [1, 2, 1, 2]\n    >>> mean_absolute_error(y_true_balanced, y_pred)\n    0.5\n    >>> mean_absolute_error(y_true_imbalanced, y_pred)\n    0.25\n    >>> macro_averaged_mean_absolute_error(y_true_balanced, y_pred)\n    0.5\n    >>> macro_averaged_mean_absolute_error(y_true_imbalanced, y_pred)\n    0.16...\n    \"\"\"\n    _, y_true, y_pred, sample_weight = _check_targets(y_true, y_pred, sample_weight)\n    if sample_weight is not None:\n        sample_weight = column_or_1d(sample_weight)\n    else:\n        sample_weight = np.ones(y_true.shape)\n    check_consistent_length(y_true, y_pred, sample_weight)\n    labels = unique_labels(y_true, y_pred)\n    mae = []\n    for possible_class in labels:\n        indices = np.flatnonzero(y_true == possible_class)\n\n        mae.append(\n            mean_absolute_error(\n                y_true[indices],\n                y_pred[indices],\n                sample_weight=sample_weight[indices],\n            )\n        )\n\n    return np.sum(mae) / len(mae)\n"
  },
  {
    "path": "imblearn/metrics/pairwise.py",
    "content": "\"\"\"Metrics to perform pairwise computation.\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# License: MIT\n\nimport numbers\n\nimport numpy as np\nfrom scipy.spatial import distance_matrix\nfrom sklearn.base import BaseEstimator\nfrom sklearn.utils import check_consistent_length\nfrom sklearn.utils._param_validation import StrOptions\nfrom sklearn.utils.multiclass import unique_labels\nfrom sklearn.utils.validation import check_is_fitted\nfrom sklearn_compat.base import _fit_context\nfrom sklearn_compat.utils.validation import check_array, validate_data\n\n\nclass ValueDifferenceMetric(BaseEstimator):\n    r\"\"\"Class implementing the Value Difference Metric.\n\n    This metric computes the distance between samples containing only\n    categorical features. The distance between feature values of two samples is\n    defined as:\n\n    .. math::\n       \\delta(x, y) = \\sum_{c=1}^{C} |p(c|x_{f}) - p(c|y_{f})|^{k} \\ ,\n\n    where :math:`x` and :math:`y` are two samples and :math:`f` a given\n    feature, :math:`C` is the number of classes, :math:`p(c|x_{f})` is the\n    conditional probability that the output class is :math:`c` given that\n    the feature value :math:`f` has the value :math:`x` and :math:`k` an\n    exponent usually defined to 1 or 2.\n\n    The distance for the feature vectors :math:`X` and :math:`Y` is\n    subsequently defined as:\n\n    .. math::\n       \\Delta(X, Y) = \\sum_{f=1}^{F} \\delta(X_{f}, Y_{f})^{r} \\ ,\n\n    where :math:`F` is the number of feature and :math:`r` an exponent usually\n    defined equal to 1 or 2.\n\n    The definition of this distance was propoed in [1]_.\n\n    Read more in the :ref:`User Guide <vdm>`.\n\n    .. versionadded:: 0.8\n\n    Parameters\n    ----------\n    n_categories : \"auto\" or array-like of shape (n_features,), default=\"auto\"\n        The number of unique categories per features. 
If `\"auto\"`, the number\n        of categories will be computed from `X` at `fit`. Otherwise, you can\n        provide an array-like of such counts to avoid computation. You can use\n        the fitted attribute `categories_` of the\n        :class:`~sklearn.preprocessing.OrdinalEncoder` to deduce these counts.\n\n    k : int, default=1\n        Exponent used to compute the distance between feature values.\n\n    r : int, default=2\n        Exponent used to compute the distance between the feature vectors.\n\n    Attributes\n    ----------\n    n_categories_ : ndarray of shape (n_features,)\n        The number of categories per feature.\n\n    proba_per_class_ : list of ndarray of shape (n_categories, n_classes)\n        List of length `n_features` containing the conditional probabilities\n        of each class given a category.\n\n    n_features_in_ : int\n        Number of features in the input dataset.\n\n        .. versionadded:: 0.10\n\n    feature_names_in_ : ndarray of shape (`n_features_in_`,)\n        Names of features seen during `fit`. Defined only when `X` has feature\n        names that are all strings.\n\n        .. versionadded:: 0.10\n\n    See Also\n    --------\n    sklearn.neighbors.DistanceMetric : Interface for fast metric computation.\n\n    Notes\n    -----\n    The input data `X` are expected to be encoded by an\n    :class:`~sklearn.preprocessing.OrdinalEncoder` and the data type used\n    should be `np.int32`. If other data types are given, `X` will be converted\n    to `np.int32`.\n\n    References\n    ----------\n    .. [1] Stanfill, Craig, and David Waltz. 
\"Toward memory-based reasoning.\"\n       Communications of the ACM 29.12 (1986): 1213-1228.\n\n    Examples\n    --------\n    >>> import numpy as np\n    >>> X = np.array([\"green\"] * 10 + [\"red\"] * 10 + [\"blue\"] * 10).reshape(-1, 1)\n    >>> y = [1] * 8 + [0] * 5 + [1] * 7 + [0] * 9 + [1]\n    >>> from sklearn.preprocessing import OrdinalEncoder\n    >>> encoder = OrdinalEncoder(dtype=np.int32)\n    >>> X_encoded = encoder.fit_transform(X)\n    >>> from imblearn.metrics.pairwise import ValueDifferenceMetric\n    >>> vdm = ValueDifferenceMetric().fit(X_encoded, y)\n    >>> pairwise_distance = vdm.pairwise(X_encoded)\n    >>> pairwise_distance.shape\n    (30, 30)\n    >>> X_test = np.array([\"green\", \"red\", \"blue\"]).reshape(-1, 1)\n    >>> X_test_encoded = encoder.transform(X_test)\n    >>> vdm.pairwise(X_test_encoded)\n    array([[0.  ,  0.04,  1.96],\n           [0.04,  0.  ,  1.44],\n           [1.96,  1.44,  0.  ]])\n    \"\"\"\n\n    _parameter_constraints: dict = {\n        \"n_categories\": [StrOptions({\"auto\"}), \"array-like\"],\n        \"k\": [numbers.Integral],\n        \"r\": [numbers.Integral],\n    }\n\n    def __init__(self, *, n_categories=\"auto\", k=1, r=2):\n        self.n_categories = n_categories\n        self.k = k\n        self.r = r\n\n    @_fit_context(prefer_skip_nested_validation=True)\n    def fit(self, X, y):\n        \"\"\"Compute the necessary statistics from the training set.\n\n        Parameters\n        ----------\n        X : ndarray of shape (n_samples, n_features), dtype=np.int32\n            The input data. 
The data are expected to be encoded with a\n            :class:`~sklearn.preprocessing.OrdinalEncoder`.\n\n        y : ndarray of shape (n_samples,)\n            The target.\n\n        Returns\n        -------\n        self : object\n            Return the instance itself.\n        \"\"\"\n        self._validate_params()\n        check_consistent_length(X, y)\n        X, y = validate_data(self, X=X, y=y, reset=True, dtype=np.int32)\n        X = check_array(X, ensure_non_negative=True)\n\n        if isinstance(self.n_categories, str) and self.n_categories == \"auto\":\n            # categories are expected to be encoded from 0 to n_categories - 1\n            self.n_categories_ = X.max(axis=0) + 1\n        else:\n            if len(self.n_categories) != self.n_features_in_:\n                raise ValueError(\n                    \"The length of n_categories is not consistent with the \"\n                    f\"number of feature in X. Got {len(self.n_categories)} \"\n                    f\"elements in n_categories and {self.n_features_in_} in \"\n                    \"X.\"\n                )\n            self.n_categories_ = np.asarray(self.n_categories)\n        classes = unique_labels(y)\n\n        # list of length n_features of ndarray (n_categories, n_classes)\n        # compute the counts\n        self.proba_per_class_ = [\n            np.empty(shape=(n_cat, len(classes)), dtype=np.float64)\n            for n_cat in self.n_categories_\n        ]\n        for feature_idx in range(self.n_features_in_):\n            for klass_idx, klass in enumerate(classes):\n                self.proba_per_class_[feature_idx][:, klass_idx] = np.bincount(\n                    X[y == klass, feature_idx],\n                    minlength=self.n_categories_[feature_idx],\n                )\n\n        # normalize by summing over the classes\n        with np.errstate(invalid=\"ignore\"):\n            # silence potential warning due to in-place division by zero\n            for 
feature_idx in range(self.n_features_in_):\n                self.proba_per_class_[feature_idx] /= (\n                    self.proba_per_class_[feature_idx].sum(axis=1).reshape(-1, 1)\n                )\n                np.nan_to_num(self.proba_per_class_[feature_idx], copy=False)\n\n        return self\n\n    def pairwise(self, X, Y=None):\n        \"\"\"Compute the VDM distance pairwise.\n\n        Parameters\n        ----------\n        X : ndarray of shape (n_samples, n_features), dtype=np.int32\n            The input data. The data are expected to be encoded with a\n            :class:`~sklearn.preprocessing.OrdinalEncoder`.\n\n        Y : ndarray of shape (n_samples, n_features), dtype=np.int32\n            The input data. The data are expected to be encoded with a\n            :class:`~sklearn.preprocessing.OrdinalEncoder`.\n\n        Returns\n        -------\n        distance_matrix : ndarray of shape (n_samples, n_samples)\n            The VDM pairwise distance.\n        \"\"\"\n        check_is_fitted(self)\n        X = check_array(X, ensure_non_negative=True, dtype=np.int32)\n        n_samples_X = X.shape[0]\n\n        if Y is not None:\n            Y = check_array(Y, ensure_non_negative=True, dtype=np.int32)\n            n_samples_Y = Y.shape[0]\n        else:\n            n_samples_Y = n_samples_X\n\n        distance = np.zeros(shape=(n_samples_X, n_samples_Y), dtype=np.float64)\n        for feature_idx in range(self.n_features_in_):\n            proba_feature_X = self.proba_per_class_[feature_idx][X[:, feature_idx]]\n            if Y is not None:\n                proba_feature_Y = self.proba_per_class_[feature_idx][Y[:, feature_idx]]\n            else:\n                proba_feature_Y = proba_feature_X\n            distance += (\n                distance_matrix(proba_feature_X, proba_feature_Y, p=self.k) ** self.r\n            )\n        return distance\n\n    def _more_tags(self):\n        return {\n            \"requires_positive_X\": True,  # X 
should be encoded with OrdinalEncoder\n        }\n\n    def __sklearn_tags__(self):\n        tags = super().__sklearn_tags__()\n        tags.input_tags.positive_only = True\n        return tags\n"
  },
  {
    "path": "imblearn/metrics/tests/__init__.py",
    "content": ""
  },
  {
    "path": "imblearn/metrics/tests/test_classification.py",
    "content": "\"\"\"Testing the metric for classification with imbalanced dataset\"\"\"\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Christos Aridas\n# License: MIT\n\nfrom functools import partial\n\nimport numpy as np\nimport pytest\nfrom sklearn import datasets, svm\nfrom sklearn.metrics import (\n    accuracy_score,\n    average_precision_score,\n    brier_score_loss,\n    cohen_kappa_score,\n    jaccard_score,\n    precision_score,\n    recall_score,\n    roc_auc_score,\n)\nfrom sklearn.preprocessing import label_binarize\nfrom sklearn.utils._testing import (\n    assert_allclose,\n    assert_array_equal,\n)\nfrom sklearn.utils.validation import check_random_state\n\nfrom imblearn.metrics import (\n    classification_report_imbalanced,\n    geometric_mean_score,\n    macro_averaged_mean_absolute_error,\n    make_index_balanced_accuracy,\n    sensitivity_score,\n    sensitivity_specificity_support,\n    specificity_score,\n)\n\nRND_SEED = 42\nR_TOL = 1e-2\n\n###############################################################################\n# Utilities for testing\n\n\ndef make_prediction(dataset=None, binary=False):\n    \"\"\"Make some classification predictions on a toy dataset using a SVC\n    If binary is True restrict to a binary classification problem instead of a\n    multiclass classification problem\n    \"\"\"\n\n    if dataset is None:\n        # import some data to play with\n        dataset = datasets.load_iris()\n\n    X = dataset.data\n    y = dataset.target\n\n    if binary:\n        # restrict to a binary classification task\n        X, y = X[y < 2], y[y < 2]\n\n    n_samples, n_features = X.shape\n    p = np.arange(n_samples)\n\n    rng = check_random_state(37)\n    rng.shuffle(p)\n    X, y = X[p], y[p]\n    half = int(n_samples / 2)\n\n    # add noisy features to make the problem harder and avoid perfect results\n    rng = np.random.RandomState(0)\n    X = np.c_[X, rng.randn(n_samples, 200 * n_features)]\n\n    # run 
classifier, get class probabilities and label predictions\n    clf = svm.SVC(kernel=\"linear\", probability=True, random_state=0)\n    probas_pred = clf.fit(X[:half], y[:half]).predict_proba(X[half:])\n\n    if binary:\n        # only interested in probabilities of the positive case\n        # XXX: do we really want a special API for the binary case?\n        probas_pred = probas_pred[:, 1]\n\n    y_pred = clf.predict(X[half:])\n    y_true = y[half:]\n\n    return y_true, y_pred, probas_pred\n\n\n###############################################################################\n# Tests\n\n\ndef test_sensitivity_specificity_score_binary():\n    y_true, y_pred, _ = make_prediction(binary=True)\n\n    # detailed measures for each class\n    sen, spe, sup = sensitivity_specificity_support(y_true, y_pred, average=None)\n    assert_allclose(sen, [0.88, 0.68], rtol=R_TOL)\n    assert_allclose(spe, [0.68, 0.88], rtol=R_TOL)\n    assert_array_equal(sup, [25, 25])\n\n    # individual scoring function that can be used for grid search: in the\n    # binary class case the score is the value of the measure for the positive\n    # class (e.g. label == 1). 
This is deprecated for average != 'binary'.\n    for kwargs in ({}, {\"average\": \"binary\"}):\n        sen = sensitivity_score(y_true, y_pred, **kwargs)\n        assert sen == pytest.approx(0.68, rel=R_TOL)\n\n        spe = specificity_score(y_true, y_pred, **kwargs)\n        assert spe == pytest.approx(0.88, rel=R_TOL)\n\n\n@pytest.mark.filterwarnings(\"ignore:Specificity is ill-defined\")\n@pytest.mark.parametrize(\n    \"y_pred, expected_sensitivity, expected_specificity\",\n    [(([1, 1], [1, 1]), 1.0, 0.0), (([-1, -1], [-1, -1]), 0.0, 0.0)],\n)\ndef test_sensitivity_specificity_f_binary_single_class(\n    y_pred, expected_sensitivity, expected_specificity\n):\n    # Such a case may occur with non-stratified cross-validation\n    assert sensitivity_score(*y_pred) == expected_sensitivity\n    assert specificity_score(*y_pred) == expected_specificity\n\n\n@pytest.mark.parametrize(\n    \"average, expected_specificty\",\n    [\n        (None, [1.0, 0.67, 1.0, 1.0, 1.0]),\n        (\"macro\", np.mean([1.0, 0.67, 1.0, 1.0, 1.0])),\n        (\"micro\", 15 / 16),\n    ],\n)\ndef test_sensitivity_specificity_extra_labels(average, expected_specificty):\n    y_true = [1, 3, 3, 2]\n    y_pred = [1, 1, 3, 2]\n\n    actual = specificity_score(y_true, y_pred, labels=[0, 1, 2, 3, 4], average=average)\n    assert_allclose(expected_specificty, actual, rtol=R_TOL)\n\n\ndef test_sensitivity_specificity_ignored_labels():\n    y_true = [1, 1, 2, 3]\n    y_pred = [1, 3, 3, 3]\n\n    specificity_13 = partial(specificity_score, y_true, y_pred, labels=[1, 3])\n    specificity_all = partial(specificity_score, y_true, y_pred, labels=None)\n\n    assert_allclose([1.0, 0.33], specificity_13(average=None), rtol=R_TOL)\n    assert_allclose(np.mean([1.0, 0.33]), specificity_13(average=\"macro\"), rtol=R_TOL)\n    assert_allclose(\n        np.average([1.0, 0.33], weights=[2.0, 1.0]),\n        specificity_13(average=\"weighted\"),\n        rtol=R_TOL,\n    )\n    assert_allclose(3.0 / (3.0 + 
2.0), specificity_13(average=\"micro\"), rtol=R_TOL)\n\n    # ensure the above were meaningful tests:\n    for each in [\"macro\", \"weighted\", \"micro\"]:\n        assert specificity_13(average=each) != specificity_all(average=each)\n\n\ndef test_sensitivity_specificity_error_multilabels():\n    y_true = [1, 3, 3, 2]\n    y_pred = [1, 1, 3, 2]\n    y_true_bin = label_binarize(y_true, classes=np.arange(5))\n    y_pred_bin = label_binarize(y_pred, classes=np.arange(5))\n\n    with pytest.raises(ValueError):\n        sensitivity_score(y_true_bin, y_pred_bin)\n\n\ndef test_sensitivity_specificity_support_errors():\n    y_true, y_pred, _ = make_prediction(binary=True)\n\n    # Bad pos_label\n    with pytest.raises(ValueError):\n        sensitivity_specificity_support(y_true, y_pred, pos_label=2, average=\"binary\")\n\n    # Bad average option\n    with pytest.raises(ValueError):\n        sensitivity_specificity_support([0, 1, 2], [1, 2, 0], average=\"mega\")\n\n\ndef test_sensitivity_specificity_unused_pos_label():\n    # but average != 'binary'; even if data is binary\n    msg = r\"use labels=\\[pos_label\\] to specify a single\"\n    with pytest.warns(UserWarning, match=msg):\n        sensitivity_specificity_support(\n            [1, 2, 1], [1, 2, 2], pos_label=2, average=\"macro\"\n        )\n\n\ndef test_geometric_mean_support_binary():\n    y_true, y_pred, _ = make_prediction(binary=True)\n\n    # compute the geometric mean for the binary problem\n    geo_mean = geometric_mean_score(y_true, y_pred)\n\n    assert_allclose(geo_mean, 0.77, rtol=R_TOL)\n\n\n@pytest.mark.filterwarnings(\"ignore:Recall is ill-defined\")\n@pytest.mark.parametrize(\n    \"y_true, y_pred, correction, expected_gmean\",\n    [\n        ([0, 0, 1, 1], [0, 0, 1, 1], 0.0, 1.0),\n        ([0, 0, 0, 0], [1, 1, 1, 1], 0.0, 0.0),\n        ([0, 0, 0, 0], [0, 0, 0, 0], 0.001, 1.0),\n        ([0, 0, 0, 0], [1, 1, 1, 1], 0.001, 0.001),\n        ([0, 0, 1, 1], [0, 1, 1, 0], 0.001, 0.5),\n        (\n    
        [0, 1, 2, 0, 1, 2],\n            [0, 2, 1, 0, 0, 1],\n            0.001,\n            (0.001**2) ** (1 / 3),\n        ),\n        ([0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5], 0.001, 1),\n        ([0, 1, 1, 1, 1, 0], [0, 0, 1, 1, 1, 1], 0.001, (0.5 * 0.75) ** 0.5),\n    ],\n)\ndef test_geometric_mean_multiclass(y_true, y_pred, correction, expected_gmean):\n    gmean = geometric_mean_score(y_true, y_pred, correction=correction)\n    assert gmean == pytest.approx(expected_gmean, rel=R_TOL)\n\n\n@pytest.mark.filterwarnings(\"ignore:Recall is ill-defined\")\n@pytest.mark.parametrize(\n    \"y_true, y_pred, average, expected_gmean\",\n    [\n        ([0, 1, 2, 0, 1, 2], [0, 2, 1, 0, 0, 1], \"macro\", 0.471),\n        ([0, 1, 2, 0, 1, 2], [0, 2, 1, 0, 0, 1], \"micro\", 0.471),\n        ([0, 1, 2, 0, 1, 2], [0, 2, 1, 0, 0, 1], \"weighted\", 0.471),\n        ([0, 1, 2, 0, 1, 2], [0, 2, 1, 0, 0, 1], None, [0.8660254, 0.0, 0.0]),\n    ],\n)\ndef test_geometric_mean_average(y_true, y_pred, average, expected_gmean):\n    gmean = geometric_mean_score(y_true, y_pred, average=average)\n    assert gmean == pytest.approx(expected_gmean, rel=R_TOL)\n\n\n@pytest.mark.parametrize(\n    \"y_true, y_pred, sample_weight, average, expected_gmean\",\n    [\n        ([0, 1, 2, 0, 1, 2], [0, 1, 1, 0, 0, 1], None, \"multiclass\", 0.707),\n        (\n            [0, 1, 2, 0, 1, 2],\n            [0, 1, 1, 0, 0, 1],\n            [1, 2, 1, 1, 2, 1],\n            \"multiclass\",\n            0.707,\n        ),\n        (\n            [0, 1, 2, 0, 1, 2],\n            [0, 1, 1, 0, 0, 1],\n            [1, 2, 1, 1, 2, 1],\n            \"weighted\",\n            0.333,\n        ),\n    ],\n)\ndef test_geometric_mean_sample_weight(\n    y_true, y_pred, sample_weight, average, expected_gmean\n):\n    gmean = geometric_mean_score(\n        y_true,\n        y_pred,\n        labels=[0, 1],\n        sample_weight=sample_weight,\n        average=average,\n    )\n    assert gmean == 
pytest.approx(expected_gmean, rel=R_TOL)\n\n\n@pytest.mark.parametrize(\n    \"average, expected_gmean\",\n    [\n        (\"multiclass\", 0.41),\n        (None, [0.85, 0.29, 0.7]),\n        (\"macro\", 0.68),\n        (\"weighted\", 0.65),\n    ],\n)\ndef test_geometric_mean_score_prediction(average, expected_gmean):\n    y_true, y_pred, _ = make_prediction(binary=False)\n\n    gmean = geometric_mean_score(y_true, y_pred, average=average)\n    assert gmean == pytest.approx(expected_gmean, rel=R_TOL)\n\n\ndef test_iba_geo_mean_binary():\n    y_true, y_pred, _ = make_prediction(binary=True)\n\n    iba_gmean = make_index_balanced_accuracy(alpha=0.5, squared=True)(\n        geometric_mean_score\n    )\n    iba = iba_gmean(y_true, y_pred)\n\n    assert_allclose(iba, 0.5948, rtol=R_TOL)\n\n\ndef _format_report(report):\n    return \" \".join(report.split())\n\n\ndef test_classification_report_imbalanced_multiclass():\n    iris = datasets.load_iris()\n    y_true, y_pred, _ = make_prediction(dataset=iris, binary=False)\n\n    # print classification report with class names\n    expected_report = (\n        \"pre rec spe f1 geo iba sup setosa 0.83 0.79 0.92 \"\n        \"0.81 0.85 0.72 24 versicolor 0.33 0.10 0.86 0.15 \"\n        \"0.29 0.08 31 virginica 0.42 0.90 0.55 0.57 0.70 \"\n        \"0.51 20 avg / total 0.51 0.53 0.80 0.47 0.58 0.40 75\"\n    )\n\n    report = classification_report_imbalanced(\n        y_true,\n        y_pred,\n        labels=np.arange(len(iris.target_names)),\n        target_names=iris.target_names,\n    )\n    assert _format_report(report) == expected_report\n    # print classification report with label detection\n    expected_report = (\n        \"pre rec spe f1 geo iba sup 0 0.83 0.79 0.92 0.81 \"\n        \"0.85 0.72 24 1 0.33 0.10 0.86 0.15 0.29 0.08 31 \"\n        \"2 0.42 0.90 0.55 0.57 0.70 0.51 20 avg / total \"\n        \"0.51 0.53 0.80 0.47 0.58 0.40 75\"\n    )\n\n    report = classification_report_imbalanced(y_true, y_pred)\n    
assert _format_report(report) == expected_report\n\n\ndef test_classification_report_imbalanced_multiclass_with_digits():\n    iris = datasets.load_iris()\n    y_true, y_pred, _ = make_prediction(dataset=iris, binary=False)\n\n    # print classification report with class names\n    expected_report = (\n        \"pre rec spe f1 geo iba sup setosa 0.82609 0.79167 \"\n        \"0.92157 0.80851 0.85415 0.72010 24 versicolor \"\n        \"0.33333 0.09677 0.86364 0.15000 0.28910 0.07717 \"\n        \"31 virginica 0.41860 0.90000 0.54545 0.57143 0.70065 \"\n        \"0.50831 20 avg / total 0.51375 0.53333 0.79733 \"\n        \"0.47310 0.57966 0.39788 75\"\n    )\n    report = classification_report_imbalanced(\n        y_true,\n        y_pred,\n        labels=np.arange(len(iris.target_names)),\n        target_names=iris.target_names,\n        digits=5,\n    )\n    assert _format_report(report) == expected_report\n    # print classification report with label detection\n    expected_report = (\n        \"pre rec spe f1 geo iba sup 0 0.83 0.79 0.92 0.81 \"\n        \"0.85 0.72 24 1 0.33 0.10 0.86 0.15 0.29 0.08 31 \"\n        \"2 0.42 0.90 0.55 0.57 0.70 0.51 20 avg / total 0.51 \"\n        \"0.53 0.80 0.47 0.58 0.40 75\"\n    )\n    report = classification_report_imbalanced(y_true, y_pred)\n    assert _format_report(report) == expected_report\n\n\ndef test_classification_report_imbalanced_multiclass_with_string_label():\n    y_true, y_pred, _ = make_prediction(binary=False)\n\n    y_true = np.array([\"blue\", \"green\", \"red\"])[y_true]\n    y_pred = np.array([\"blue\", \"green\", \"red\"])[y_pred]\n\n    expected_report = (\n        \"pre rec spe f1 geo iba sup blue 0.83 0.79 0.92 0.81 \"\n        \"0.85 0.72 24 green 0.33 0.10 0.86 0.15 0.29 0.08 31 \"\n        \"red 0.42 0.90 0.55 0.57 0.70 0.51 20 avg / total \"\n        \"0.51 0.53 0.80 0.47 0.58 0.40 75\"\n    )\n    report = classification_report_imbalanced(y_true, y_pred)\n    assert _format_report(report) == 
expected_report\n\n    expected_report = (\n        \"pre rec spe f1 geo iba sup a 0.83 0.79 0.92 0.81 0.85 \"\n        \"0.72 24 b 0.33 0.10 0.86 0.15 0.29 0.08 31 c 0.42 \"\n        \"0.90 0.55 0.57 0.70 0.51 20 avg / total 0.51 0.53 \"\n        \"0.80 0.47 0.58 0.40 75\"\n    )\n    report = classification_report_imbalanced(\n        y_true, y_pred, target_names=[\"a\", \"b\", \"c\"]\n    )\n    assert _format_report(report) == expected_report\n\n\ndef test_classification_report_imbalanced_multiclass_with_unicode_label():\n    y_true, y_pred, _ = make_prediction(binary=False)\n\n    labels = np.array([\"blue\\xa2\", \"green\\xa2\", \"red\\xa2\"])\n    y_true = labels[y_true]\n    y_pred = labels[y_pred]\n\n    expected_report = (\n        \"pre rec spe f1 geo iba sup blue¢ 0.83 0.79 0.92 0.81 \"\n        \"0.85 0.72 24 green¢ 0.33 0.10 0.86 0.15 0.29 0.08 31 \"\n        \"red¢ 0.42 0.90 0.55 0.57 0.70 0.51 20 avg / total \"\n        \"0.51 0.53 0.80 0.47 0.58 0.40 75\"\n    )\n    report = classification_report_imbalanced(y_true, y_pred)\n    assert _format_report(report) == expected_report\n\n\ndef test_classification_report_imbalanced_multiclass_with_long_string_label():\n    y_true, y_pred, _ = make_prediction(binary=False)\n\n    labels = np.array([\"blue\", \"green\" * 5, \"red\"])\n    y_true = labels[y_true]\n    y_pred = labels[y_pred]\n\n    expected_report = (\n        \"pre rec spe f1 geo iba sup blue 0.83 0.79 0.92 0.81 \"\n        \"0.85 0.72 24 greengreengreengreengreen 0.33 0.10 \"\n        \"0.86 0.15 0.29 0.08 31 red 0.42 0.90 0.55 0.57 0.70 \"\n        \"0.51 20 avg / total 0.51 0.53 0.80 0.47 0.58 0.40 75\"\n    )\n\n    report = classification_report_imbalanced(y_true, y_pred)\n    assert _format_report(report) == expected_report\n\n\n@pytest.mark.parametrize(\n    \"score, expected_score\",\n    [\n        (accuracy_score, 0.54756),\n        (jaccard_score, 0.33176),\n        (precision_score, 0.65025),\n        (recall_score, 0.41616),\n    
],\n)\ndef test_iba_sklearn_metrics(score, expected_score):\n    y_true, y_pred, _ = make_prediction(binary=True)\n\n    score_iba = make_index_balanced_accuracy(alpha=0.5, squared=True)(score)\n    score = score_iba(y_true, y_pred)\n    assert score == pytest.approx(expected_score)\n\n\n@pytest.mark.parametrize(\n    \"score_loss\",\n    [average_precision_score, brier_score_loss, cohen_kappa_score, roc_auc_score],\n)\ndef test_iba_error_y_score_prob_error(score_loss):\n    y_true, y_pred, _ = make_prediction(binary=True)\n\n    aps = make_index_balanced_accuracy(alpha=0.5, squared=True)(score_loss)\n    with pytest.raises((AttributeError, TypeError)):\n        aps(y_true, y_pred)\n\n\ndef test_classification_report_imbalanced_dict_with_target_names():\n    iris = datasets.load_iris()\n    y_true, y_pred, _ = make_prediction(dataset=iris, binary=False)\n\n    report = classification_report_imbalanced(\n        y_true,\n        y_pred,\n        labels=np.arange(len(iris.target_names)),\n        target_names=iris.target_names,\n        output_dict=True,\n    )\n    outer_keys = set(report.keys())\n    inner_keys = set(report[\"setosa\"].keys())\n\n    expected_outer_keys = {\n        \"setosa\",\n        \"versicolor\",\n        \"virginica\",\n        \"avg_pre\",\n        \"avg_rec\",\n        \"avg_spe\",\n        \"avg_f1\",\n        \"avg_geo\",\n        \"avg_iba\",\n        \"total_support\",\n    }\n    expected_inner_keys = {\"spe\", \"f1\", \"sup\", \"rec\", \"geo\", \"iba\", \"pre\"}\n\n    assert outer_keys == expected_outer_keys\n    assert inner_keys == expected_inner_keys\n\n\ndef test_classification_report_imbalanced_dict_without_target_names():\n    iris = datasets.load_iris()\n    y_true, y_pred, _ = make_prediction(dataset=iris, binary=False)\n    report = classification_report_imbalanced(\n        y_true,\n        y_pred,\n        labels=np.arange(len(iris.target_names)),\n        output_dict=True,\n    )\n    outer_keys = set(report.keys())\n    
inner_keys = set(report[\"0\"].keys())\n\n    expected_outer_keys = {\n        \"0\",\n        \"1\",\n        \"2\",\n        \"avg_pre\",\n        \"avg_rec\",\n        \"avg_spe\",\n        \"avg_f1\",\n        \"avg_geo\",\n        \"avg_iba\",\n        \"total_support\",\n    }\n    expected_inner_keys = {\"spe\", \"f1\", \"sup\", \"rec\", \"geo\", \"iba\", \"pre\"}\n\n    assert outer_keys == expected_outer_keys\n    assert inner_keys == expected_inner_keys\n\n\n@pytest.mark.parametrize(\n    \"y_true, y_pred, expected_ma_mae\",\n    [\n        ([1, 1, 1, 2, 2, 2], [1, 2, 1, 2, 1, 2], 0.333),\n        ([1, 1, 1, 1, 1, 2], [1, 2, 1, 2, 1, 2], 0.2),\n        ([1, 1, 1, 2, 2, 2, 3, 3, 3], [1, 3, 1, 2, 1, 1, 2, 3, 3], 0.555),\n        ([1, 1, 1, 1, 1, 1, 2, 3, 3], [1, 3, 1, 2, 1, 1, 2, 3, 3], 0.166),\n    ],\n)\ndef test_macro_averaged_mean_absolute_error(y_true, y_pred, expected_ma_mae):\n    ma_mae = macro_averaged_mean_absolute_error(y_true, y_pred)\n    assert ma_mae == pytest.approx(expected_ma_mae, rel=R_TOL)\n\n\ndef test_macro_averaged_mean_absolute_error_sample_weight():\n    y_true = [1, 1, 1, 2, 2, 2]\n    y_pred = [1, 2, 1, 2, 1, 2]\n\n    ma_mae_no_weights = macro_averaged_mean_absolute_error(y_true, y_pred)\n\n    sample_weight = [1, 1, 1, 1, 1, 1]\n    ma_mae_unit_weights = macro_averaged_mean_absolute_error(\n        y_true,\n        y_pred,\n        sample_weight=sample_weight,\n    )\n\n    assert ma_mae_unit_weights == pytest.approx(ma_mae_no_weights)\n"
  },
  {
    "path": "imblearn/metrics/tests/test_pairwise.py",
    "content": "\"\"\"Test for the metrics that perform pairwise distance computation.\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# License: MIT\n\nimport numpy as np\nimport pytest\nfrom sklearn.exceptions import NotFittedError\nfrom sklearn.preprocessing import LabelEncoder, OrdinalEncoder\nfrom sklearn.utils._testing import _convert_container\n\nfrom imblearn.metrics.pairwise import ValueDifferenceMetric\n\n\n@pytest.fixture\ndef data():\n    rng = np.random.RandomState(0)\n\n    feature_1 = [\"A\"] * 10 + [\"B\"] * 20 + [\"C\"] * 30\n    feature_2 = [\"A\"] * 40 + [\"B\"] * 20\n    feature_3 = [\"A\"] * 20 + [\"B\"] * 20 + [\"C\"] * 10 + [\"D\"] * 10\n    X = np.array([feature_1, feature_2, feature_3], dtype=object).T\n    rng.shuffle(X)\n    y = rng.randint(low=0, high=2, size=X.shape[0])\n    y_labels = np.array([\"not apple\", \"apple\"], dtype=object)\n    y = y_labels[y]\n    return X, y\n\n\n@pytest.mark.parametrize(\"dtype\", [np.int32, np.int64, np.float32, np.float64])\n@pytest.mark.parametrize(\"k, r\", [(1, 1), (1, 2), (2, 1), (2, 2)])\n@pytest.mark.parametrize(\"y_type\", [\"list\", \"array\"])\n@pytest.mark.parametrize(\"encode_label\", [True, False])\ndef test_value_difference_metric(data, dtype, k, r, y_type, encode_label):\n    # Check basic feature of the metric:\n    # * the shape of the distance matrix is (n_samples, n_samples)\n    # * computing pairwise distance of X is the same than explicitely between\n    #   X and X.\n    X, y = data\n    y = _convert_container(y, y_type)\n    if encode_label:\n        y = LabelEncoder().fit_transform(y)\n\n    encoder = OrdinalEncoder(dtype=dtype)\n    X_encoded = encoder.fit_transform(X)\n\n    vdm = ValueDifferenceMetric(k=k, r=r)\n    vdm.fit(X_encoded, y)\n\n    dist_1 = vdm.pairwise(X_encoded)\n    dist_2 = vdm.pairwise(X_encoded, X_encoded)\n\n    np.testing.assert_allclose(dist_1, dist_2)\n    assert dist_1.shape == (X.shape[0], X.shape[0])\n    assert dist_2.shape == 
(X.shape[0], X.shape[0])\n\n\n@pytest.mark.parametrize(\"dtype\", [np.int32, np.int64, np.float32, np.float64])\n@pytest.mark.parametrize(\"k, r\", [(1, 1), (1, 2), (2, 1), (2, 2)])\n@pytest.mark.parametrize(\"y_type\", [\"list\", \"array\"])\n@pytest.mark.parametrize(\"encode_label\", [True, False])\ndef test_value_difference_metric_property(dtype, k, r, y_type, encode_label):\n    # Check the property of the vdm distance. Let's check the property\n    # described in \"Improved Heterogeneous Distance Functions\", D.R. Wilson and\n    # T.R. Martinez, Journal of Artificial Intelligence Research 6 (1997) 1-34\n    # https://arxiv.org/pdf/cs/9701101.pdf\n    #\n    # \"if an attribute color has three values red, green and blue, and the\n    # application is to identify whether or not an object is an apple, red and\n    # green would be considered closer than red and blue because the former two\n    # both have similar correlations with the output class apple.\"\n\n    # defined our feature\n    X = np.array([\"green\"] * 10 + [\"red\"] * 10 + [\"blue\"] * 10).reshape(-1, 1)\n    # 0 - not an apple / 1 - an apple\n    y = np.array([1] * 8 + [0] * 5 + [1] * 7 + [0] * 9 + [1])\n    y_labels = np.array([\"not apple\", \"apple\"], dtype=object)\n    y = y_labels[y]\n    y = _convert_container(y, y_type)\n    if encode_label:\n        y = LabelEncoder().fit_transform(y)\n\n    encoder = OrdinalEncoder(dtype=dtype)\n    X_encoded = encoder.fit_transform(X)\n\n    vdm = ValueDifferenceMetric(k=k, r=r)\n    vdm.fit(X_encoded, y)\n\n    sample_green = encoder.transform([[\"green\"]])\n    sample_red = encoder.transform([[\"red\"]])\n    sample_blue = encoder.transform([[\"blue\"]])\n\n    for sample in (sample_green, sample_red, sample_blue):\n        # computing the distance between a sample of the same category should\n        # give a null distance\n        dist = vdm.pairwise(sample).squeeze()\n        assert dist == pytest.approx(0)\n\n    # check the property explained 
in the introduction example\n    dist_1 = vdm.pairwise(sample_green, sample_red).squeeze()\n    dist_2 = vdm.pairwise(sample_blue, sample_red).squeeze()\n    dist_3 = vdm.pairwise(sample_blue, sample_green).squeeze()\n\n    # green and red are very close\n    # blue is closer to red than green\n    assert dist_1 < dist_2\n    assert dist_1 < dist_3\n    assert dist_2 < dist_3\n\n\ndef test_value_difference_metric_categories(data):\n    # Check that \"auto\" is equivalent to providing the number of categories\n    # beforehand\n    X, y = data\n\n    encoder = OrdinalEncoder(dtype=np.int32)\n    X_encoded = encoder.fit_transform(X)\n    n_categories = np.array([len(cat) for cat in encoder.categories_])\n\n    vdm_auto = ValueDifferenceMetric().fit(X_encoded, y)\n    vdm_categories = ValueDifferenceMetric(n_categories=n_categories)\n    vdm_categories.fit(X_encoded, y)\n\n    np.testing.assert_array_equal(vdm_auto.n_categories_, n_categories)\n    np.testing.assert_array_equal(vdm_auto.n_categories_, vdm_categories.n_categories_)\n\n\ndef test_value_difference_metric_categories_error(data):\n    # Check that we raise an error if n_categories is inconsistent with the\n    # number of features in X\n    X, y = data\n\n    encoder = OrdinalEncoder(dtype=np.int32)\n    X_encoded = encoder.fit_transform(X)\n    n_categories = [1, 2]\n\n    vdm = ValueDifferenceMetric(n_categories=n_categories)\n    err_msg = \"The length of n_categories is not consistent with the number\"\n    with pytest.raises(ValueError, match=err_msg):\n        vdm.fit(X_encoded, y)\n\n\ndef test_value_difference_metric_missing_categories(data):\n    # Check that we don't get an issue when a category is missing between 0 and\n    # n_categories - 1\n    X, y = data\n\n    encoder = OrdinalEncoder(dtype=np.int32)\n    X_encoded = encoder.fit_transform(X)\n    n_categories = np.array([len(cat) for cat in encoder.categories_])\n\n    # remove a category that could be between 0 and n_categories\n    
X_encoded[X_encoded[:, -1] == 1] = 0\n    np.testing.assert_array_equal(np.unique(X_encoded[:, -1]), [0, 2, 3])\n\n    vdm = ValueDifferenceMetric(n_categories=n_categories)\n    vdm.fit(X_encoded, y)\n\n    for n_cats, proba in zip(n_categories, vdm.proba_per_class_):\n        assert proba.shape == (n_cats, len(np.unique(y)))\n\n\ndef test_value_difference_value_unfitted(data):\n    # Check that we raise a NotFittedError when `fit` is not called before\n    # `pairwise`.\n    X, y = data\n\n    encoder = OrdinalEncoder(dtype=np.int32)\n    X_encoded = encoder.fit_transform(X)\n\n    with pytest.raises(NotFittedError):\n        ValueDifferenceMetric().pairwise(X_encoded)\n"
  },
  {
    "path": "imblearn/metrics/tests/test_score_objects.py",
    "content": "\"\"\"Test for score\"\"\"\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Christos Aridas\n# License: MIT\n\nimport pytest\nfrom sklearn.datasets import make_blobs\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.metrics import make_scorer\nfrom sklearn.model_selection import GridSearchCV, train_test_split\n\nfrom imblearn.metrics import (\n    geometric_mean_score,\n    make_index_balanced_accuracy,\n    sensitivity_score,\n    specificity_score,\n)\n\nR_TOL = 1e-2\n\n\n@pytest.fixture\ndef data():\n    X, y = make_blobs(random_state=0, centers=2)\n    return train_test_split(X, y, random_state=0)\n\n\n@pytest.mark.parametrize(\n    \"score, expected_score\",\n    [\n        (sensitivity_score, 0.90),\n        (specificity_score, 0.90),\n        (geometric_mean_score, 0.90),\n        (make_index_balanced_accuracy()(geometric_mean_score), 0.82),\n    ],\n)\n@pytest.mark.parametrize(\"average\", [\"macro\", \"weighted\", \"micro\"])\ndef test_scorer_common_average(data, score, expected_score, average):\n    X_train, X_test, y_train, _ = data\n\n    scorer = make_scorer(score, pos_label=None, average=average)\n    grid = GridSearchCV(\n        LogisticRegression(),\n        param_grid={\"C\": [1, 10]},\n        scoring=scorer,\n        cv=3,\n    )\n    grid.fit(X_train, y_train).predict(X_test)\n\n    assert grid.best_score_ >= expected_score\n\n\n@pytest.mark.parametrize(\n    \"score, average, expected_score\",\n    [\n        (sensitivity_score, \"binary\", 0.94),\n        (specificity_score, \"binary\", 0.89),\n        (geometric_mean_score, \"multiclass\", 0.90),\n        (\n            make_index_balanced_accuracy()(geometric_mean_score),\n            \"multiclass\",\n            0.82,\n        ),\n    ],\n)\ndef test_scorer_default_average(data, score, average, expected_score):\n    X_train, X_test, y_train, _ = data\n\n    scorer = make_scorer(score, pos_label=1, average=average)\n    grid = 
GridSearchCV(\n        LogisticRegression(),\n        param_grid={\"C\": [1, 10]},\n        scoring=scorer,\n        cv=3,\n    )\n    grid.fit(X_train, y_train).predict(X_test)\n\n    assert grid.best_score_ >= expected_score\n"
  },
  {
    "path": "imblearn/model_selection/__init__.py",
    "content": "\"\"\"\nThe :mod:`imblearn.model_selection` provides methods to split the dataset into\ntraining and test sets.\n\"\"\"\n\nfrom imblearn.model_selection._split import InstanceHardnessCV\n\n__all__ = [\"InstanceHardnessCV\"]\n"
  },
  {
    "path": "imblearn/model_selection/_split.py",
    "content": "import warnings\n\nimport numpy as np\nfrom sklearn.base import clone\nfrom sklearn.model_selection import LeaveOneGroupOut, cross_val_predict\nfrom sklearn.model_selection._split import BaseCrossValidator\nfrom sklearn.utils.multiclass import type_of_target\nfrom sklearn.utils.validation import _num_samples\n\n\nclass InstanceHardnessCV(BaseCrossValidator):\n    \"\"\"Instance-hardness cross-validation splitter.\n\n    Cross-validation splitter that distributes samples with large instance hardness\n    equally over the folds. The instance hardness is internally estimated by using\n    `estimator` and stratified cross-validation.\n\n    Read more in the :ref:`User Guide <instance_hardness_threshold_cv>`.\n\n    Parameters\n    ----------\n    estimator : estimator object\n        Classifier to be used to estimate instance hardness of the samples.\n        This classifier should implement `predict_proba`.\n\n    n_splits : int, default=5\n        Number of folds. Must be at least 2.\n\n    pos_label : int, float, bool or str, default=None\n        The class considered the positive class when selecting the probability\n        representing the instance hardness. If None, the positive class is\n        automatically inferred by the estimator as `estimator.classes_[1]`.\n\n    Examples\n    --------\n    >>> from imblearn.model_selection import InstanceHardnessCV\n    >>> from sklearn.datasets import make_classification\n    >>> from sklearn.model_selection import cross_validate\n    >>> from sklearn.linear_model import LogisticRegression\n    >>> X, y = make_classification(weights=[0.9, 0.1], class_sep=2,\n    ... 
n_informative=3, n_redundant=1, flip_y=0.05, n_samples=1000, random_state=10)\n    >>> estimator = LogisticRegression()\n    >>> ih_cv = InstanceHardnessCV(estimator)\n    >>> cv_result = cross_validate(estimator, X, y, cv=ih_cv)\n    >>> print(f\"Standard deviation of test_scores: {cv_result['test_score'].std():.3f}\")\n    Standard deviation of test_scores: 0.00...\n    \"\"\"\n\n    def __init__(self, estimator, *, n_splits=5, pos_label=None):\n        self.estimator = estimator\n        self.n_splits = n_splits\n        self.pos_label = pos_label\n\n    def split(self, X, y, groups=None):\n        \"\"\"Generate indices to split data into training and test set.\n\n        Parameters\n        ----------\n        X : array-like of shape (n_samples, n_features)\n            Training data, where `n_samples` is the number of samples\n            and `n_features` is the number of features.\n\n        y : array-like of shape (n_samples,)\n            The target variable for supervised learning problems.\n\n        groups : object\n            Always ignored, exists for compatibility.\n\n        Yields\n        ------\n        train : ndarray\n            The training set indices for that split.\n\n        test : ndarray\n            The testing set indices for that split.\n        \"\"\"\n        if groups is not None:\n            warnings.warn(\n                f\"The groups parameter is ignored by {self.__class__.__name__}\",\n                UserWarning,\n            )\n\n        classes = np.unique(y)\n        y_type = type_of_target(y)\n        if y_type != \"binary\":\n            raise ValueError(\"InstanceHardnessCV only supports binary classification.\")\n        if self.pos_label is None:\n            pos_label = 1\n        else:\n            pos_label = np.flatnonzero(classes == self.pos_label)[0]\n\n        y_proba = cross_val_predict(\n            clone(self.estimator), X, y, cv=self.n_splits, method=\"predict_proba\"\n        )\n        # sorting first 
on y and then by the instance hardness\n        sorted_indices = np.lexsort((y_proba[:, pos_label], y))\n        groups = np.empty(_num_samples(X), dtype=int)\n        groups[sorted_indices] = np.arange(_num_samples(X)) % self.n_splits\n        cv = LeaveOneGroupOut()\n        yield from cv.split(X, y, groups)\n\n    def get_n_splits(self, X=None, y=None, groups=None):\n        \"\"\"Return the number of splitting iterations in the cross-validator.\n\n        Parameters\n        ----------\n        X : object\n            Always ignored, exists for compatibility.\n\n        y : object\n            Always ignored, exists for compatibility.\n\n        groups : object\n            Always ignored, exists for compatibility.\n\n        Returns\n        -------\n        n_splits : int\n            The number of splitting iterations in the cross-validator.\n        \"\"\"\n        return self.n_splits\n"
  },
  {
    "path": "imblearn/model_selection/tests/__init__.py",
    "content": ""
  },
  {
    "path": "imblearn/model_selection/tests/test_split.py",
    "content": "import numpy as np\nimport pytest\nfrom sklearn.datasets import make_classification\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.metrics import make_scorer, precision_score\nfrom sklearn.model_selection import cross_validate\nfrom sklearn.utils._testing import assert_allclose\n\nfrom imblearn.model_selection import InstanceHardnessCV\n\n\n@pytest.fixture\ndef data():\n    return make_classification(\n        weights=[0.5, 0.5],\n        class_sep=0.5,\n        n_informative=3,\n        n_redundant=1,\n        flip_y=0.05,\n        n_samples=50,\n        random_state=10,\n    )\n\n\ndef test_groups_parameter_warning(data):\n    \"\"\"Test that a warning is raised when groups parameter is provided.\"\"\"\n    X, y = data\n    ih_cv = InstanceHardnessCV(estimator=LogisticRegression(), n_splits=3)\n\n    warning_msg = \"The groups parameter is ignored by InstanceHardnessCV\"\n    with pytest.warns(UserWarning, match=warning_msg):\n        list(ih_cv.split(X, y, groups=np.ones_like(y)))\n\n\ndef test_error_on_multiclass():\n    \"\"\"Test that an error is raised when the target is not binary.\"\"\"\n    X, y = make_classification(n_classes=3, n_clusters_per_class=1)\n    err_msg = \"InstanceHardnessCV only supports binary classification.\"\n    with pytest.raises(ValueError, match=err_msg):\n        next(InstanceHardnessCV(estimator=LogisticRegression()).split(X, y))\n\n\ndef test_default_params(data):\n    \"\"\"Test that the default parameters are used.\"\"\"\n    X, y = data\n    ih_cv = InstanceHardnessCV(estimator=LogisticRegression(), n_splits=3)\n    cv_result = cross_validate(\n        LogisticRegression(), X, y, cv=ih_cv, scoring=\"precision\"\n    )\n    assert_allclose(cv_result[\"test_score\"], [0.625, 0.6, 0.625], atol=1e-6, rtol=1e-6)\n\n\n@pytest.mark.parametrize(\"dtype_target\", [None, object])\ndef test_target_string_labels(data, dtype_target):\n    \"\"\"Test that the target can be a string array.\"\"\"\n    X, y 
= data\n    labels = np.array([\"a\", \"b\"], dtype=dtype_target)\n    y = labels[y]\n    ih_cv = InstanceHardnessCV(estimator=LogisticRegression(), n_splits=3)\n    cv_result = cross_validate(\n        LogisticRegression(),\n        X,\n        y,\n        cv=ih_cv,\n        scoring=make_scorer(precision_score, pos_label=\"b\"),\n    )\n    assert_allclose(cv_result[\"test_score\"], [0.625, 0.6, 0.625], atol=1e-6, rtol=1e-6)\n\n\n@pytest.mark.parametrize(\"dtype_target\", [None, object])\ndef test_target_string_pos_label(data, dtype_target):\n    \"\"\"Test that the `pos_label` parameter can be used to select the positive class.\n\n    Here, changing the `pos_label` will change the instance hardness and thus the\n    `cv_result`.\n    \"\"\"\n    X, y = data\n    labels = np.array([\"a\", \"b\"], dtype=dtype_target)\n    y = labels[y]\n    ih_cv = InstanceHardnessCV(\n        estimator=LogisticRegression(), pos_label=\"a\", n_splits=3\n    )\n    cv_result = cross_validate(\n        LogisticRegression(),\n        X,\n        y,\n        cv=ih_cv,\n        scoring=make_scorer(precision_score, pos_label=\"a\"),\n    )\n    assert_allclose(\n        cv_result[\"test_score\"], [0.666667, 0.666667, 0.4], atol=1e-6, rtol=1e-6\n    )\n\n\n@pytest.mark.parametrize(\"n_splits\", [2, 3, 4])\ndef test_n_splits(n_splits):\n    \"\"\"Test that the number of splits is correctly set.\"\"\"\n    ih_cv = InstanceHardnessCV(estimator=LogisticRegression(), n_splits=n_splits)\n    assert ih_cv.get_n_splits() == n_splits\n"
  },
  {
    "path": "imblearn/over_sampling/__init__.py",
    "content": "\"\"\"\nThe :mod:`imblearn.over_sampling` provides a set of methods to\nperform over-sampling.\n\"\"\"\n\nfrom imblearn.over_sampling._adasyn import ADASYN\nfrom imblearn.over_sampling._random_over_sampler import RandomOverSampler\nfrom imblearn.over_sampling._smote import (\n    SMOTE,\n    SMOTEN,\n    SMOTENC,\n    SVMSMOTE,\n    BorderlineSMOTE,\n    KMeansSMOTE,\n)\n\n__all__ = [\n    \"ADASYN\",\n    \"RandomOverSampler\",\n    \"KMeansSMOTE\",\n    \"SMOTE\",\n    \"BorderlineSMOTE\",\n    \"SVMSMOTE\",\n    \"SMOTENC\",\n    \"SMOTEN\",\n]\n"
  },
  {
    "path": "imblearn/over_sampling/_adasyn.py",
    "content": "\"\"\"Class to perform over-sampling using ADASYN.\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Christos Aridas\n# License: MIT\n\nimport numbers\n\nimport numpy as np\nfrom scipy import sparse\nfrom sklearn.utils import _safe_indexing, check_random_state\nfrom sklearn.utils._param_validation import HasMethods, Interval\n\nfrom imblearn.over_sampling.base import BaseOverSampler\nfrom imblearn.utils import Substitution, check_neighbors_object\nfrom imblearn.utils._docstring import _random_state_docstring\n\n\n@Substitution(\n    sampling_strategy=BaseOverSampler._sampling_strategy_docstring,\n    random_state=_random_state_docstring,\n)\nclass ADASYN(BaseOverSampler):\n    \"\"\"Oversample using Adaptive Synthetic (ADASYN) algorithm.\n\n    This method is similar to SMOTE but it generates different number of\n    samples depending on an estimate of the local distribution of the class\n    to be oversampled.\n\n    Read more in the :ref:`User Guide <smote_adasyn>`.\n\n    Parameters\n    ----------\n    {sampling_strategy}\n\n    {random_state}\n\n    n_neighbors : int or estimator object, default=5\n        The nearest neighbors used to define the neighborhood of samples to use\n        to generate the synthetic samples. You can pass:\n\n        - an `int` corresponding to the number of neighbors to use. A\n          `~sklearn.neighbors.NearestNeighbors` instance will be fitted in this\n          case.\n        - an instance of a compatible nearest neighbors algorithm that should\n          implement both methods `kneighbors` and `kneighbors_graph`. For\n          instance, it could correspond to a\n          :class:`~sklearn.neighbors.NearestNeighbors` but could be extended to\n          any compatible class.\n\n    Attributes\n    ----------\n    sampling_strategy_ : dict\n        Dictionary containing the information to sample the dataset. 
The keys\n        corresponds to the class labels from which to sample and the values\n        are the number of samples to sample.\n\n    nn_ : estimator object\n        Validated K-nearest Neighbours estimator linked to the parameter `n_neighbors`.\n\n    n_features_in_ : int\n        Number of features in the input dataset.\n\n        .. versionadded:: 0.9\n\n    feature_names_in_ : ndarray of shape (`n_features_in_`,)\n        Names of features seen during `fit`. Defined only when `X` has feature\n        names that are all strings.\n\n        .. versionadded:: 0.10\n\n    See Also\n    --------\n    SMOTE : Over-sample using SMOTE.\n\n    SMOTENC : Over-sample using SMOTE for continuous and categorical features.\n\n    SMOTEN : Over-sample using the SMOTE variant specifically for categorical\n        features only.\n\n    SVMSMOTE : Over-sample using SVM-SMOTE variant.\n\n    BorderlineSMOTE : Over-sample using Borderline-SMOTE variant.\n\n    Notes\n    -----\n    The implementation is based on [1]_.\n\n    Supports multi-class resampling. A one-vs.-rest scheme is used.\n\n    References\n    ----------\n    .. [1] He, Haibo, Yang Bai, Edwardo A. Garcia, and Shutao Li. \"ADASYN:\n       Adaptive synthetic sampling approach for imbalanced learning,\" In IEEE\n       International Joint Conference on Neural Networks (IEEE World Congress\n       on Computational Intelligence), pp. 1322-1328, 2008.\n\n    Examples\n    --------\n    >>> from collections import Counter\n    >>> from sklearn.datasets import make_classification\n    >>> from imblearn.over_sampling import ADASYN\n    >>> X, y = make_classification(n_classes=2, class_sep=2,\n    ... weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,\n    ... n_features=20, n_clusters_per_class=1, n_samples=1000,\n    ... 
random_state=10)\n    >>> print('Original dataset shape %s' % Counter(y))\n    Original dataset shape Counter({{1: 900, 0: 100}})\n    >>> ada = ADASYN(random_state=42)\n    >>> X_res, y_res = ada.fit_resample(X, y)\n    >>> print('Resampled dataset shape %s' % Counter(y_res))\n    Resampled dataset shape Counter({{0: 904, 1: 900}})\n    \"\"\"\n\n    _parameter_constraints: dict = {\n        **BaseOverSampler._parameter_constraints,\n        \"n_neighbors\": [\n            Interval(numbers.Integral, 1, None, closed=\"left\"),\n            HasMethods([\"kneighbors\", \"kneighbors_graph\"]),\n        ],\n    }\n\n    def __init__(\n        self,\n        *,\n        sampling_strategy=\"auto\",\n        random_state=None,\n        n_neighbors=5,\n    ):\n        super().__init__(sampling_strategy=sampling_strategy)\n        self.random_state = random_state\n        self.n_neighbors = n_neighbors\n\n    def _validate_estimator(self):\n        \"\"\"Create the necessary objects for ADASYN\"\"\"\n        self.nn_ = check_neighbors_object(\n            \"n_neighbors\", self.n_neighbors, additional_neighbor=1\n        )\n\n    def _fit_resample(self, X, y):\n        self._validate_estimator()\n        random_state = check_random_state(self.random_state)\n\n        X_resampled = [X.copy()]\n        y_resampled = [y.copy()]\n\n        for class_sample, n_samples in self.sampling_strategy_.items():\n            if n_samples == 0:\n                continue\n            target_class_indices = np.flatnonzero(y == class_sample)\n            X_class = _safe_indexing(X, target_class_indices)\n\n            self.nn_.fit(X)\n            nns = self.nn_.kneighbors(X_class, return_distance=False)[:, 1:]\n            # The ratio is computed using a one-vs-rest manner. 
Using majority\n            # in multi-class would lead to slightly different results at the\n            # cost of introducing a new parameter.\n            n_neighbors = self.nn_.n_neighbors - 1\n            ratio_nn = np.sum(y[nns] != class_sample, axis=1) / n_neighbors\n            if not np.sum(ratio_nn):\n                raise RuntimeError(\n                    \"Not any neigbours belong to the majority\"\n                    \" class. This case will induce a NaN case\"\n                    \" with a division by zero. ADASYN is not\"\n                    \" suited for this specific dataset.\"\n                    \" Use SMOTE instead.\"\n                )\n            ratio_nn /= np.sum(ratio_nn)\n            n_samples_generate = np.rint(ratio_nn * n_samples).astype(int)\n            # rounding may cause new amount for n_samples\n            n_samples = np.sum(n_samples_generate)\n            if not n_samples:\n                raise ValueError(\n                    \"No samples will be generated with the provided ratio settings.\"\n                )\n\n            # the nearest neighbors need to be fitted only on the current class\n            # to find the class NN to generate new samples\n            self.nn_.fit(X_class)\n            nns = self.nn_.kneighbors(X_class, return_distance=False)[:, 1:]\n\n            enumerated_class_indices = np.arange(len(target_class_indices))\n            rows = np.repeat(enumerated_class_indices, n_samples_generate)\n            cols = random_state.choice(n_neighbors, size=n_samples)\n            diffs = X_class[nns[rows, cols]] - X_class[rows]\n            steps = random_state.uniform(size=(n_samples, 1))\n\n            if sparse.issparse(X):\n                sparse_func = type(X).__name__\n                steps = getattr(sparse, sparse_func)(steps)\n                X_new = X_class[rows] + steps.multiply(diffs)\n            else:\n                X_new = X_class[rows] + steps * diffs\n\n            X_new = 
X_new.astype(X.dtype)\n            y_new = np.full(n_samples, fill_value=class_sample, dtype=y.dtype)\n            X_resampled.append(X_new)\n            y_resampled.append(y_new)\n\n        if sparse.issparse(X):\n            X_resampled = sparse.vstack(X_resampled, format=X.format)\n        else:\n            X_resampled = np.vstack(X_resampled)\n        y_resampled = np.hstack(y_resampled)\n\n        return X_resampled, y_resampled\n\n    def _more_tags(self):\n        return {\n            \"X_types\": [\"2darray\"],\n        }\n\n    def __sklearn_tags__(self):\n        tags = super().__sklearn_tags__()\n        return tags\n"
  },
  {
    "path": "imblearn/over_sampling/_random_over_sampler.py",
    "content": "\"\"\"Class to perform random over-sampling.\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Christos Aridas\n# License: MIT\n\nfrom collections.abc import Mapping\nfrom numbers import Real\n\nimport numpy as np\nfrom scipy import sparse\nfrom sklearn.utils import _safe_indexing, check_array, check_random_state\nfrom sklearn.utils._param_validation import Interval\nfrom sklearn.utils.sparsefuncs import mean_variance_axis\nfrom sklearn_compat.utils.validation import validate_data\n\nfrom imblearn.over_sampling.base import BaseOverSampler\nfrom imblearn.utils import Substitution, check_target_type\nfrom imblearn.utils._docstring import _random_state_docstring\nfrom imblearn.utils._validation import _check_X\n\n\n@Substitution(\n    sampling_strategy=BaseOverSampler._sampling_strategy_docstring,\n    random_state=_random_state_docstring,\n)\nclass RandomOverSampler(BaseOverSampler):\n    \"\"\"Class to perform random over-sampling.\n\n    Object to over-sample the minority class(es) by picking samples at random\n    with replacement. The bootstrap can be generated in a smoothed manner.\n\n    Read more in the :ref:`User Guide <random_over_sampler>`.\n\n    Parameters\n    ----------\n    {sampling_strategy}\n\n    {random_state}\n\n    shrinkage : float or dict, default=None\n        Parameter controlling the shrinkage applied to the covariance matrix\n        when a smoothed bootstrap is generated. The options are:\n\n        - if `None`, a normal bootstrap will be generated without perturbation.\n          It is equivalent to `shrinkage=0` as well;\n        - if a `float` is given, the shrinkage factor will be used for all\n          classes to generate the smoothed bootstrap;\n        - if a `dict` is given, the shrinkage factor will be specific to each\n          class. 
The key corresponds to the targeted class and the value is\n          the shrinkage factor.\n\n        The value of the shrinkage parameter needs to be greater than or equal\n        to 0.\n\n        .. versionadded:: 0.8\n\n    Attributes\n    ----------\n    sampling_strategy_ : dict\n        Dictionary containing the information to sample the dataset. The keys\n        correspond to the class labels from which to sample and the values\n        are the number of samples to sample.\n\n    sample_indices_ : ndarray of shape (n_new_samples,)\n        Indices of the samples selected.\n\n        .. versionadded:: 0.4\n\n    shrinkage_ : dict or None\n        The per-class shrinkage factor used to generate the smoothed bootstrap\n        sample. When `shrinkage=None` a normal bootstrap will be generated.\n\n        .. versionadded:: 0.8\n\n    n_features_in_ : int\n        Number of features in the input dataset.\n\n        .. versionadded:: 0.9\n\n    feature_names_in_ : ndarray of shape (`n_features_in_`,)\n        Names of features seen during `fit`. Defined only when `X` has feature\n        names that are all strings.\n\n        .. 
versionadded:: 0.10\n\n    See Also\n    --------\n    BorderlineSMOTE : Over-sample using the borderline-SMOTE variant.\n\n    SMOTE : Over-sample using SMOTE.\n\n    SMOTENC : Over-sample using SMOTE for continuous and categorical features.\n\n    SMOTEN : Over-sample using the SMOTE variant specifically for categorical\n        features only.\n\n    SVMSMOTE : Over-sample using SVM-SMOTE variant.\n\n    ADASYN : Over-sample using ADASYN.\n\n    KMeansSMOTE : Over-sample applying a clustering before to oversample using\n        SMOTE.\n\n    Notes\n    -----\n    Supports multi-class resampling by sampling each class independently.\n    Supports heterogeneous data as object array containing string and numeric\n    data.\n\n    When generating a smoothed bootstrap, this method is also known as Random\n    Over-Sampling Examples (ROSE) [1]_.\n\n    .. warning::\n       Since smoothed bootstrap are generated by adding a small perturbation\n       to the drawn samples, this method is not adequate when working with\n       sparse matrices.\n\n    References\n    ----------\n    .. [1] G Menardi, N. Torelli, \"Training and assessing classification\n       rules with imbalanced data,\" Data Mining and Knowledge\n       Discovery, 28(1), pp.92-122, 2014.\n\n    Examples\n    --------\n    >>> from collections import Counter\n    >>> from sklearn.datasets import make_classification\n    >>> from imblearn.over_sampling import RandomOverSampler\n    >>> X, y = make_classification(n_classes=2, class_sep=2,\n    ... weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,\n    ... 
n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)\n    >>> print('Original dataset shape %s' % Counter(y))\n    Original dataset shape Counter({{1: 900, 0: 100}})\n    >>> ros = RandomOverSampler(random_state=42)\n    >>> X_res, y_res = ros.fit_resample(X, y)\n    >>> print('Resampled dataset shape %s' % Counter(y_res))\n    Resampled dataset shape Counter({{0: 900, 1: 900}})\n    \"\"\"\n\n    _parameter_constraints: dict = {\n        **BaseOverSampler._parameter_constraints,\n        \"shrinkage\": [Interval(Real, 0, None, closed=\"left\"), dict, None],\n    }\n\n    def __init__(\n        self,\n        *,\n        sampling_strategy=\"auto\",\n        random_state=None,\n        shrinkage=None,\n    ):\n        super().__init__(sampling_strategy=sampling_strategy)\n        self.random_state = random_state\n        self.shrinkage = shrinkage\n\n    def _check_X_y(self, X, y):\n        y, binarize_y = check_target_type(y, indicate_one_vs_all=True)\n        X = _check_X(X)\n        validate_data(self, X=X, y=y, reset=True, skip_check_array=True)\n        return X, y, binarize_y\n\n    def _fit_resample(self, X, y):\n        random_state = check_random_state(self.random_state)\n\n        if isinstance(self.shrinkage, Real):\n            self.shrinkage_ = {\n                klass: self.shrinkage for klass in self.sampling_strategy_\n            }\n        elif self.shrinkage is None or isinstance(self.shrinkage, Mapping):\n            self.shrinkage_ = self.shrinkage\n\n        if self.shrinkage_ is not None:\n            missing_shrinkage_keys = (\n                self.sampling_strategy_.keys() - self.shrinkage_.keys()\n            )\n            if missing_shrinkage_keys:\n                raise ValueError(\n                    \"`shrinkage` should contain a shrinkage factor for \"\n                    \"each class that will be resampled. 
The missing \"\n                    f\"classes are: {repr(missing_shrinkage_keys)}\"\n                )\n\n            for klass, shrink_factor in self.shrinkage_.items():\n                if shrink_factor < 0:\n                    raise ValueError(\n                        \"The shrinkage factor needs to be >= 0. \"\n                        f\"Got {shrink_factor} for class {klass}.\"\n                    )\n\n            # smoothed bootstrap imposes to make numerical operation; we need\n            # to be sure to have only numerical data in X\n            try:\n                X = check_array(X, accept_sparse=[\"csr\", \"csc\"], dtype=\"numeric\")\n            except ValueError as exc:\n                raise ValueError(\n                    \"When shrinkage is not None, X needs to contain only \"\n                    \"numerical data to later generate a smoothed bootstrap \"\n                    \"sample.\"\n                ) from exc\n\n        X_resampled = [X.copy()]\n        y_resampled = [y.copy()]\n\n        sample_indices = range(X.shape[0])\n        for class_sample, num_samples in self.sampling_strategy_.items():\n            target_class_indices = np.flatnonzero(y == class_sample)\n            bootstrap_indices = random_state.choice(\n                target_class_indices,\n                size=num_samples,\n                replace=True,\n            )\n            sample_indices = np.append(sample_indices, bootstrap_indices)\n            if self.shrinkage_ is not None:\n                # generate a smoothed bootstrap with a perturbation\n                n_samples, n_features = X.shape\n                smoothing_constant = (4 / ((n_features + 2) * n_samples)) ** (\n                    1 / (n_features + 4)\n                )\n                if sparse.issparse(X):\n                    _, X_class_variance = mean_variance_axis(\n                        X[target_class_indices, :],\n                        axis=0,\n                    )\n                    
X_class_scale = np.sqrt(X_class_variance, out=X_class_variance)\n                else:\n                    X_class_scale = np.std(X[target_class_indices, :], axis=0)\n                smoothing_matrix = np.diagflat(\n                    self.shrinkage_[class_sample] * smoothing_constant * X_class_scale\n                )\n                X_new = random_state.randn(num_samples, n_features)\n                X_new = X_new.dot(smoothing_matrix) + X[bootstrap_indices, :]\n                if sparse.issparse(X):\n                    X_new = sparse.csr_matrix(X_new, dtype=X.dtype)\n                X_resampled.append(X_new)\n            else:\n                # generate a bootstrap\n                X_resampled.append(_safe_indexing(X, bootstrap_indices))\n\n            y_resampled.append(_safe_indexing(y, bootstrap_indices))\n\n        self.sample_indices_ = np.array(sample_indices)\n\n        if sparse.issparse(X):\n            X_resampled = sparse.vstack(X_resampled, format=X.format)\n        else:\n            X_resampled = np.vstack(X_resampled)\n        y_resampled = np.hstack(y_resampled)\n\n        return X_resampled, y_resampled\n\n    def _more_tags(self):\n        return {\n            \"X_types\": [\"2darray\", \"string\", \"sparse\", \"dataframe\"],\n            \"sample_indices\": True,\n            \"allow_nan\": True,\n            \"_xfail_checks\": {\n                \"check_complex_data\": \"Robust to this type of data.\",\n            },\n        }\n\n    def __sklearn_tags__(self):\n        tags = super().__sklearn_tags__()\n        tags.input_tags.allow_nan = True\n        tags.input_tags.string = True\n        tags.sampler_tags.sample_indices = True\n        return tags\n"
  },
  {
    "path": "imblearn/over_sampling/_smote/__init__.py",
    "content": "from imblearn.over_sampling._smote.base import SMOTE, SMOTEN, SMOTENC\nfrom imblearn.over_sampling._smote.cluster import KMeansSMOTE\nfrom imblearn.over_sampling._smote.filter import SVMSMOTE, BorderlineSMOTE\n\n__all__ = [\n    \"SMOTE\",\n    \"SMOTEN\",\n    \"SMOTENC\",\n    \"KMeansSMOTE\",\n    \"BorderlineSMOTE\",\n    \"SVMSMOTE\",\n]\n"
  },
  {
    "path": "imblearn/over_sampling/_smote/base.py",
    "content": "\"\"\"Base class and original SMOTE methods for over-sampling\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Fernando Nogueira\n#          Christos Aridas\n#          Dzianis Dudnik\n# License: MIT\n\nimport math\nimport numbers\nimport warnings\n\nimport numpy as np\nfrom scipy import sparse\nfrom scipy.stats import mode\nfrom sklearn.base import clone\nfrom sklearn.exceptions import DataConversionWarning\nfrom sklearn.preprocessing import OneHotEncoder, OrdinalEncoder\nfrom sklearn.utils import (\n    _safe_indexing,\n    check_array,\n    check_random_state,\n)\nfrom sklearn.utils._param_validation import HasMethods, Interval, StrOptions\nfrom sklearn.utils.sparsefuncs_fast import (\n    csr_mean_variance_axis0,\n)\nfrom sklearn.utils.validation import _num_features\nfrom sklearn_compat.utils._dataframe import is_pandas_df\nfrom sklearn_compat.utils._indexing import _get_column_indices\nfrom sklearn_compat.utils.validation import validate_data\n\nfrom imblearn.metrics.pairwise import ValueDifferenceMetric\nfrom imblearn.over_sampling.base import BaseOverSampler\nfrom imblearn.utils import Substitution, check_neighbors_object, check_target_type\nfrom imblearn.utils._docstring import _random_state_docstring\nfrom imblearn.utils._validation import _check_X\n\n\nclass BaseSMOTE(BaseOverSampler):\n    \"\"\"Base class for the different SMOTE algorithms.\"\"\"\n\n    _parameter_constraints: dict = {\n        **BaseOverSampler._parameter_constraints,\n        \"k_neighbors\": [\n            Interval(numbers.Integral, 1, None, closed=\"left\"),\n            HasMethods([\"kneighbors\", \"kneighbors_graph\"]),\n        ],\n    }\n\n    def __init__(\n        self,\n        sampling_strategy=\"auto\",\n        random_state=None,\n        k_neighbors=5,\n    ):\n        super().__init__(sampling_strategy=sampling_strategy)\n        self.random_state = random_state\n        self.k_neighbors = k_neighbors\n\n    def 
_validate_estimator(self):\n        \"\"\"Check the NN estimators shared across the different SMOTE\n        algorithms.\n        \"\"\"\n        self.nn_k_ = check_neighbors_object(\n            \"k_neighbors\", self.k_neighbors, additional_neighbor=1\n        )\n\n    def _make_samples(\n        self, X, y_dtype, y_type, nn_data, nn_num, n_samples, step_size=1.0, y=None\n    ):\n        \"\"\"A support function that returns artificial samples constructed along\n        the line connecting nearest neighbours.\n\n        Parameters\n        ----------\n        X : {array-like, sparse matrix} of shape (n_samples, n_features)\n            Points from which the points will be created.\n\n        y_dtype : dtype\n            The data type of the targets.\n\n        y_type : str or int\n            The minority target value, just so the function can return the\n            target values for the synthetic variables with correct length in\n            a clear format.\n\n        nn_data : ndarray of shape (n_samples_all, n_features)\n            Data set carrying all the neighbours to be used\n\n        nn_num : ndarray of shape (n_samples_all, k_nearest_neighbours)\n            The nearest neighbours of each sample in `nn_data`.\n\n        n_samples : int\n            The number of samples to generate.\n\n        step_size : float, default=1.0\n            The step size to create samples.\n\n        y : ndarray of shape (n_samples_all,), default=None\n            The true target associated with `nn_data`. 
Used by Borderline SMOTE-2 to\n            weight the distances in the sample generation process.\n\n        Returns\n        -------\n        X_new : {ndarray, sparse matrix} of shape (n_samples_new, n_features)\n            Synthetically generated samples.\n\n        y_new : ndarray of shape (n_samples_new,)\n            Target values for synthetic samples.\n        \"\"\"\n        random_state = check_random_state(self.random_state)\n        samples_indices = random_state.randint(low=0, high=nn_num.size, size=n_samples)\n\n        # np.newaxis for backwards compatibility with random_state\n        steps = step_size * random_state.uniform(size=n_samples)[:, np.newaxis]\n        rows = np.floor_divide(samples_indices, nn_num.shape[1])\n        cols = np.mod(samples_indices, nn_num.shape[1])\n\n        X_new = self._generate_samples(X, nn_data, nn_num, rows, cols, steps, y_type, y)\n        y_new = np.full(n_samples, fill_value=y_type, dtype=y_dtype)\n        return X_new, y_new\n\n    def _generate_samples(\n        self, X, nn_data, nn_num, rows, cols, steps, y_type=None, y=None\n    ):\n        r\"\"\"Generate a synthetic sample.\n\n        The rule for the generation is:\n\n        .. 
math::\n           \\mathbf{s_{s}} = \\mathbf{s_{i}} + \\mathcal{u}(0, 1) \\times\n           (\\mathbf{s_{nn}} - \\mathbf{s_{i}}) \\,\n\n        where \\mathbf{s_{s}} is the new synthetic sample, \\mathbf{s_{i}} is\n        the current sample, \\mathbf{s_{nn}} is a randomly selected neighbor of\n        \\mathbf{s_{i}} and \\mathcal{u}(0, 1) is a random number in [0, 1).\n\n        Parameters\n        ----------\n        X : {array-like, sparse matrix} of shape (n_samples, n_features)\n            Points from which the new samples will be created.\n\n        nn_data : ndarray of shape (n_samples_all, n_features)\n            Data set carrying all the neighbours to be used.\n\n        nn_num : ndarray of shape (n_samples_all, k_nearest_neighbours)\n            The nearest neighbours of each sample in `nn_data`.\n\n        rows : ndarray of shape (n_samples,), dtype=int\n            Indices pointing at feature vector in X which will be used\n            as a base for creating new samples.\n\n        cols : ndarray of shape (n_samples,), dtype=int\n            Indices pointing at which nearest neighbor of base feature vector\n            will be used when creating new samples.\n\n        steps : ndarray of shape (n_samples,), dtype=float\n            Step sizes for new samples.\n\n        y_type : str, int or None, default=None\n            Class label of the current target class for which we want to generate\n            samples.\n\n        y : ndarray of shape (n_samples_all,), default=None\n            The true target associated with `nn_data`. 
Used by Borderline SMOTE-2 to\n            weight the distances in the sample generation process.\n\n        Returns\n        -------\n        X_new : {ndarray, sparse matrix} of shape (n_samples, n_features)\n            Synthetically generated samples.\n        \"\"\"\n        diffs = nn_data[nn_num[rows, cols]] - X[rows]\n        if y is not None:  # only entering for BorderlineSMOTE-2\n            random_state = check_random_state(self.random_state)\n            mask_pair_samples = y[nn_num[rows, cols]] != y_type\n            diffs[mask_pair_samples] *= random_state.uniform(\n                low=0.0, high=0.5, size=(mask_pair_samples.sum(), 1)\n            )\n\n        if sparse.issparse(X):\n            sparse_func = type(X).__name__\n            steps = getattr(sparse, sparse_func)(steps)\n            X_new = X[rows] + steps.multiply(diffs)\n        else:\n            X_new = X[rows] + steps * diffs\n\n        return X_new.astype(X.dtype)\n\n    def _in_danger_noise(self, nn_estimator, samples, target_class, y, kind=\"danger\"):\n        \"\"\"Estimate whether a set of samples is in danger or noise.\n\n        Used by BorderlineSMOTE and SVMSMOTE.\n\n        Parameters\n        ----------\n        nn_estimator : estimator object\n            An estimator that inherits from\n            :class:`~sklearn.neighbors.base.KNeighborsMixin` used to determine\n            if a sample is in danger/noise.\n\n        samples : {array-like, sparse matrix} of shape (n_samples, n_features)\n            The samples to check for being in danger or noise.\n\n        target_class : int or str\n            The class being over-sampled.\n\n        y : array-like of shape (n_samples,)\n            The true labels, used to check the neighbour labels.\n\n        kind : {'danger', 'noise'}, default='danger'\n            The type of classification to use. 
Can be either:\n\n            - If 'danger', check if samples are in danger,\n            - If 'noise', check if samples are noise.\n\n        Returns\n        -------\n        output : ndarray of shape (n_samples,)\n            A boolean array where True refer to samples in danger or noise.\n        \"\"\"\n        x = nn_estimator.kneighbors(samples, return_distance=False)[:, 1:]\n        nn_label = (y[x] != target_class).astype(int)\n        n_maj = np.sum(nn_label, axis=1)\n\n        if kind == \"danger\":\n            # Samples are in danger for m/2 <= m' < m\n            return np.bitwise_and(\n                n_maj >= (nn_estimator.n_neighbors - 1) / 2,\n                n_maj < nn_estimator.n_neighbors - 1,\n            )\n        else:  # kind == \"noise\":\n            # Samples are noise for m = m'\n            return n_maj == nn_estimator.n_neighbors - 1\n\n\n@Substitution(\n    sampling_strategy=BaseOverSampler._sampling_strategy_docstring,\n    random_state=_random_state_docstring,\n)\nclass SMOTE(BaseSMOTE):\n    \"\"\"Class to perform over-sampling using SMOTE.\n\n    This object is an implementation of SMOTE - Synthetic Minority\n    Over-sampling Technique as presented in [1]_.\n\n    Read more in the :ref:`User Guide <smote_adasyn>`.\n\n    Parameters\n    ----------\n    {sampling_strategy}\n\n    {random_state}\n\n    k_neighbors : int or object, default=5\n        The nearest neighbors used to define the neighborhood of samples to use\n        to generate the synthetic samples. You can pass:\n\n        - an `int` corresponding to the number of neighbors to use. A\n          `~sklearn.neighbors.NearestNeighbors` instance will be fitted in this\n          case.\n        - an instance of a compatible nearest neighbors algorithm that should\n          implement both methods `kneighbors` and `kneighbors_graph`. 
For\n          instance, it could correspond to a\n          :class:`~sklearn.neighbors.NearestNeighbors` but could be extended to\n          any compatible class.\n\n    Attributes\n    ----------\n    sampling_strategy_ : dict\n        Dictionary containing the information to sample the dataset. The keys\n        corresponds to the class labels from which to sample and the values\n        are the number of samples to sample.\n\n    nn_k_ : estimator object\n        Validated k-nearest neighbours created from the `k_neighbors` parameter.\n\n    n_features_in_ : int\n        Number of features in the input dataset.\n\n        .. versionadded:: 0.9\n\n    feature_names_in_ : ndarray of shape (`n_features_in_`,)\n        Names of features seen during `fit`. Defined only when `X` has feature\n        names that are all strings.\n\n        .. versionadded:: 0.10\n\n    See Also\n    --------\n    SMOTENC : Over-sample using SMOTE for continuous and categorical features.\n\n    SMOTEN : Over-sample using the SMOTE variant specifically for categorical\n        features only.\n\n    BorderlineSMOTE : Over-sample using the borderline-SMOTE variant.\n\n    SVMSMOTE : Over-sample using the SVM-SMOTE variant.\n\n    ADASYN : Over-sample using ADASYN.\n\n    KMeansSMOTE : Over-sample applying a clustering before to oversample using\n        SMOTE.\n\n    Notes\n    -----\n    See the original papers: [1]_ for more details.\n\n    Supports multi-class resampling. A one-vs.-rest scheme is used as\n    originally proposed in [1]_.\n\n    References\n    ----------\n    .. [1] N. V. Chawla, K. W. Bowyer, L. O.Hall, W. P. 
Kegelmeyer, \"SMOTE:\n       synthetic minority over-sampling technique,\" Journal of artificial\n       intelligence research, 321-357, 2002.\n\n    Examples\n    --------\n    >>> from collections import Counter\n    >>> from sklearn.datasets import make_classification\n    >>> from imblearn.over_sampling import SMOTE\n    >>> X, y = make_classification(n_classes=2, class_sep=2,\n    ... weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,\n    ... n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)\n    >>> print('Original dataset shape %s' % Counter(y))\n    Original dataset shape Counter({{1: 900, 0: 100}})\n    >>> sm = SMOTE(random_state=42)\n    >>> X_res, y_res = sm.fit_resample(X, y)\n    >>> print('Resampled dataset shape %s' % Counter(y_res))\n    Resampled dataset shape Counter({{0: 900, 1: 900}})\n    \"\"\"\n\n    def __init__(\n        self,\n        *,\n        sampling_strategy=\"auto\",\n        random_state=None,\n        k_neighbors=5,\n    ):\n        super().__init__(\n            sampling_strategy=sampling_strategy,\n            random_state=random_state,\n            k_neighbors=k_neighbors,\n        )\n\n    def _fit_resample(self, X, y):\n        self._validate_estimator()\n\n        X_resampled = [X.copy()]\n        y_resampled = [y.copy()]\n\n        for class_sample, n_samples in self.sampling_strategy_.items():\n            if n_samples == 0:\n                continue\n            target_class_indices = np.flatnonzero(y == class_sample)\n            X_class = _safe_indexing(X, target_class_indices)\n\n            self.nn_k_.fit(X_class)\n            nns = self.nn_k_.kneighbors(X_class, return_distance=False)[:, 1:]\n            X_new, y_new = self._make_samples(\n                X_class, y.dtype, class_sample, X_class, nns, n_samples, 1.0\n            )\n            X_resampled.append(X_new)\n            y_resampled.append(y_new)\n\n        if sparse.issparse(X):\n            X_resampled = 
sparse.vstack(X_resampled, format=X.format)\n        else:\n            X_resampled = np.vstack(X_resampled)\n        y_resampled = np.hstack(y_resampled)\n\n        return X_resampled, y_resampled\n\n\n@Substitution(\n    sampling_strategy=BaseOverSampler._sampling_strategy_docstring,\n    random_state=_random_state_docstring,\n)\nclass SMOTENC(SMOTE):\n    \"\"\"Synthetic Minority Over-sampling Technique for Nominal and Continuous.\n\n    Unlike :class:`SMOTE`, SMOTE-NC is designed for datasets containing both\n    numerical and categorical features. However, it is not designed to work\n    with only categorical features.\n\n    Read more in the :ref:`User Guide <smote_adasyn>`.\n\n    .. versionadded:: 0.4\n\n    Parameters\n    ----------\n    categorical_features : \"auto\" or array-like of shape (n_cat_features,) or \\\n            (n_features,), dtype={{bool, int, str}}\n        Specifies which features are categorical. Can either be:\n\n        - \"auto\" to automatically detect categorical features. Only\n          supported when `X` is a :class:`pandas.DataFrame` and it corresponds\n          to columns that have a :class:`pandas.CategoricalDtype`;\n        - array of `int` corresponding to the indices specifying the categorical\n          features;\n        - array of `str` corresponding to the feature names. `X` should be a\n          :class:`pandas.DataFrame` in this case;\n        - mask array of shape (n_features, ) and ``bool`` dtype for which\n          ``True`` indicates the categorical features.\n\n    categorical_encoder : estimator, default=None\n        One-hot encoder used to encode the categorical features. 
If `None`, a\n        :class:`~sklearn.preprocessing.OneHotEncoder` is used with default parameters\n        apart from `handle_unknown` which is set to 'ignore'.\n\n    {sampling_strategy}\n\n    {random_state}\n\n    k_neighbors : int or object, default=5\n        The nearest neighbors used to define the neighborhood of samples to use\n        to generate the synthetic samples. You can pass:\n\n        - an `int` corresponding to the number of neighbors to use. A\n          `~sklearn.neighbors.NearestNeighbors` instance will be fitted in this\n          case.\n        - an instance of a compatible nearest neighbors algorithm that should\n          implement both methods `kneighbors` and `kneighbors_graph`. For\n          instance, it could correspond to a\n          :class:`~sklearn.neighbors.NearestNeighbors` but could be extended to\n          any compatible class.\n\n    Attributes\n    ----------\n    sampling_strategy_ : dict\n        Dictionary containing the information to sample the dataset. The keys\n        corresponds to the class labels from which to sample and the values\n        are the number of samples to sample.\n\n    nn_k_ : estimator object\n        Validated k-nearest neighbours created from the `k_neighbors` parameter.\n\n    categorical_encoder_ : estimator\n        The encoder used to encode the categorical features.\n\n    categorical_features_ : ndarray of shape (n_cat_features,), dtype=np.int64\n        Indices of the categorical features.\n\n    continuous_features_ : ndarray of shape (n_cont_features,), dtype=np.int64\n        Indices of the continuous features.\n\n    median_std_ : dict of int -> float\n        Median of the standard deviation of the continuous features for each\n        class to be over-sampled.\n\n    n_features_ : int\n        Number of features observed at `fit`.\n\n    n_features_in_ : int\n        Number of features in the input dataset.\n\n        .. 
versionadded:: 0.9\n\n    feature_names_in_ : ndarray of shape (`n_features_in_`,)\n        Names of features seen during `fit`. Defined only when `X` has feature\n        names that are all strings.\n\n        .. versionadded:: 0.10\n\n    See Also\n    --------\n    SMOTE : Over-sample using SMOTE.\n\n    SMOTEN : Over-sample using the SMOTE variant specifically for categorical\n        features only.\n\n    SVMSMOTE : Over-sample using SVM-SMOTE variant.\n\n    BorderlineSMOTE : Over-sample using Borderline-SMOTE variant.\n\n    ADASYN : Over-sample using ADASYN.\n\n    KMeansSMOTE : Over-sample applying a clustering before to oversample using\n        SMOTE.\n\n    Notes\n    -----\n    See the original paper [1]_ for more details.\n\n    Supports multi-class resampling. A one-vs.-rest scheme is used as\n    originally proposed in [1]_.\n\n    See\n    :ref:`sphx_glr_auto_examples_over-sampling_plot_comparison_over_sampling.py`,\n    and\n    :ref:`sphx_glr_auto_examples_over-sampling_plot_illustration_generation_sample.py`.\n\n    References\n    ----------\n    .. [1] N. V. Chawla, K. W. Bowyer, L. O.Hall, W. P. Kegelmeyer, \"SMOTE:\n       synthetic minority over-sampling technique,\" Journal of artificial\n       intelligence research, 321-357, 2002.\n\n    Examples\n    --------\n    >>> from collections import Counter\n    >>> from numpy.random import RandomState\n    >>> from sklearn.datasets import make_classification\n    >>> from imblearn.over_sampling import SMOTENC\n    >>> X, y = make_classification(n_classes=2, class_sep=2,\n    ... weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,\n    ... 
n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)\n    >>> print(f'Original dataset shape {{X.shape}}')\n    Original dataset shape (1000, 20)\n    >>> print(f'Original dataset samples per class {{Counter(y)}}')\n    Original dataset samples per class Counter({{1: 900, 0: 100}})\n    >>> # simulate the 2 last columns to be categorical features\n    >>> X[:, -2:] = RandomState(10).randint(0, 4, size=(1000, 2))\n    >>> sm = SMOTENC(random_state=42, categorical_features=[18, 19])\n    >>> X_res, y_res = sm.fit_resample(X, y)\n    >>> print(f'Resampled dataset samples per class {{Counter(y_res)}}')\n    Resampled dataset samples per class Counter({{0: 900, 1: 900}})\n    \"\"\"\n\n    _required_parameters = [\"categorical_features\"]\n\n    _parameter_constraints: dict = {\n        **SMOTE._parameter_constraints,\n        \"categorical_features\": [\"array-like\", StrOptions({\"auto\"})],\n        \"categorical_encoder\": [\n            HasMethods([\"fit_transform\", \"inverse_transform\"]),\n            None,\n        ],\n    }\n\n    def __init__(\n        self,\n        categorical_features,\n        *,\n        categorical_encoder=None,\n        sampling_strategy=\"auto\",\n        random_state=None,\n        k_neighbors=5,\n    ):\n        super().__init__(\n            sampling_strategy=sampling_strategy,\n            random_state=random_state,\n            k_neighbors=k_neighbors,\n        )\n        self.categorical_features = categorical_features\n        self.categorical_encoder = categorical_encoder\n\n    def _check_X_y(self, X, y):\n        \"\"\"Overwrite the checking to let pass some string for categorical\n        features.\n        \"\"\"\n        y, binarize_y = check_target_type(y, indicate_one_vs_all=True)\n        X = _check_X(X)\n        validate_data(self, X=X, y=y, reset=True, skip_check_array=True)\n        return X, y, binarize_y\n\n    def _validate_column_types(self, X):\n        \"\"\"Compute the indices of the 
categorical and continuous features.\"\"\"\n        if self.categorical_features == \"auto\":\n            if not is_pandas_df(X):\n                raise ValueError(\n                    \"When `categorical_features='auto'`, the input data \"\n                    f\"should be a pandas.DataFrame. Got {type(X)} instead.\"\n                )\n            import pandas as pd  # safely import pandas now\n\n            are_columns_categorical = np.array(\n                [isinstance(col_dtype, pd.CategoricalDtype) for col_dtype in X.dtypes]\n            )\n            self.categorical_features_ = np.flatnonzero(are_columns_categorical)\n            self.continuous_features_ = np.flatnonzero(~are_columns_categorical)\n        else:\n            self.categorical_features_ = np.array(\n                _get_column_indices(X, self.categorical_features)\n            )\n            self.continuous_features_ = np.setdiff1d(\n                np.arange(self.n_features_), self.categorical_features_\n            )\n\n    def _validate_estimator(self):\n        super()._validate_estimator()\n        if self.categorical_features_.size == self.n_features_in_:\n            raise ValueError(\n                \"SMOTE-NC is not designed to work only with categorical \"\n                \"features. It requires some numerical features.\"\n            )\n        elif self.categorical_features_.size == 0:\n            raise ValueError(\n                \"SMOTE-NC is not designed to work only with numerical \"\n                \"features. 
It requires some categorical features.\"\n            )\n\n    def _fit_resample(self, X, y):\n        self.n_features_ = _num_features(X)\n        self._validate_column_types(X)\n        self._validate_estimator()\n\n        X_continuous = _safe_indexing(X, self.continuous_features_, axis=1)\n        X_continuous = check_array(X_continuous, accept_sparse=[\"csr\", \"csc\"])\n        X_categorical = _safe_indexing(X, self.categorical_features_, axis=1)\n        if X_continuous.dtype.name != \"object\":\n            dtype_ohe = X_continuous.dtype\n        else:\n            dtype_ohe = np.float64\n\n        if self.categorical_encoder is None:\n            self.categorical_encoder_ = OneHotEncoder(\n                handle_unknown=\"ignore\", dtype=dtype_ohe\n            )\n        else:\n            self.categorical_encoder_ = clone(self.categorical_encoder)\n\n        # the input of the OneHotEncoder needs to be dense\n        X_ohe = self.categorical_encoder_.fit_transform(\n            X_categorical.toarray() if sparse.issparse(X_categorical) else X_categorical\n        )\n        if not sparse.issparse(X_ohe):\n            X_ohe = sparse.csr_matrix(X_ohe, dtype=dtype_ohe)\n\n        X_encoded = sparse.hstack((X_continuous, X_ohe), format=\"csr\", dtype=dtype_ohe)\n        X_resampled = [X_encoded.copy()]\n        y_resampled = [y.copy()]\n\n        # SMOTE resampling starts here\n        self.median_std_ = {}\n        for class_sample, n_samples in self.sampling_strategy_.items():\n            if n_samples == 0:\n                continue\n            target_class_indices = np.flatnonzero(y == class_sample)\n            X_class = _safe_indexing(X_encoded, target_class_indices)\n\n            _, var = csr_mean_variance_axis0(\n                X_class[:, : self.continuous_features_.size]\n            )\n            self.median_std_[class_sample] = np.median(np.sqrt(var))\n\n            # In the edge case where the median of the std is equal to 0, the 1s\n           
 # entries will be also nullified. In this case, we store the original\n            # categorical encoding which will be later used for inverting the OHE\n            if math.isclose(self.median_std_[class_sample], 0):\n                # This variable will be used when generating data\n                self._X_categorical_minority_encoded = X_class[\n                    :, self.continuous_features_.size :\n                ].toarray()\n\n            # we can replace the 1 entries of the categorical features with the\n            # median of the standard deviation. It will ensure that whenever\n            # distance is computed between 2 samples, the difference will be equal\n            # to the median of the standard deviation as in the original paper.\n            X_class_categorical = X_class[:, self.continuous_features_.size :]\n            # With one-hot encoding, the median will be repeated twice. We need\n            # to divide by sqrt(2) such that we only have one median value\n            # contributing to the Euclidean distance\n            X_class_categorical.data[:] = self.median_std_[class_sample] / np.sqrt(2)\n            X_class[:, self.continuous_features_.size :] = X_class_categorical\n\n            self.nn_k_.fit(X_class)\n            nns = self.nn_k_.kneighbors(X_class, return_distance=False)[:, 1:]\n            X_new, y_new = self._make_samples(\n                X_class, y.dtype, class_sample, X_class, nns, n_samples, 1.0\n            )\n            X_resampled.append(X_new)\n            y_resampled.append(y_new)\n\n        X_resampled = sparse.vstack(X_resampled, format=X_encoded.format)\n        y_resampled = np.hstack(y_resampled)\n        # SMOTE resampling ends here\n\n        # reverse the encoding of the categorical features\n        X_res_cat = X_resampled[:, self.continuous_features_.size :]\n        X_res_cat.data = np.ones_like(X_res_cat.data)\n        X_res_cat_dec = self.categorical_encoder_.inverse_transform(X_res_cat)\n\n        
if sparse.issparse(X):\n            X_resampled = sparse.hstack(\n                (\n                    X_resampled[:, : self.continuous_features_.size],\n                    X_res_cat_dec,\n                ),\n                format=\"csr\",\n            )\n        else:\n            X_resampled = np.hstack(\n                (\n                    X_resampled[:, : self.continuous_features_.size].toarray(),\n                    X_res_cat_dec,\n                )\n            )\n\n        indices_reordered = np.argsort(\n            np.hstack((self.continuous_features_, self.categorical_features_))\n        )\n        if sparse.issparse(X_resampled):\n            # the matrix is supposed to be in the CSR format after the stacking\n            col_indices = X_resampled.indices.copy()\n            for idx, col_idx in enumerate(indices_reordered):\n                mask = X_resampled.indices == col_idx\n                col_indices[mask] = idx\n            X_resampled.indices = col_indices\n        else:\n            X_resampled = X_resampled[:, indices_reordered]\n\n        return X_resampled, y_resampled\n\n    def _generate_samples(self, X, nn_data, nn_num, rows, cols, steps, y_type, y=None):\n        \"\"\"Generate a synthetic sample with an additional steps for the\n        categorical features.\n\n        Each new sample is generated the same way than in SMOTE. 
However, the\n        categorical features are mapped to the most frequent nearest neighbors\n        of the majority class.\n        \"\"\"\n        rng = check_random_state(self.random_state)\n        X_new = super()._generate_samples(X, nn_data, nn_num, rows, cols, steps)\n        # change in sparsity structure more efficient with LIL than CSR\n        X_new = X_new.tolil() if sparse.issparse(X_new) else X_new\n\n        # convert to dense array since scipy.sparse doesn't handle 3D\n        nn_data = nn_data.toarray() if sparse.issparse(nn_data) else nn_data\n\n        # In the case that the median std was equal to zeros, we have to\n        # create non-null entry based on the encoded of OHE\n        if math.isclose(self.median_std_[y_type], 0):\n            nn_data[:, self.continuous_features_.size :] = (\n                self._X_categorical_minority_encoded\n            )\n\n        all_neighbors = nn_data[nn_num[rows]]\n\n        categories_size = [self.continuous_features_.size] + [\n            cat.size for cat in self.categorical_encoder_.categories_\n        ]\n\n        for start_idx, end_idx in zip(\n            np.cumsum(categories_size)[:-1], np.cumsum(categories_size)[1:]\n        ):\n            col_maxs = all_neighbors[:, :, start_idx:end_idx].sum(axis=1)\n            # tie breaking argmax\n            is_max = np.isclose(col_maxs, col_maxs.max(axis=1, keepdims=True))\n            max_idxs = rng.permutation(np.argwhere(is_max))\n            xs, idx_sels = np.unique(max_idxs[:, 0], return_index=True)\n            col_sels = max_idxs[idx_sels, 1]\n\n            ys = start_idx + col_sels\n            X_new[:, start_idx:end_idx] = 0\n            X_new[xs, ys] = 1\n\n        return X_new\n\n    def _more_tags(self):\n        return {\"X_types\": [\"2darray\", \"dataframe\", \"string\"]}\n\n    def __sklearn_tags__(self):\n        tags = super().__sklearn_tags__()\n        tags.input_tags.sparse = False\n        tags.input_tags.string = True\n        
return tags\n\n\n@Substitution(\n    sampling_strategy=BaseOverSampler._sampling_strategy_docstring,\n    random_state=_random_state_docstring,\n)\nclass SMOTEN(SMOTE):\n    \"\"\"Synthetic Minority Over-sampling Technique for Nominal.\n\n    This method is referred to as SMOTEN in [1]_. It expects the data to\n    resample to be made only of categorical features.\n\n    Read more in the :ref:`User Guide <smote_adasyn>`.\n\n    .. versionadded:: 0.8\n\n    Parameters\n    ----------\n    categorical_encoder : estimator, default=None\n        Ordinal encoder used to encode the categorical features. If `None`, a\n        :class:`~sklearn.preprocessing.OrdinalEncoder` with `dtype=np.int32` is used.\n\n    {sampling_strategy}\n\n    {random_state}\n\n    k_neighbors : int or object, default=5\n        The nearest neighbors used to define the neighborhood of samples to use\n        to generate the synthetic samples. You can pass:\n\n        - an `int` corresponding to the number of neighbors to use. A\n          `~sklearn.neighbors.NearestNeighbors` instance will be fitted in this\n          case.\n        - an instance of a compatible nearest neighbors algorithm that should\n          implement both methods `kneighbors` and `kneighbors_graph`. For\n          instance, it could correspond to a\n          :class:`~sklearn.neighbors.NearestNeighbors` but could be extended to\n          any compatible class.\n\n    Attributes\n    ----------\n    categorical_encoder_ : estimator\n        The encoder used to encode the categorical features.\n\n    sampling_strategy_ : dict\n        Dictionary containing the information to sample the dataset. 
The keys\n        correspond to the class labels from which to sample and the values\n        are the number of samples to sample.\n\n    nn_k_ : estimator object\n        Validated k-nearest neighbours created from the `k_neighbors` parameter.\n\n    n_features_in_ : int\n        Number of features in the input dataset.\n\n        .. versionadded:: 0.9\n\n    feature_names_in_ : ndarray of shape (`n_features_in_`,)\n        Names of features seen during `fit`. Defined only when `X` has feature\n        names that are all strings.\n\n        .. versionadded:: 0.10\n\n    See Also\n    --------\n    SMOTE : Over-sample using SMOTE.\n\n    SMOTENC : Over-sample using SMOTE for continuous and categorical features.\n\n    BorderlineSMOTE : Over-sample using the borderline-SMOTE variant.\n\n    SVMSMOTE : Over-sample using the SVM-SMOTE variant.\n\n    ADASYN : Over-sample using ADASYN.\n\n    KMeansSMOTE : Over-sample by applying clustering before oversampling with\n        SMOTE.\n\n    Notes\n    -----\n    See the original paper [1]_ for more details.\n\n    Supports multi-class resampling. A one-vs.-rest scheme is used as\n    originally proposed in [1]_.\n\n    References\n    ----------\n    .. [1] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. 
Kegelmeyer, \"SMOTE:\n       synthetic minority over-sampling technique,\" Journal of artificial\n       intelligence research, 321-357, 2002.\n\n    Examples\n    --------\n    >>> import numpy as np\n    >>> X = np.array([\"A\"] * 10 + [\"B\"] * 20 + [\"C\"] * 30, dtype=object).reshape(-1, 1)\n    >>> y = np.array([0] * 20 + [1] * 40, dtype=np.int32)\n    >>> from collections import Counter\n    >>> print(f\"Original class counts: {{Counter(y)}}\")\n    Original class counts: Counter({{1: 40, 0: 20}})\n    >>> from imblearn.over_sampling import SMOTEN\n    >>> sampler = SMOTEN(random_state=0)\n    >>> X_res, y_res = sampler.fit_resample(X, y)\n    >>> print(f\"Class counts after resampling {{Counter(y_res)}}\")\n    Class counts after resampling Counter({{0: 40, 1: 40}})\n    \"\"\"\n\n    _parameter_constraints: dict = {\n        **SMOTE._parameter_constraints,\n        \"categorical_encoder\": [\n            HasMethods([\"fit_transform\", \"inverse_transform\"]),\n            None,\n        ],\n    }\n\n    def __init__(\n        self,\n        categorical_encoder=None,\n        *,\n        sampling_strategy=\"auto\",\n        random_state=None,\n        k_neighbors=5,\n    ):\n        super().__init__(\n            sampling_strategy=sampling_strategy,\n            random_state=random_state,\n            k_neighbors=k_neighbors,\n        )\n        self.categorical_encoder = categorical_encoder\n\n    def _check_X_y(self, X, y):\n        \"\"\"Check should accept strings and not sparse matrices.\"\"\"\n        y, binarize_y = check_target_type(y, indicate_one_vs_all=True)\n        X, y = validate_data(\n            self,\n            X=X,\n            y=y,\n            reset=True,\n            dtype=None,\n            accept_sparse=[\"csr\", \"csc\"],\n        )\n        return X, y, binarize_y\n\n    def _validate_estimator(self):\n        \"\"\"Force to use precomputed distance matrix.\"\"\"\n        super()._validate_estimator()\n        
self.nn_k_.set_params(metric=\"precomputed\")\n\n    def _make_samples(self, X_class, klass, y_dtype, nn_indices, n_samples):\n        random_state = check_random_state(self.random_state)\n        # generate sample indices that will be used to generate new samples\n        samples_indices = random_state.choice(\n            np.arange(X_class.shape[0]), size=n_samples, replace=True\n        )\n        # for each drawn samples, select its k-neighbors and generate a sample\n        # where for each feature individually, each category generated is the\n        # most common category\n        X_new = np.squeeze(\n            mode(X_class[nn_indices[samples_indices]], axis=1, keepdims=True).mode,\n            axis=1,\n        )\n        y_new = np.full(n_samples, fill_value=klass, dtype=y_dtype)\n        return X_new, y_new\n\n    def _fit_resample(self, X, y):\n        if sparse.issparse(X):\n            X_sparse_format = X.format\n            X = X.toarray()\n            warnings.warn(\n                (\n                    \"Passing a sparse matrix to SMOTEN is not really efficient since it\"\n                    \" is converted to a dense array internally.\"\n                ),\n                DataConversionWarning,\n            )\n        else:\n            X_sparse_format = None\n\n        self._validate_estimator()\n\n        X_resampled = [X.copy()]\n        y_resampled = [y.copy()]\n\n        if self.categorical_encoder is None:\n            self.categorical_encoder_ = OrdinalEncoder(dtype=np.int32)\n        else:\n            self.categorical_encoder_ = clone(self.categorical_encoder)\n        X_encoded = self.categorical_encoder_.fit_transform(X)\n\n        vdm = ValueDifferenceMetric(\n            n_categories=[len(cat) for cat in self.categorical_encoder_.categories_]\n        ).fit(X_encoded, y)\n\n        for class_sample, n_samples in self.sampling_strategy_.items():\n            if n_samples == 0:\n                continue\n            
target_class_indices = np.flatnonzero(y == class_sample)\n            X_class = _safe_indexing(X_encoded, target_class_indices)\n\n            X_class_dist = vdm.pairwise(X_class)\n            self.nn_k_.fit(X_class_dist)\n            # the kneighbors search will include the sample itself, which is\n            # expected from the original algorithm\n            nn_indices = self.nn_k_.kneighbors(X_class_dist, return_distance=False)\n            X_new, y_new = self._make_samples(\n                X_class, class_sample, y.dtype, nn_indices, n_samples\n            )\n\n            X_new = self.categorical_encoder_.inverse_transform(X_new)\n            X_resampled.append(X_new)\n            y_resampled.append(y_new)\n\n        X_resampled = np.vstack(X_resampled)\n        y_resampled = np.hstack(y_resampled)\n\n        if X_sparse_format == \"csr\":\n            return sparse.csr_matrix(X_resampled), y_resampled\n        elif X_sparse_format == \"csc\":\n            return sparse.csc_matrix(X_resampled), y_resampled\n        else:\n            return X_resampled, y_resampled\n\n    def _more_tags(self):\n        return {\"X_types\": [\"2darray\", \"dataframe\", \"string\"]}\n\n    def __sklearn_tags__(self):\n        tags = super().__sklearn_tags__()\n        tags.input_tags.string = True\n        return tags\n"
  },
  {
    "path": "imblearn/over_sampling/_smote/cluster.py",
    "content": "\"\"\"SMOTE variant employing some clustering before the generation.\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Fernando Nogueira\n#          Christos Aridas\n# License: MIT\n\nimport math\nimport numbers\n\nimport numpy as np\nfrom scipy import sparse\nfrom sklearn.base import clone\nfrom sklearn.cluster import MiniBatchKMeans\nfrom sklearn.metrics import pairwise_distances\nfrom sklearn.utils import _safe_indexing\nfrom sklearn.utils._param_validation import HasMethods, Interval, StrOptions\n\nfrom imblearn.over_sampling._smote.base import BaseSMOTE\nfrom imblearn.over_sampling.base import BaseOverSampler\nfrom imblearn.utils import Substitution\nfrom imblearn.utils._docstring import _n_jobs_docstring, _random_state_docstring\n\n\n@Substitution(\n    sampling_strategy=BaseOverSampler._sampling_strategy_docstring,\n    n_jobs=_n_jobs_docstring,\n    random_state=_random_state_docstring,\n)\nclass KMeansSMOTE(BaseSMOTE):\n    \"\"\"Apply a KMeans clustering before to over-sample using SMOTE.\n\n    This is an implementation of the algorithm described in [1]_.\n\n    Read more in the :ref:`User Guide <smote_adasyn>`.\n\n    .. versionadded:: 0.5\n\n    Parameters\n    ----------\n    {sampling_strategy}\n\n    {random_state}\n\n    k_neighbors : int or object, default=2\n        The nearest neighbors used to define the neighborhood of samples to use\n        to generate the synthetic samples. You can pass:\n\n        - an `int` corresponding to the number of neighbors to use. A\n          `~sklearn.neighbors.NearestNeighbors` instance will be fitted in this\n          case.\n        - an instance of a compatible nearest neighbors algorithm that should\n          implement both methods `kneighbors` and `kneighbors_graph`. 
For\n          instance, it could correspond to a\n          :class:`~sklearn.neighbors.NearestNeighbors` but could be extended to\n          any compatible class.\n\n    {n_jobs}\n\n    kmeans_estimator : int or object, default=None\n        A KMeans instance or the number of clusters to be used. By default,\n        we used a :class:`~sklearn.cluster.MiniBatchKMeans` which tend to be\n        better with large number of samples.\n\n    cluster_balance_threshold : \"auto\" or float, default=\"auto\"\n        The threshold at which a cluster is called balanced and where samples\n        of the class selected for SMOTE will be oversampled. If \"auto\", this\n        will be determined by the ratio for each class, or it can be set\n        manually.\n\n    density_exponent : \"auto\" or float, default=\"auto\"\n        This exponent is used to determine the density of a cluster. Leaving\n        this to \"auto\" will use a feature-length based exponent.\n\n    Attributes\n    ----------\n    sampling_strategy_ : dict\n        Dictionary containing the information to sample the dataset. The keys\n        corresponds to the class labels from which to sample and the values\n        are the number of samples to sample.\n\n    kmeans_estimator_ : estimator\n        The fitted clustering method used before to apply SMOTE.\n\n    nn_k_ : estimator\n        The fitted k-NN estimator used in SMOTE.\n\n    cluster_balance_threshold_ : float\n        The threshold used during ``fit`` for calling a cluster balanced.\n\n    n_features_in_ : int\n        Number of features in the input dataset.\n\n        .. versionadded:: 0.9\n\n    feature_names_in_ : ndarray of shape (`n_features_in_`,)\n        Names of features seen during `fit`. Defined only when `X` has feature\n        names that are all strings.\n\n        .. 
versionadded:: 0.10\n\n    See Also\n    --------\n    SMOTE : Over-sample using SMOTE.\n\n    SMOTENC : Over-sample using SMOTE for continuous and categorical features.\n\n    SMOTEN : Over-sample using the SMOTE variant specifically for categorical\n        features only.\n\n    SVMSMOTE : Over-sample using SVM-SMOTE variant.\n\n    BorderlineSMOTE : Over-sample using Borderline-SMOTE variant.\n\n    ADASYN : Over-sample using ADASYN.\n\n    References\n    ----------\n    .. [1] Felix Last, Georgios Douzas, Fernando Bacao, \"Oversampling for\n       Imbalanced Learning Based on K-Means and SMOTE\"\n       https://arxiv.org/abs/1711.00837\n\n    Examples\n    --------\n    >>> import numpy as np\n    >>> from imblearn.over_sampling import KMeansSMOTE\n    >>> from sklearn.datasets import make_blobs\n    >>> blobs = [100, 800, 100]\n    >>> X, y  = make_blobs(blobs, centers=[(-10, 0), (0,0), (10, 0)], random_state=0)\n    >>> # Add a single 0 sample in the middle blob\n    >>> X = np.concatenate([X, [[0, 0]]])\n    >>> y = np.append(y, 0)\n    >>> # Make this a binary classification problem\n    >>> y = y == 1\n    >>> sm = KMeansSMOTE(\n    ...     kmeans_estimator=MiniBatchKMeans(n_init=1, random_state=0), random_state=42\n    ... 
)\n    >>> X_res, y_res = sm.fit_resample(X, y)\n    >>> # Find the number of new samples in the middle blob\n    >>> n_res_in_middle = ((X_res[:, 0] > -5) & (X_res[:, 0] < 5)).sum()\n    >>> print(\"Samples in the middle blob: %s\" % n_res_in_middle)\n    Samples in the middle blob: 801\n    >>> print(\"Middle blob unchanged: %s\" % (n_res_in_middle == blobs[1] + 1))\n    Middle blob unchanged: True\n    >>> print(\"More 0 samples: %s\" % ((y_res == 0).sum() > (y == 0).sum()))\n    More 0 samples: True\n    \"\"\"\n\n    _parameter_constraints: dict = {\n        **BaseSMOTE._parameter_constraints,\n        \"kmeans_estimator\": [\n            HasMethods([\"fit\", \"predict\"]),\n            Interval(numbers.Integral, 1, None, closed=\"left\"),\n            None,\n        ],\n        \"cluster_balance_threshold\": [StrOptions({\"auto\"}), numbers.Real],\n        \"density_exponent\": [StrOptions({\"auto\"}), numbers.Real],\n        \"n_jobs\": [numbers.Integral, None],\n    }\n\n    def __init__(\n        self,\n        *,\n        sampling_strategy=\"auto\",\n        random_state=None,\n        k_neighbors=2,\n        n_jobs=None,\n        kmeans_estimator=None,\n        cluster_balance_threshold=\"auto\",\n        density_exponent=\"auto\",\n    ):\n        super().__init__(\n            sampling_strategy=sampling_strategy,\n            random_state=random_state,\n            k_neighbors=k_neighbors,\n        )\n        self.kmeans_estimator = kmeans_estimator\n        self.cluster_balance_threshold = cluster_balance_threshold\n        self.density_exponent = density_exponent\n        self.n_jobs = n_jobs\n\n    def _validate_estimator(self):\n        super()._validate_estimator()\n        if self.kmeans_estimator is None:\n            self.kmeans_estimator_ = MiniBatchKMeans(random_state=self.random_state)\n        elif isinstance(self.kmeans_estimator, int):\n            self.kmeans_estimator_ = MiniBatchKMeans(\n                
n_clusters=self.kmeans_estimator,\n                random_state=self.random_state,\n            )\n        else:\n            self.kmeans_estimator_ = clone(self.kmeans_estimator)\n\n        self.cluster_balance_threshold_ = (\n            self.cluster_balance_threshold\n            if self.kmeans_estimator_.n_clusters != 1\n            else -np.inf\n        )\n\n    def _find_cluster_sparsity(self, X):\n        \"\"\"Compute the cluster sparsity.\"\"\"\n        euclidean_distances = pairwise_distances(\n            X, metric=\"euclidean\", n_jobs=self.n_jobs\n        )\n        # negate diagonal elements\n        for ind in range(X.shape[0]):\n            euclidean_distances[ind, ind] = 0\n\n        non_diag_elements = (X.shape[0] ** 2) - X.shape[0]\n        mean_distance = euclidean_distances.sum() / non_diag_elements\n        exponent = (\n            math.log(X.shape[0], 1.6) ** 1.8 * 0.16\n            if self.density_exponent == \"auto\"\n            else self.density_exponent\n        )\n        return (mean_distance**exponent) / X.shape[0]\n\n    def _fit_resample(self, X, y):\n        self._validate_estimator()\n        X_resampled = X.copy()\n        y_resampled = y.copy()\n        total_inp_samples = sum(self.sampling_strategy_.values())\n\n        for class_sample, n_samples in self.sampling_strategy_.items():\n            if n_samples == 0:\n                continue\n\n            X_clusters = self.kmeans_estimator_.fit_predict(X)\n            valid_clusters = []\n            cluster_sparsities = []\n\n            # identify cluster which are answering the requirements\n            for cluster_idx in range(self.kmeans_estimator_.n_clusters):\n                cluster_mask = np.flatnonzero(X_clusters == cluster_idx)\n\n                if cluster_mask.size == 0:\n                    # empty cluster\n                    continue\n\n                X_cluster = _safe_indexing(X, cluster_mask)\n                y_cluster = _safe_indexing(y, cluster_mask)\n\n    
            cluster_class_mean = (y_cluster == class_sample).mean()\n\n                if self.cluster_balance_threshold_ == \"auto\":\n                    balance_threshold = n_samples / total_inp_samples / 2\n                else:\n                    balance_threshold = self.cluster_balance_threshold_\n\n                # the cluster is already considered balanced\n                if cluster_class_mean < balance_threshold:\n                    continue\n\n                # not enough samples to apply SMOTE\n                anticipated_samples = cluster_class_mean * X_cluster.shape[0]\n                if anticipated_samples < self.nn_k_.n_neighbors:\n                    continue\n\n                X_cluster_class = _safe_indexing(\n                    X_cluster, np.flatnonzero(y_cluster == class_sample)\n                )\n\n                valid_clusters.append(cluster_mask)\n                cluster_sparsities.append(self._find_cluster_sparsity(X_cluster_class))\n\n            cluster_sparsities = np.array(cluster_sparsities)\n            cluster_weights = cluster_sparsities / cluster_sparsities.sum()\n\n            if not valid_clusters:\n                raise RuntimeError(\n                    \"No clusters found with sufficient samples of \"\n                    f\"class {class_sample}. 
Try lowering the \"\n                    \"cluster_balance_threshold or increasing the number of \"\n                    \"clusters.\"\n                )\n\n            for valid_cluster_idx, valid_cluster in enumerate(valid_clusters):\n                X_cluster = _safe_indexing(X, valid_cluster)\n                y_cluster = _safe_indexing(y, valid_cluster)\n\n                X_cluster_class = _safe_indexing(\n                    X_cluster, np.flatnonzero(y_cluster == class_sample)\n                )\n\n                self.nn_k_.fit(X_cluster_class)\n                nns = self.nn_k_.kneighbors(X_cluster_class, return_distance=False)[\n                    :, 1:\n                ]\n\n                cluster_n_samples = int(\n                    math.ceil(n_samples * cluster_weights[valid_cluster_idx])\n                )\n\n                X_new, y_new = self._make_samples(\n                    X_cluster_class,\n                    y.dtype,\n                    class_sample,\n                    X_cluster_class,\n                    nns,\n                    cluster_n_samples,\n                    1.0,\n                )\n\n                stack = [np.vstack, sparse.vstack][int(sparse.issparse(X_new))]\n                X_resampled = stack((X_resampled, X_new))\n                y_resampled = np.hstack((y_resampled, y_new))\n\n        return X_resampled, y_resampled\n"
  },
  {
    "path": "imblearn/over_sampling/_smote/filter.py",
    "content": "\"\"\"SMOTE variant applying some filtering before the generation process.\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Fernando Nogueira\n#          Christos Aridas\n#          Dzianis Dudnik\n# License: MIT\n\nimport numbers\n\nimport numpy as np\nfrom scipy import sparse\nfrom sklearn.base import clone\nfrom sklearn.svm import SVC\nfrom sklearn.utils import _safe_indexing, check_random_state\nfrom sklearn.utils._param_validation import HasMethods, Interval, StrOptions\n\nfrom imblearn.over_sampling._smote.base import BaseSMOTE\nfrom imblearn.over_sampling.base import BaseOverSampler\nfrom imblearn.utils import Substitution, check_neighbors_object\nfrom imblearn.utils._docstring import _random_state_docstring\n\n\n@Substitution(\n    sampling_strategy=BaseOverSampler._sampling_strategy_docstring,\n    random_state=_random_state_docstring,\n)\nclass BorderlineSMOTE(BaseSMOTE):\n    \"\"\"Over-sampling using Borderline SMOTE.\n\n    This algorithm is a variant of the original SMOTE algorithm proposed in\n    [2]_. Borderline samples will be detected and used to generate new\n    synthetic samples.\n\n    Read more in the :ref:`User Guide <smote_adasyn>`.\n\n    .. versionadded:: 0.4\n\n    Parameters\n    ----------\n    {sampling_strategy}\n\n    {random_state}\n\n    k_neighbors : int or object, default=5\n        The nearest neighbors used to define the neighborhood of samples to use\n        to generate the synthetic samples. You can pass:\n\n        - an `int` corresponding to the number of neighbors to use. A\n          `~sklearn.neighbors.NearestNeighbors` instance will be fitted in this\n          case.\n        - an instance of a compatible nearest neighbors algorithm that should\n          implement both methods `kneighbors` and `kneighbors_graph`. 
For\n          instance, it could correspond to a\n          :class:`~sklearn.neighbors.NearestNeighbors` but could be extended to\n          any compatible class.\n\n    m_neighbors : int or object, default=10\n        The nearest neighbors used to determine if a minority sample is in\n        \"danger\". You can pass:\n\n        - an `int` corresponding to the number of neighbors to use. A\n          `~sklearn.neighbors.NearestNeighbors` instance will be fitted in this\n          case.\n        - an instance of a compatible nearest neighbors algorithm that should\n          implement both methods `kneighbors` and `kneighbors_graph`. For\n          instance, it could correspond to a\n          :class:`~sklearn.neighbors.NearestNeighbors` but could be extended to\n          any compatible class.\n\n    kind : {{\"borderline-1\", \"borderline-2\"}}, default='borderline-1'\n        The type of SMOTE algorithm to use one of the following options:\n        ``'borderline-1'``, ``'borderline-2'``.\n\n    Attributes\n    ----------\n    sampling_strategy_ : dict\n        Dictionary containing the information to sample the dataset. The keys\n        corresponds to the class labels from which to sample and the values\n        are the number of samples to sample.\n\n    nn_k_ : estimator object\n        Validated k-nearest neighbours created from the `k_neighbors` parameter.\n\n    nn_m_ : estimator object\n        Validated m-nearest neighbours created from the `m_neighbors` parameter.\n\n    in_danger_indices : dict of ndarray\n        Dictionary containing the indices of the samples considered in danger that\n        are used to generate new synthetic samples. The keys corresponds to the class\n        label.\n\n    n_features_in_ : int\n        Number of features in the input dataset.\n\n        .. versionadded:: 0.9\n\n    feature_names_in_ : ndarray of shape (`n_features_in_`,)\n        Names of features seen during `fit`. 
Defined only when `X` has feature\n        names that are all strings.\n\n        .. versionadded:: 0.10\n\n    See Also\n    --------\n    SMOTE : Over-sample using SMOTE.\n\n    SMOTENC : Over-sample using SMOTE for continuous and categorical features.\n\n    SVMSMOTE : Over-sample using SVM-SMOTE variant.\n\n    ADASYN : Over-sample using ADASYN.\n\n    KMeansSMOTE : Over-sample applying a clustering before to oversample using\n        SMOTE.\n\n    Notes\n    -----\n    See the original papers: [2]_ for more details.\n\n    Supports multi-class resampling. A one-vs.-rest scheme is used as\n    originally proposed in [1]_.\n\n    References\n    ----------\n    .. [1] N. V. Chawla, K. W. Bowyer, L. O.Hall, W. P. Kegelmeyer, \"SMOTE:\n       synthetic minority over-sampling technique,\" Journal of artificial\n       intelligence research, 321-357, 2002.\n\n    .. [2] H. Han, W. Wen-Yuan, M. Bing-Huan, \"Borderline-SMOTE: a new\n       over-sampling method in imbalanced data sets learning,\" Advances in\n       intelligent computing, 878-887, 2005.\n\n    Examples\n    --------\n    >>> from collections import Counter\n    >>> from sklearn.datasets import make_classification\n    >>> from imblearn.over_sampling import BorderlineSMOTE\n    >>> X, y = make_classification(n_classes=2, class_sep=2,\n    ... weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,\n    ... 
n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)\n    >>> print('Original dataset shape %s' % Counter(y))\n    Original dataset shape Counter({{1: 900, 0: 100}})\n    >>> sm = BorderlineSMOTE(random_state=42)\n    >>> X_res, y_res = sm.fit_resample(X, y)\n    >>> print('Resampled dataset shape %s' % Counter(y_res))\n    Resampled dataset shape Counter({{0: 900, 1: 900}})\n    \"\"\"\n\n    _parameter_constraints: dict = {\n        **BaseSMOTE._parameter_constraints,\n        \"m_neighbors\": [\n            Interval(numbers.Integral, 1, None, closed=\"left\"),\n            HasMethods([\"kneighbors\", \"kneighbors_graph\"]),\n        ],\n        \"kind\": [StrOptions({\"borderline-1\", \"borderline-2\"})],\n    }\n\n    def __init__(\n        self,\n        *,\n        sampling_strategy=\"auto\",\n        random_state=None,\n        k_neighbors=5,\n        m_neighbors=10,\n        kind=\"borderline-1\",\n    ):\n        super().__init__(\n            sampling_strategy=sampling_strategy,\n            random_state=random_state,\n            k_neighbors=k_neighbors,\n        )\n        self.m_neighbors = m_neighbors\n        self.kind = kind\n\n    def _validate_estimator(self):\n        super()._validate_estimator()\n        self.nn_m_ = check_neighbors_object(\n            \"m_neighbors\", self.m_neighbors, additional_neighbor=1\n        )\n\n    def _fit_resample(self, X, y):\n        self._validate_estimator()\n\n        X_resampled = X.copy()\n        y_resampled = y.copy()\n\n        self.in_danger_indices = {}\n        for class_sample, n_samples in self.sampling_strategy_.items():\n            if n_samples == 0:\n                continue\n            target_class_indices = np.flatnonzero(y == class_sample)\n            X_class = _safe_indexing(X, target_class_indices)\n\n            self.nn_m_.fit(X)\n            mask_danger = self._in_danger_noise(\n                self.nn_m_, X_class, class_sample, y, kind=\"danger\"\n            )\n     
       if not any(mask_danger):\n                continue\n            X_danger = _safe_indexing(X_class, mask_danger)\n            self.in_danger_indices[class_sample] = target_class_indices[mask_danger]\n\n            if self.kind == \"borderline-1\":\n                X_to_sample_from = X_class  # consider the positive class only\n                y_to_check_neighbors = None\n            else:  # self.kind == \"borderline-2\"\n                X_to_sample_from = X  # consider the whole dataset\n                y_to_check_neighbors = y\n\n            self.nn_k_.fit(X_to_sample_from)\n            nns = self.nn_k_.kneighbors(X_danger, return_distance=False)[:, 1:]\n            X_new, y_new = self._make_samples(\n                X_danger,\n                y.dtype,\n                class_sample,\n                X_to_sample_from,\n                nns,\n                n_samples,\n                y=y_to_check_neighbors,\n            )\n            if sparse.issparse(X_new):\n                X_resampled = sparse.vstack([X_resampled, X_new])\n            else:\n                X_resampled = np.vstack((X_resampled, X_new))\n            y_resampled = np.hstack((y_resampled, y_new))\n\n        return X_resampled, y_resampled\n\n\n@Substitution(\n    sampling_strategy=BaseOverSampler._sampling_strategy_docstring,\n    random_state=_random_state_docstring,\n)\nclass SVMSMOTE(BaseSMOTE):\n    \"\"\"Over-sampling using SVM-SMOTE.\n\n    Variant of SMOTE algorithm which use an SVM algorithm to detect sample to\n    use for generating new synthetic samples as proposed in [2]_.\n\n    Read more in the :ref:`User Guide <smote_adasyn>`.\n\n    .. versionadded:: 0.4\n\n    Parameters\n    ----------\n    {sampling_strategy}\n\n    {random_state}\n\n    k_neighbors : int or object, default=5\n        The nearest neighbors used to define the neighborhood of samples to use\n        to generate the synthetic samples. 
You can pass:\n\n        - an `int` corresponding to the number of neighbors to use. A\n          `~sklearn.neighbors.NearestNeighbors` instance will be fitted in this\n          case.\n        - an instance of a compatible nearest neighbors algorithm that should\n          implement both methods `kneighbors` and `kneighbors_graph`. For\n          instance, it could correspond to a\n          :class:`~sklearn.neighbors.NearestNeighbors` but could be extended to\n          any compatible class.\n\n    m_neighbors : int or object, default=10\n        The nearest neighbors used to determine if a minority sample is in\n        \"danger\". You can pass:\n\n        - an `int` corresponding to the number of neighbors to use. A\n          `~sklearn.neighbors.NearestNeighbors` instance will be fitted in this\n          case.\n        - an instance of a compatible nearest neighbors algorithm that should\n          implement both methods `kneighbors` and `kneighbors_graph`. For\n          instance, it could correspond to a\n          :class:`~sklearn.neighbors.NearestNeighbors` but could be extended to\n          any compatible class.\n\n    svm_estimator : estimator object, default=SVC()\n        A parametrized :class:`~sklearn.svm.SVC` classifier can be passed.\n        A scikit-learn compatible estimator can be passed but it is required\n        to expose a `support_` fitted attribute.\n\n    out_step : float, default=0.5\n        Step size when extrapolating.\n\n    Attributes\n    ----------\n    sampling_strategy_ : dict\n        Dictionary containing the information to sample the dataset. 
The keys\n        correspond to the class labels from which to sample and the values\n        are the number of samples to sample.\n\n    nn_k_ : estimator object\n        Validated k-nearest neighbours created from the `k_neighbors` parameter.\n\n    nn_m_ : estimator object\n        Validated m-nearest neighbours created from the `m_neighbors` parameter.\n\n    svm_estimator_ : estimator object\n        The validated SVM classifier used to detect samples from which to\n        generate new synthetic samples.\n\n    n_features_in_ : int\n        Number of features in the input dataset.\n\n        .. versionadded:: 0.9\n\n    feature_names_in_ : ndarray of shape (`n_features_in_`,)\n        Names of features seen during `fit`. Defined only when `X` has feature\n        names that are all strings.\n\n        .. versionadded:: 0.10\n\n    See Also\n    --------\n    SMOTE : Over-sample using SMOTE.\n\n    SMOTENC : Over-sample using SMOTE for continuous and categorical features.\n\n    SMOTEN : Over-sample using the SMOTE variant specifically for categorical\n        features only.\n\n    BorderlineSMOTE : Over-sample using Borderline-SMOTE.\n\n    ADASYN : Over-sample using ADASYN.\n\n    KMeansSMOTE : Over-sample by applying clustering before oversampling with\n        SMOTE.\n\n    Notes\n    -----\n    See the original paper [2]_ for more details.\n\n    Supports multi-class resampling. A one-vs.-rest scheme is used as\n    originally proposed in [1]_.\n\n    References\n    ----------\n    .. [1] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, \"SMOTE:\n       synthetic minority over-sampling technique,\" Journal of artificial\n       intelligence research, 321-357, 2002.\n\n    .. [2] H. M. Nguyen, E. W. Cooper, K. 
Kamei, \"Borderline over-sampling for\n       imbalanced data classification,\" International Journal of Knowledge\n       Engineering and Soft Data Paradigms, 3(1), pp.4-21, 2009.\n\n    Examples\n    --------\n    >>> from collections import Counter\n    >>> from sklearn.datasets import make_classification\n    >>> from imblearn.over_sampling import SVMSMOTE\n    >>> X, y = make_classification(n_classes=2, class_sep=2,\n    ... weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,\n    ... n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)\n    >>> print('Original dataset shape %s' % Counter(y))\n    Original dataset shape Counter({{1: 900, 0: 100}})\n    >>> sm = SVMSMOTE(random_state=42)\n    >>> X_res, y_res = sm.fit_resample(X, y)\n    >>> print('Resampled dataset shape %s' % Counter(y_res))\n    Resampled dataset shape Counter({{0: 900, 1: 900}})\n    \"\"\"\n\n    _parameter_constraints: dict = {\n        **BaseSMOTE._parameter_constraints,\n        \"m_neighbors\": [\n            Interval(numbers.Integral, 1, None, closed=\"left\"),\n            HasMethods([\"kneighbors\", \"kneighbors_graph\"]),\n        ],\n        \"svm_estimator\": [HasMethods([\"fit\", \"predict\"]), None],\n        \"out_step\": [Interval(numbers.Real, 0, 1, closed=\"both\")],\n    }\n\n    def __init__(\n        self,\n        *,\n        sampling_strategy=\"auto\",\n        random_state=None,\n        k_neighbors=5,\n        m_neighbors=10,\n        svm_estimator=None,\n        out_step=0.5,\n    ):\n        super().__init__(\n            sampling_strategy=sampling_strategy,\n            random_state=random_state,\n            k_neighbors=k_neighbors,\n        )\n        self.m_neighbors = m_neighbors\n        self.svm_estimator = svm_estimator\n        self.out_step = out_step\n\n    def _validate_estimator(self):\n        super()._validate_estimator()\n        self.nn_m_ = check_neighbors_object(\n            \"m_neighbors\", self.m_neighbors, 
additional_neighbor=1\n        )\n\n        if self.svm_estimator is None:\n            self.svm_estimator_ = SVC(gamma=\"scale\", random_state=self.random_state)\n        else:\n            self.svm_estimator_ = clone(self.svm_estimator)\n\n    def _fit_resample(self, X, y):\n        self._validate_estimator()\n        random_state = check_random_state(self.random_state)\n        X_resampled = X.copy()\n        y_resampled = y.copy()\n\n        for class_sample, n_samples in self.sampling_strategy_.items():\n            if n_samples == 0:\n                continue\n            target_class_indices = np.flatnonzero(y == class_sample)\n            X_class = _safe_indexing(X, target_class_indices)\n\n            self.svm_estimator_.fit(X, y)\n            if not hasattr(self.svm_estimator_, \"support_\"):\n                raise RuntimeError(\n                    \"`svm_estimator` is required to exposed a `support_` fitted \"\n                    \"attribute. Such an estimator belongs to the family of Support \"\n                    \"Vector Machines.\"\n                )\n            support_index = self.svm_estimator_.support_[\n                y[self.svm_estimator_.support_] == class_sample\n            ]\n            support_vector = _safe_indexing(X, support_index)\n\n            self.nn_m_.fit(X)\n            noise_bool = self._in_danger_noise(\n                self.nn_m_, support_vector, class_sample, y, kind=\"noise\"\n            )\n            support_vector = _safe_indexing(\n                support_vector, np.flatnonzero(np.logical_not(noise_bool))\n            )\n            if support_vector.shape[0] == 0:\n                raise ValueError(\n                    \"All support vectors are considered as noise. SVM-SMOTE is not \"\n                    \"adapted to your dataset. 
Try another SMOTE variant.\"\n                )\n            danger_bool = self._in_danger_noise(\n                self.nn_m_, support_vector, class_sample, y, kind=\"danger\"\n            )\n            safety_bool = np.logical_not(danger_bool)\n\n            self.nn_k_.fit(X_class)\n            fractions = random_state.beta(10, 10)\n            n_generated_samples = int(fractions * (n_samples + 1))\n            if np.count_nonzero(danger_bool) > 0:\n                nns = self.nn_k_.kneighbors(\n                    _safe_indexing(support_vector, np.flatnonzero(danger_bool)),\n                    return_distance=False,\n                )[:, 1:]\n\n                X_new_1, y_new_1 = self._make_samples(\n                    _safe_indexing(support_vector, np.flatnonzero(danger_bool)),\n                    y.dtype,\n                    class_sample,\n                    X_class,\n                    nns,\n                    n_generated_samples,\n                    step_size=1.0,\n                )\n\n            if np.count_nonzero(safety_bool) > 0:\n                nns = self.nn_k_.kneighbors(\n                    _safe_indexing(support_vector, np.flatnonzero(safety_bool)),\n                    return_distance=False,\n                )[:, 1:]\n\n                X_new_2, y_new_2 = self._make_samples(\n                    _safe_indexing(support_vector, np.flatnonzero(safety_bool)),\n                    y.dtype,\n                    class_sample,\n                    X_class,\n                    nns,\n                    n_samples - n_generated_samples,\n                    step_size=-self.out_step,\n                )\n\n            if np.count_nonzero(danger_bool) > 0 and np.count_nonzero(safety_bool) > 0:\n                if sparse.issparse(X_resampled):\n                    X_resampled = sparse.vstack([X_resampled, X_new_1, X_new_2])\n                else:\n                    X_resampled = np.vstack((X_resampled, X_new_1, X_new_2))\n                y_resampled = 
np.concatenate((y_resampled, y_new_1, y_new_2), axis=0)\n            elif np.count_nonzero(danger_bool) == 0:\n                if sparse.issparse(X_resampled):\n                    X_resampled = sparse.vstack([X_resampled, X_new_2])\n                else:\n                    X_resampled = np.vstack((X_resampled, X_new_2))\n                y_resampled = np.concatenate((y_resampled, y_new_2), axis=0)\n            elif np.count_nonzero(safety_bool) == 0:\n                if sparse.issparse(X_resampled):\n                    X_resampled = sparse.vstack([X_resampled, X_new_1])\n                else:\n                    X_resampled = np.vstack((X_resampled, X_new_1))\n                y_resampled = np.concatenate((y_resampled, y_new_1), axis=0)\n\n        return X_resampled, y_resampled\n"
  },
  {
    "path": "imblearn/over_sampling/_smote/tests/__init__.py",
    "content": ""
  },
  {
    "path": "imblearn/over_sampling/_smote/tests/test_borderline_smote.py",
    "content": "from collections import Counter\n\nimport pytest\nfrom sklearn.datasets import make_classification\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.utils._testing import assert_allclose, assert_array_equal\n\nfrom imblearn.over_sampling import BorderlineSMOTE\n\n\n@pytest.mark.parametrize(\"kind\", [\"borderline-1\", \"borderline-2\"])\ndef test_borderline_smote_no_in_danger_samples(kind):\n    \"\"\"Check that the algorithm behave properly even on a dataset without any sample\n    in danger.\n    \"\"\"\n    X, y = make_classification(\n        n_samples=500,\n        n_features=2,\n        n_informative=2,\n        n_redundant=0,\n        n_repeated=0,\n        n_clusters_per_class=1,\n        n_classes=3,\n        weights=[0.1, 0.2, 0.7],\n        class_sep=1.5,\n        random_state=1,\n    )\n    smote = BorderlineSMOTE(kind=kind, m_neighbors=3, k_neighbors=5, random_state=0)\n    X_res, y_res = smote.fit_resample(X, y)\n\n    assert_allclose(X, X_res)\n    assert_allclose(y, y_res)\n    assert not smote.in_danger_indices\n\n\ndef test_borderline_smote_kind():\n    \"\"\"Check the behaviour of the `kind` parameter.\n\n    In short, \"borderline-2\" generates sample closer to the boundary decision than\n    \"borderline-1\". 
We generate an example where a logistic regression will perform\n    worse on \"borderline-2\" than on \"borderline-1\".\n    \"\"\"\n    X, y = make_classification(\n        n_samples=500,\n        n_features=2,\n        n_informative=2,\n        n_redundant=0,\n        n_repeated=0,\n        n_clusters_per_class=1,\n        n_classes=3,\n        weights=[0.1, 0.2, 0.7],\n        class_sep=1.0,\n        random_state=1,\n    )\n    smote = BorderlineSMOTE(\n        kind=\"borderline-1\", m_neighbors=9, k_neighbors=5, random_state=0\n    )\n    X_res_borderline_1, y_res_borderline_1 = smote.fit_resample(X, y)\n    smote.set_params(kind=\"borderline-2\")\n    X_res_borderline_2, y_res_borderline_2 = smote.fit_resample(X, y)\n\n    score_borderline_1 = (\n        LogisticRegression()\n        .fit(X_res_borderline_1, y_res_borderline_1)\n        .score(X_res_borderline_1, y_res_borderline_1)\n    )\n    score_borderline_2 = (\n        LogisticRegression()\n        .fit(X_res_borderline_2, y_res_borderline_2)\n        .score(X_res_borderline_2, y_res_borderline_2)\n    )\n    assert score_borderline_1 > score_borderline_2\n\n\ndef test_borderline_smote_in_danger():\n    X, y = make_classification(\n        n_samples=500,\n        n_features=2,\n        n_informative=2,\n        n_redundant=0,\n        n_repeated=0,\n        n_clusters_per_class=1,\n        n_classes=3,\n        weights=[0.1, 0.2, 0.7],\n        class_sep=0.8,\n        random_state=1,\n    )\n    smote = BorderlineSMOTE(\n        kind=\"borderline-1\",\n        m_neighbors=9,\n        k_neighbors=5,\n        random_state=0,\n    )\n    _, y_res_1 = smote.fit_resample(X, y)\n    in_danger_indices_borderline_1 = smote.in_danger_indices\n    smote.set_params(kind=\"borderline-2\")\n    _, y_res_2 = smote.fit_resample(X, y)\n    in_danger_indices_borderline_2 = smote.in_danger_indices\n\n    for key1, key2 in zip(\n        in_danger_indices_borderline_1, in_danger_indices_borderline_2\n    ):\n        
assert_array_equal(\n            in_danger_indices_borderline_1[key1], in_danger_indices_borderline_2[key2]\n        )\n    assert len(in_danger_indices_borderline_1) == len(in_danger_indices_borderline_2)\n    counter = Counter(y_res_1)\n    assert counter[0] == counter[1] == counter[2]\n    counter = Counter(y_res_2)\n    assert counter[0] == counter[1] == counter[2]\n"
  },
  {
    "path": "imblearn/over_sampling/_smote/tests/test_kmeans_smote.py",
    "content": "import numpy as np\nimport pytest\nfrom sklearn.cluster import KMeans, MiniBatchKMeans\nfrom sklearn.datasets import make_classification\nfrom sklearn.neighbors import NearestNeighbors\nfrom sklearn.utils._testing import assert_allclose, assert_array_equal\n\nfrom imblearn.over_sampling import SMOTE, KMeansSMOTE\n\n\n@pytest.fixture\ndef data():\n    X = np.array(\n        [\n            [0.11622591, -0.0317206],\n            [0.77481731, 0.60935141],\n            [1.25192108, -0.22367336],\n            [0.53366841, -0.30312976],\n            [1.52091956, -0.49283504],\n            [-0.28162401, -2.10400981],\n            [0.83680821, 1.72827342],\n            [0.3084254, 0.33299982],\n            [0.70472253, -0.73309052],\n            [0.28893132, -0.38761769],\n            [1.15514042, 0.0129463],\n            [0.88407872, 0.35454207],\n            [1.31301027, -0.92648734],\n            [-1.11515198, -0.93689695],\n            [-0.18410027, -0.45194484],\n            [0.9281014, 0.53085498],\n            [-0.14374509, 0.27370049],\n            [-0.41635887, -0.38299653],\n            [0.08711622, 0.93259929],\n            [1.70580611, -0.11219234],\n        ]\n    )\n    y = np.array([0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0])\n    return X, y\n\n\n@pytest.mark.filterwarnings(\"ignore:The default value of `n_init` will change\")\ndef test_kmeans_smote(data):\n    X, y = data\n    kmeans_smote = KMeansSMOTE(\n        kmeans_estimator=1,\n        random_state=42,\n        cluster_balance_threshold=0.0,\n        k_neighbors=5,\n    )\n    smote = SMOTE(random_state=42)\n\n    X_res_1, y_res_1 = kmeans_smote.fit_resample(X, y)\n    X_res_2, y_res_2 = smote.fit_resample(X, y)\n\n    assert_allclose(X_res_1, X_res_2)\n    assert_array_equal(y_res_1, y_res_2)\n\n    assert kmeans_smote.nn_k_.n_neighbors == 6\n    assert kmeans_smote.kmeans_estimator_.n_clusters == 1\n    assert \"batch_size\" in 
kmeans_smote.kmeans_estimator_.get_params()\n\n\n@pytest.mark.filterwarnings(\"ignore:The default value of `n_init` will change\")\n@pytest.mark.parametrize(\"k_neighbors\", [2, NearestNeighbors(n_neighbors=3)])\n@pytest.mark.parametrize(\n    \"kmeans_estimator\",\n    [\n        3,\n        KMeans(n_clusters=3, n_init=1, random_state=42),\n        MiniBatchKMeans(n_clusters=3, n_init=1, random_state=42),\n    ],\n)\ndef test_sample_kmeans_custom(data, k_neighbors, kmeans_estimator):\n    X, y = data\n    kmeans_smote = KMeansSMOTE(\n        random_state=42,\n        kmeans_estimator=kmeans_estimator,\n        k_neighbors=k_neighbors,\n    )\n    X_resampled, y_resampled = kmeans_smote.fit_resample(X, y)\n    assert X_resampled.shape == (24, 2)\n    assert y_resampled.shape == (24,)\n\n    assert kmeans_smote.nn_k_.n_neighbors == 3\n    assert kmeans_smote.kmeans_estimator_.n_clusters == 3\n\n\n@pytest.mark.filterwarnings(\"ignore:The default value of `n_init` will change\")\ndef test_sample_kmeans_not_enough_clusters(data):\n    X, y = data\n    smote = KMeansSMOTE(cluster_balance_threshold=10, random_state=42)\n    with pytest.raises(RuntimeError):\n        smote.fit_resample(X, y)\n\n\n@pytest.mark.parametrize(\"density_exponent\", [\"auto\", 10])\n@pytest.mark.parametrize(\"cluster_balance_threshold\", [\"auto\", 0.1])\ndef test_sample_kmeans_density_estimation(density_exponent, cluster_balance_threshold):\n    X, y = make_classification(\n        n_samples=10_000, n_classes=2, weights=[0.3, 0.7], random_state=42\n    )\n    smote = KMeansSMOTE(\n        kmeans_estimator=MiniBatchKMeans(n_init=1, random_state=42),\n        random_state=0,\n        density_exponent=density_exponent,\n        cluster_balance_threshold=cluster_balance_threshold,\n    )\n    smote.fit_resample(X, y)\n"
  },
  {
    "path": "imblearn/over_sampling/_smote/tests/test_smote.py",
    "content": "\"\"\"Test the module SMOTE.\"\"\"\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Christos Aridas\n# License: MIT\n\nimport numpy as np\nfrom sklearn.neighbors import NearestNeighbors\nfrom sklearn.utils._testing import assert_allclose, assert_array_equal\n\nfrom imblearn.over_sampling import SMOTE\n\nRND_SEED = 0\nX = np.array(\n    [\n        [0.11622591, -0.0317206],\n        [0.77481731, 0.60935141],\n        [1.25192108, -0.22367336],\n        [0.53366841, -0.30312976],\n        [1.52091956, -0.49283504],\n        [-0.28162401, -2.10400981],\n        [0.83680821, 1.72827342],\n        [0.3084254, 0.33299982],\n        [0.70472253, -0.73309052],\n        [0.28893132, -0.38761769],\n        [1.15514042, 0.0129463],\n        [0.88407872, 0.35454207],\n        [1.31301027, -0.92648734],\n        [-1.11515198, -0.93689695],\n        [-0.18410027, -0.45194484],\n        [0.9281014, 0.53085498],\n        [-0.14374509, 0.27370049],\n        [-0.41635887, -0.38299653],\n        [0.08711622, 0.93259929],\n        [1.70580611, -0.11219234],\n    ]\n)\nY = np.array([0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0])\nR_TOL = 1e-4\n\n\ndef test_sample_regular():\n    smote = SMOTE(random_state=RND_SEED)\n    X_resampled, y_resampled = smote.fit_resample(X, Y)\n    X_gt = np.array(\n        [\n            [0.11622591, -0.0317206],\n            [0.77481731, 0.60935141],\n            [1.25192108, -0.22367336],\n            [0.53366841, -0.30312976],\n            [1.52091956, -0.49283504],\n            [-0.28162401, -2.10400981],\n            [0.83680821, 1.72827342],\n            [0.3084254, 0.33299982],\n            [0.70472253, -0.73309052],\n            [0.28893132, -0.38761769],\n            [1.15514042, 0.0129463],\n            [0.88407872, 0.35454207],\n            [1.31301027, -0.92648734],\n            [-1.11515198, -0.93689695],\n            [-0.18410027, -0.45194484],\n            [0.9281014, 0.53085498],\n            
[-0.14374509, 0.27370049],\n            [-0.41635887, -0.38299653],\n            [0.08711622, 0.93259929],\n            [1.70580611, -0.11219234],\n            [0.29307743, -0.14670439],\n            [0.84976473, -0.15570176],\n            [0.61319159, -0.11571668],\n            [0.66052536, -0.28246517],\n        ]\n    )\n    y_gt = np.array(\n        [0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0]\n    )\n    assert_allclose(X_resampled, X_gt, rtol=R_TOL)\n    assert_array_equal(y_resampled, y_gt)\n\n\ndef test_sample_regular_half():\n    sampling_strategy = {0: 9, 1: 12}\n    smote = SMOTE(sampling_strategy=sampling_strategy, random_state=RND_SEED)\n    X_resampled, y_resampled = smote.fit_resample(X, Y)\n    X_gt = np.array(\n        [\n            [0.11622591, -0.0317206],\n            [0.77481731, 0.60935141],\n            [1.25192108, -0.22367336],\n            [0.53366841, -0.30312976],\n            [1.52091956, -0.49283504],\n            [-0.28162401, -2.10400981],\n            [0.83680821, 1.72827342],\n            [0.3084254, 0.33299982],\n            [0.70472253, -0.73309052],\n            [0.28893132, -0.38761769],\n            [1.15514042, 0.0129463],\n            [0.88407872, 0.35454207],\n            [1.31301027, -0.92648734],\n            [-1.11515198, -0.93689695],\n            [-0.18410027, -0.45194484],\n            [0.9281014, 0.53085498],\n            [-0.14374509, 0.27370049],\n            [-0.41635887, -0.38299653],\n            [0.08711622, 0.93259929],\n            [1.70580611, -0.11219234],\n            [0.36784496, -0.1953161],\n        ]\n    )\n    y_gt = np.array([0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0])\n    assert_allclose(X_resampled, X_gt, rtol=R_TOL)\n    assert_array_equal(y_resampled, y_gt)\n\n\ndef test_sample_regular_with_nn():\n    nn_k = NearestNeighbors(n_neighbors=6)\n    smote = SMOTE(random_state=RND_SEED, k_neighbors=nn_k)\n    X_resampled, y_resampled = 
smote.fit_resample(X, Y)\n    X_gt = np.array(\n        [\n            [0.11622591, -0.0317206],\n            [0.77481731, 0.60935141],\n            [1.25192108, -0.22367336],\n            [0.53366841, -0.30312976],\n            [1.52091956, -0.49283504],\n            [-0.28162401, -2.10400981],\n            [0.83680821, 1.72827342],\n            [0.3084254, 0.33299982],\n            [0.70472253, -0.73309052],\n            [0.28893132, -0.38761769],\n            [1.15514042, 0.0129463],\n            [0.88407872, 0.35454207],\n            [1.31301027, -0.92648734],\n            [-1.11515198, -0.93689695],\n            [-0.18410027, -0.45194484],\n            [0.9281014, 0.53085498],\n            [-0.14374509, 0.27370049],\n            [-0.41635887, -0.38299653],\n            [0.08711622, 0.93259929],\n            [1.70580611, -0.11219234],\n            [0.29307743, -0.14670439],\n            [0.84976473, -0.15570176],\n            [0.61319159, -0.11571668],\n            [0.66052536, -0.28246517],\n        ]\n    )\n    y_gt = np.array(\n        [0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0]\n    )\n    assert_allclose(X_resampled, X_gt, rtol=R_TOL)\n    assert_array_equal(y_resampled, y_gt)\n"
  },
  {
    "path": "imblearn/over_sampling/_smote/tests/test_smote_nc.py",
    "content": "\"\"\"Test the module SMOTENC.\"\"\"\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Christos Aridas\n#          Dzianis Dudnik\n# License: MIT\n\nfrom collections import Counter\n\nimport numpy as np\nimport pytest\nfrom scipy import sparse\nfrom sklearn.datasets import make_classification\nfrom sklearn.preprocessing import OneHotEncoder\nfrom sklearn.utils._testing import assert_allclose, assert_array_equal\n\nfrom imblearn.over_sampling import SMOTENC\n\n\ndef data_heterogneous_ordered():\n    rng = np.random.RandomState(42)\n    X = np.empty((30, 4), dtype=object)\n    # create 2 random continuous feature\n    X[:, :2] = rng.randn(30, 2)\n    # create a categorical feature using some string\n    X[:, 2] = rng.choice([\"a\", \"b\", \"c\"], size=30).astype(object)\n    # create a categorical feature using some integer\n    X[:, 3] = rng.randint(3, size=30)\n    y = np.array([0] * 10 + [1] * 20)\n    # return the categories\n    return X, y, [2, 3]\n\n\ndef data_heterogneous_unordered():\n    rng = np.random.RandomState(42)\n    X = np.empty((30, 4), dtype=object)\n    # create 2 random continuous feature\n    X[:, [1, 2]] = rng.randn(30, 2)\n    # create a categorical feature using some string\n    X[:, 0] = rng.choice([\"a\", \"b\", \"c\"], size=30).astype(object)\n    # create a categorical feature using some integer\n    X[:, 3] = rng.randint(3, size=30)\n    y = np.array([0] * 10 + [1] * 20)\n    # return the categories\n    return X, y, [0, 3]\n\n\ndef data_heterogneous_masked():\n    rng = np.random.RandomState(42)\n    X = np.empty((30, 4), dtype=object)\n    # create 2 random continuous feature\n    X[:, [1, 2]] = rng.randn(30, 2)\n    # create a categorical feature using some string\n    X[:, 0] = rng.choice([\"a\", \"b\", \"c\"], size=30).astype(object)\n    # create a categorical feature using some integer\n    X[:, 3] = rng.randint(3, size=30)\n    y = np.array([0] * 10 + [1] * 20)\n    # return the categories\n    
return X, y, [True, False, False, True]\n\n\ndef data_heterogneous_unordered_multiclass():\n    rng = np.random.RandomState(42)\n    X = np.empty((50, 4), dtype=object)\n    # create 2 random continuous feature\n    X[:, [1, 2]] = rng.randn(50, 2)\n    # create a categorical feature using some string\n    X[:, 0] = rng.choice([\"a\", \"b\", \"c\"], size=50).astype(object)\n    # create a categorical feature using some integer\n    X[:, 3] = rng.randint(3, size=50)\n    y = np.array([0] * 10 + [1] * 15 + [2] * 25)\n    # return the categories\n    return X, y, [0, 3]\n\n\ndef data_sparse(format):\n    rng = np.random.RandomState(42)\n    X = np.empty((30, 4), dtype=np.float64)\n    # create 2 random continuous feature\n    X[:, [1, 2]] = rng.randn(30, 2)\n    # create a categorical feature using some string\n    X[:, 0] = rng.randint(3, size=30)\n    # create a categorical feature using some integer\n    X[:, 3] = rng.randint(3, size=30)\n    y = np.array([0] * 10 + [1] * 20)\n    X = sparse.csr_matrix(X) if format == \"csr\" else sparse.csc_matrix(X)\n    return X, y, [0, 3]\n\n\ndef test_smotenc_error():\n    X, y, _ = data_heterogneous_unordered()\n    categorical_features = [0, 10]\n    smote = SMOTENC(random_state=0, categorical_features=categorical_features)\n    with pytest.raises(ValueError, match=\"all features must be in\"):\n        smote.fit_resample(X, y)\n\n\n@pytest.mark.parametrize(\n    \"data\",\n    [\n        data_heterogneous_ordered(),\n        data_heterogneous_unordered(),\n        data_heterogneous_masked(),\n        data_sparse(\"csr\"),\n        data_sparse(\"csc\"),\n    ],\n)\ndef test_smotenc(data):\n    X, y, categorical_features = data\n    smote = SMOTENC(random_state=0, categorical_features=categorical_features)\n    X_resampled, y_resampled = smote.fit_resample(X, y)\n\n    assert X_resampled.dtype == X.dtype\n\n    categorical_features = np.array(categorical_features)\n    if categorical_features.dtype == bool:\n        
categorical_features = np.flatnonzero(categorical_features)\n    for cat_idx in categorical_features:\n        if sparse.issparse(X):\n            assert set(X[:, cat_idx].data) == set(X_resampled[:, cat_idx].data)\n            assert X[:, cat_idx].dtype == X_resampled[:, cat_idx].dtype\n        else:\n            assert set(X[:, cat_idx]) == set(X_resampled[:, cat_idx])\n            assert X[:, cat_idx].dtype == X_resampled[:, cat_idx].dtype\n\n    assert isinstance(smote.median_std_, dict)\n\n\n# part of the common test which apply to SMOTE-NC even if it is not default\n# constructible\ndef test_smotenc_check_target_type():\n    X, _, categorical_features = data_heterogneous_unordered()\n    y = np.linspace(0, 1, 30)\n    smote = SMOTENC(categorical_features=categorical_features, random_state=0)\n    with pytest.raises(ValueError, match=\"Unknown label type\"):\n        smote.fit_resample(X, y)\n    rng = np.random.RandomState(42)\n    y = rng.randint(2, size=(20, 3))\n    msg = \"Multilabel and multioutput targets are not supported.\"\n    with pytest.raises(ValueError, match=msg):\n        smote.fit_resample(X, y)\n\n\ndef test_smotenc_samplers_one_label():\n    X, _, categorical_features = data_heterogneous_unordered()\n    y = np.zeros(30)\n    smote = SMOTENC(categorical_features=categorical_features, random_state=0)\n    with pytest.raises(ValueError, match=\"needs to have more than 1 class\"):\n        smote.fit(X, y)\n\n\ndef test_smotenc_fit():\n    X, y, categorical_features = data_heterogneous_unordered()\n    smote = SMOTENC(categorical_features=categorical_features, random_state=0)\n    smote.fit_resample(X, y)\n    assert hasattr(\n        smote, \"sampling_strategy_\"\n    ), \"No fitted attribute sampling_strategy_\"\n\n\ndef test_smotenc_fit_resample():\n    X, y, categorical_features = data_heterogneous_unordered()\n    target_stats = Counter(y)\n    smote = SMOTENC(categorical_features=categorical_features, random_state=0)\n    _, y_res = 
smote.fit_resample(X, y)\n    _ = Counter(y_res)\n    n_samples = max(target_stats.values())\n    assert all(value >= n_samples for value in Counter(y_res).values())\n\n\ndef test_smotenc_fit_resample_sampling_strategy():\n    X, y, categorical_features = data_heterogneous_unordered_multiclass()\n    expected_stat = Counter(y)[1]\n    smote = SMOTENC(categorical_features=categorical_features, random_state=0)\n    sampling_strategy = {2: 25, 0: 25}\n    smote.set_params(sampling_strategy=sampling_strategy)\n    X_res, y_res = smote.fit_resample(X, y)\n    assert Counter(y_res)[1] == expected_stat\n\n\ndef test_smotenc_pandas():\n    pd = pytest.importorskip(\"pandas\")\n    # Check that the samplers handle pandas dataframe and pandas series\n    X, y, categorical_features = data_heterogneous_unordered_multiclass()\n    X_pd = pd.DataFrame(X)\n    smote = SMOTENC(categorical_features=categorical_features, random_state=0)\n    X_res_pd, y_res_pd = smote.fit_resample(X_pd, y)\n    X_res, y_res = smote.fit_resample(X, y)\n    assert_array_equal(X_res_pd.to_numpy(), X_res)\n    assert_allclose(y_res_pd, y_res)\n    assert set(smote.median_std_.keys()) == {0, 1}\n\n\ndef test_smotenc_preserve_dtype():\n    X, y = make_classification(\n        n_samples=50,\n        n_classes=3,\n        n_informative=4,\n        weights=[0.2, 0.3, 0.5],\n        random_state=0,\n    )\n    # Cast X and y to not default dtype\n    X = X.astype(np.float32)\n    y = y.astype(np.int32)\n    smote = SMOTENC(categorical_features=[1], random_state=0)\n    X_res, y_res = smote.fit_resample(X, y)\n    assert X.dtype == X_res.dtype, \"X dtype is not preserved\"\n    assert y.dtype == y_res.dtype, \"y dtype is not preserved\"\n\n\n@pytest.mark.parametrize(\"categorical_features\", [[True, True, True], [0, 1, 2]])\ndef test_smotenc_raising_error_all_categorical(categorical_features):\n    X, y = make_classification(\n        n_features=3,\n        n_informative=1,\n        n_redundant=1,\n        
n_repeated=0,\n        n_clusters_per_class=1,\n    )\n    smote = SMOTENC(categorical_features=categorical_features)\n    err_msg = \"SMOTE-NC is not designed to work only with categorical features\"\n    with pytest.raises(ValueError, match=err_msg):\n        smote.fit_resample(X, y)\n\n\ndef test_smote_nc_with_null_median_std():\n    # Non-regression test for #662\n    # https://github.com/scikit-learn-contrib/imbalanced-learn/issues/662\n    data = np.array(\n        [\n            [1, 2, 1, \"A\"],\n            [2, 1, 2, \"A\"],\n            [2, 1, 2, \"A\"],\n            [1, 2, 3, \"B\"],\n            [1, 2, 4, \"C\"],\n            [1, 2, 5, \"C\"],\n            [1, 2, 4, \"C\"],\n            [1, 2, 4, \"C\"],\n            [1, 2, 4, \"C\"],\n        ],\n        dtype=\"object\",\n    )\n    labels = np.array(\n        [\n            \"class_1\",\n            \"class_1\",\n            \"class_1\",\n            \"class_1\",\n            \"class_2\",\n            \"class_2\",\n            \"class_3\",\n            \"class_3\",\n            \"class_3\",\n        ],\n        dtype=object,\n    )\n    smote = SMOTENC(categorical_features=[3], k_neighbors=1, random_state=0)\n    X_res, y_res = smote.fit_resample(data, labels)\n    # check that the categorical feature is not random but correspond to the\n    # categories seen in the minority class samples\n    assert_array_equal(X_res[-3:, -1], np.array([\"C\", \"C\", \"C\"], dtype=object))\n    assert smote.median_std_ == {\"class_2\": 0.0, \"class_3\": 0.0}\n\n\ndef test_smotenc_categorical_encoder():\n    \"\"\"Check that we can pass our own categorical encoder.\"\"\"\n\n    X, y, categorical_features = data_heterogneous_unordered()\n    smote = SMOTENC(categorical_features=categorical_features, random_state=0)\n    smote.fit_resample(X, y)\n\n    assert getattr(smote.categorical_encoder_, \"sparse_output\") is True\n\n    encoder = OneHotEncoder(sparse_output=False)\n    
smote.set_params(categorical_encoder=encoder).fit_resample(X, y)\n    assert smote.categorical_encoder is encoder\n    assert smote.categorical_encoder_ is not encoder\n    assert getattr(smote.categorical_encoder_, \"sparse_output\") is False\n\n\ndef test_smotenc_bool_categorical():\n    \"\"\"Check that we don't try to early convert the full input data to numeric when\n    handling a pandas dataframe.\n\n    Non-regression test for:\n    https://github.com/scikit-learn-contrib/imbalanced-learn/issues/974\n    \"\"\"\n    pd = pytest.importorskip(\"pandas\")\n\n    X = pd.DataFrame(\n        {\n            \"c\": pd.Categorical(list(\"abbacaba\" * 3)),\n            \"f\": [0.3, 0.5, 0.1, 0.2] * 6,\n            \"b\": [False, False, True] * 8,\n        }\n    )\n    y = pd.DataFrame({\"out\": [1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0] * 2})\n    smote = SMOTENC(categorical_features=[0])\n\n    X_res, y_res = smote.fit_resample(X, y)\n    pd.testing.assert_series_equal(X_res.dtypes, X.dtypes)\n    assert len(X_res) == len(y_res)\n\n    smote.set_params(categorical_features=[0, 2])\n    X_res, y_res = smote.fit_resample(X, y)\n    pd.testing.assert_series_equal(X_res.dtypes, X.dtypes)\n    assert len(X_res) == len(y_res)\n\n    X = X.astype({\"b\": \"category\"})\n    X_res, y_res = smote.fit_resample(X, y)\n    pd.testing.assert_series_equal(X_res.dtypes, X.dtypes)\n    assert len(X_res) == len(y_res)\n\n\ndef test_smotenc_categorical_features_str():\n    \"\"\"Check that we support array-like of strings for `categorical_features` using\n    pandas dataframe.\n    \"\"\"\n    pd = pytest.importorskip(\"pandas\")\n\n    X = pd.DataFrame(\n        {\n            \"A\": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],\n            \"B\": [\"a\", \"b\"] * 5,\n            \"C\": [\"a\", \"b\", \"c\"] * 3 + [\"a\"],\n        }\n    )\n    X = pd.concat([X] * 10, ignore_index=True)\n    y = np.array([0] * 70 + [1] * 30)\n    smote = SMOTENC(categorical_features=[\"B\", \"C\"], 
random_state=0)\n    X_res, y_res = smote.fit_resample(X, y)\n    assert X_res[\"B\"].isin([\"a\", \"b\"]).all()\n    assert X_res[\"C\"].isin([\"a\", \"b\", \"c\"]).all()\n    counter = Counter(y_res)\n    assert counter[0] == counter[1] == 70\n    assert_array_equal(smote.categorical_features_, [1, 2])\n    assert_array_equal(smote.continuous_features_, [0])\n\n\ndef test_smotenc_categorical_features_auto():\n    \"\"\"Check that we can automatically detect categorical features based on pandas\n    dataframe.\n    \"\"\"\n    pd = pytest.importorskip(\"pandas\")\n\n    X = pd.DataFrame(\n        {\n            \"A\": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],\n            \"B\": [\"a\", \"b\"] * 5,\n            \"C\": [\"a\", \"b\", \"c\"] * 3 + [\"a\"],\n        }\n    )\n    X = pd.concat([X] * 10, ignore_index=True)\n    X[\"B\"] = X[\"B\"].astype(\"category\")\n    X[\"C\"] = X[\"C\"].astype(\"category\")\n    y = np.array([0] * 70 + [1] * 30)\n    smote = SMOTENC(categorical_features=\"auto\", random_state=0)\n    X_res, y_res = smote.fit_resample(X, y)\n    assert X_res[\"B\"].isin([\"a\", \"b\"]).all()\n    assert X_res[\"C\"].isin([\"a\", \"b\", \"c\"]).all()\n    counter = Counter(y_res)\n    assert counter[0] == counter[1] == 70\n    assert_array_equal(smote.categorical_features_, [1, 2])\n    assert_array_equal(smote.continuous_features_, [0])\n\n\ndef test_smote_nc_categorical_features_auto_error():\n    \"\"\"Check that we raise a proper error when we cannot use the `'auto'` mode.\"\"\"\n    pd = pytest.importorskip(\"pandas\")\n\n    X = pd.DataFrame(\n        {\n            \"A\": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],\n            \"B\": [\"a\", \"b\"] * 5,\n            \"C\": [\"a\", \"b\", \"c\"] * 3 + [\"a\"],\n        }\n    )\n    y = np.array([0] * 70 + [1] * 30)\n    smote = SMOTENC(categorical_features=\"auto\", random_state=0)\n\n    with pytest.raises(ValueError, match=\"the input data should be a pandas.DataFrame\"):\n        
smote.fit_resample(X.to_numpy(), y)\n\n    err_msg = \"SMOTE-NC is not designed to work only with numerical features\"\n    with pytest.raises(ValueError, match=err_msg):\n        smote.fit_resample(X, y)\n"
  },
  {
    "path": "imblearn/over_sampling/_smote/tests/test_smoten.py",
    "content": "import numpy as np\nimport pytest\nfrom sklearn.exceptions import DataConversionWarning\nfrom sklearn.preprocessing import OneHotEncoder, OrdinalEncoder\nfrom sklearn.utils._testing import _convert_container\n\nfrom imblearn.over_sampling import SMOTEN\n\n\n@pytest.fixture\ndef data():\n    rng = np.random.RandomState(0)\n\n    feature_1 = [\"A\"] * 10 + [\"B\"] * 20 + [\"C\"] * 30\n    feature_2 = [\"A\"] * 40 + [\"B\"] * 20\n    feature_3 = [\"A\"] * 20 + [\"B\"] * 20 + [\"C\"] * 10 + [\"D\"] * 10\n    X = np.array([feature_1, feature_2, feature_3], dtype=object).T\n    rng.shuffle(X)\n    y = np.array([0] * 20 + [1] * 40, dtype=np.int32)\n    y_labels = np.array([\"not apple\", \"apple\"], dtype=object)\n    y = y_labels[y]\n    return X, y\n\n\ndef test_smoten(data):\n    # overall check for SMOTEN\n    X, y = data\n    sampler = SMOTEN(random_state=0)\n    X_res, y_res = sampler.fit_resample(X, y)\n\n    assert X_res.shape == (80, 3)\n    assert y_res.shape == (80,)\n    assert isinstance(sampler.categorical_encoder_, OrdinalEncoder)\n\n\ndef test_smoten_resampling():\n    # check if the SMOTEN resample data as expected\n    # we generate data such that \"not apple\" will be the minority class and\n    # samples from this class will be generated. We will force the \"blue\"\n    # category to be associated with this class. 
Therefore, the new generated\n    # samples should as well be from the \"blue\" category.\n    X = np.array([\"green\"] * 5 + [\"red\"] * 10 + [\"blue\"] * 7, dtype=object).reshape(\n        -1, 1\n    )\n    y = np.array(\n        [\"apple\"] * 5\n        + [\"not apple\"] * 3\n        + [\"apple\"] * 7\n        + [\"not apple\"] * 5\n        + [\"apple\"] * 2,\n        dtype=object,\n    )\n    sampler = SMOTEN(random_state=0)\n    X_res, y_res = sampler.fit_resample(X, y)\n\n    X_generated, y_generated = X_res[X.shape[0] :], y_res[X.shape[0] :]\n    np.testing.assert_array_equal(X_generated, \"blue\")\n    np.testing.assert_array_equal(y_generated, \"not apple\")\n\n\n@pytest.mark.parametrize(\"sparse_format\", [\"sparse_csr\", \"sparse_csc\"])\ndef test_smoten_sparse_input(data, sparse_format):\n    \"\"\"Check that we handle sparse input in SMOTEN even if it is not efficient.\n\n    Non-regression test for:\n    https://github.com/scikit-learn-contrib/imbalanced-learn/issues/971\n    \"\"\"\n    X, y = data\n    X = OneHotEncoder().fit_transform(X).toarray()\n    X = _convert_container(X, sparse_format)\n\n    with pytest.warns(DataConversionWarning, match=\"is not really efficient\"):\n        X_res, y_res = SMOTEN(random_state=0).fit_resample(X, y)\n\n    assert X_res.format == X.format\n    assert X_res.shape[0] == len(y_res)\n\n\ndef test_smoten_categorical_encoder(data):\n    \"\"\"Check that `categorical_encoder` is used when provided.\"\"\"\n\n    X, y = data\n    sampler = SMOTEN(random_state=0)\n    sampler.fit_resample(X, y)\n\n    assert isinstance(sampler.categorical_encoder_, OrdinalEncoder)\n    assert sampler.categorical_encoder_.dtype == np.int32\n\n    encoder = OrdinalEncoder(dtype=np.int64)\n    sampler.set_params(categorical_encoder=encoder).fit_resample(X, y)\n\n    assert isinstance(sampler.categorical_encoder_, OrdinalEncoder)\n    assert sampler.categorical_encoder is encoder\n    assert sampler.categorical_encoder_ is not encoder\n    
assert sampler.categorical_encoder_.dtype == np.int64\n"
  },
  {
    "path": "imblearn/over_sampling/_smote/tests/test_svm_smote.py",
    "content": "import numpy as np\nimport pytest\nfrom sklearn.datasets import make_classification\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.neighbors import NearestNeighbors\nfrom sklearn.svm import SVC\nfrom sklearn.utils._testing import assert_allclose, assert_array_equal\n\nfrom imblearn.over_sampling import SVMSMOTE\n\n\n@pytest.fixture\ndef data():\n    X = np.array(\n        [\n            [0.11622591, -0.0317206],\n            [0.77481731, 0.60935141],\n            [1.25192108, -0.22367336],\n            [0.53366841, -0.30312976],\n            [1.52091956, -0.49283504],\n            [-0.28162401, -2.10400981],\n            [0.83680821, 1.72827342],\n            [0.3084254, 0.33299982],\n            [0.70472253, -0.73309052],\n            [0.28893132, -0.38761769],\n            [1.15514042, 0.0129463],\n            [0.88407872, 0.35454207],\n            [1.31301027, -0.92648734],\n            [-1.11515198, -0.93689695],\n            [-0.18410027, -0.45194484],\n            [0.9281014, 0.53085498],\n            [-0.14374509, 0.27370049],\n            [-0.41635887, -0.38299653],\n            [0.08711622, 0.93259929],\n            [1.70580611, -0.11219234],\n        ]\n    )\n    y = np.array([0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0])\n    return X, y\n\n\ndef test_svm_smote(data):\n    svm_smote = SVMSMOTE(random_state=42)\n    svm_smote_nn = SVMSMOTE(\n        random_state=42,\n        k_neighbors=NearestNeighbors(n_neighbors=6),\n        m_neighbors=NearestNeighbors(n_neighbors=11),\n        svm_estimator=SVC(gamma=\"scale\", random_state=42),\n    )\n\n    X_res_1, y_res_1 = svm_smote.fit_resample(*data)\n    X_res_2, y_res_2 = svm_smote_nn.fit_resample(*data)\n\n    assert_allclose(X_res_1, X_res_2)\n    assert_array_equal(y_res_1, y_res_2)\n\n\ndef test_svm_smote_not_svm(data):\n    \"\"\"Check that we raise a proper error if passing an estimator that does not\n    expose a `support_` fitted 
attribute.\"\"\"\n\n    err_msg = \"`svm_estimator` is required to exposed a `support_` fitted attribute.\"\n    with pytest.raises(RuntimeError, match=err_msg):\n        SVMSMOTE(svm_estimator=LogisticRegression()).fit_resample(*data)\n\n\ndef test_svm_smote_all_noise(data):\n    \"\"\"Check that we raise a proper error message when all support vectors are\n    detected as noise and there is nothing that we can do.\n\n    Non-regression test for:\n    https://github.com/scikit-learn-contrib/imbalanced-learn/issues/742\n    \"\"\"\n    X, y = make_classification(\n        n_classes=3,\n        class_sep=0.001,\n        weights=[0.004, 0.451, 0.545],\n        n_informative=3,\n        n_redundant=0,\n        flip_y=0,\n        n_features=3,\n        n_clusters_per_class=2,\n        n_samples=1000,\n        random_state=10,\n    )\n\n    with pytest.raises(ValueError, match=\"SVM-SMOTE is not adapted to your dataset\"):\n        SVMSMOTE(k_neighbors=4, random_state=42).fit_resample(X, y)\n"
  },
  {
    "path": "imblearn/over_sampling/base.py",
    "content": "\"\"\"\nBase class for the over-sampling method.\n\"\"\"\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Christos Aridas\n# License: MIT\n\nimport numbers\nfrom collections.abc import Mapping\n\nfrom sklearn.utils._param_validation import Interval, StrOptions\n\nfrom imblearn.base import BaseSampler\n\n\nclass BaseOverSampler(BaseSampler):\n    \"\"\"Base class for over-sampling algorithms.\n\n    Warning: This class should not be used directly. Use the derive classes\n    instead.\n    \"\"\"\n\n    _sampling_type = \"over-sampling\"\n\n    _sampling_strategy_docstring = (\n        \"\"\"sampling_strategy : float, str, dict or callable, default='auto'\n        Sampling information to resample the data set.\n\n        - When ``float``, it corresponds to the desired ratio of the number of\n          samples in the minority class over the number of samples in the\n          majority class after resampling. Therefore, the ratio is expressed as\n          :math:`\\\\alpha_{os} = N_{rm} / N_{M}` where :math:`N_{rm}` is the\n          number of samples in the minority class after resampling and\n          :math:`N_{M}` is the number of samples in the majority class.\n\n            .. warning::\n               ``float`` is only available for **binary** classification. An\n               error is raised for multi-class classification.\n\n        - When ``str``, specify the class targeted by the resampling. The\n          number of samples in the different classes will be equalized.\n          Possible choices are:\n\n            ``'minority'``: resample only the minority class;\n\n            ``'not minority'``: resample all classes but the minority class;\n\n            ``'not majority'``: resample all classes but the majority class;\n\n            ``'all'``: resample all classes;\n\n            ``'auto'``: equivalent to ``'not majority'``.\n\n        - When ``dict``, the keys correspond to the targeted classes. 
The\n          values correspond to the desired number of samples for each targeted\n          class.\n\n        - When callable, a function taking ``y`` and returning a ``dict``. The keys\n          correspond to the targeted classes. The values correspond to the\n          desired number of samples for each class.\n        \"\"\".strip()\n    )  # noqa: E501\n\n    _parameter_constraints: dict = {\n        \"sampling_strategy\": [\n            Interval(numbers.Real, 0, 1, closed=\"right\"),\n            StrOptions({\"auto\", \"minority\", \"not minority\", \"not majority\", \"all\"}),\n            Mapping,\n            callable,\n        ],\n        \"random_state\": [\"random_state\"],\n    }\n"
  },
  {
    "path": "imblearn/over_sampling/tests/__init__.py",
    "content": ""
  },
  {
    "path": "imblearn/over_sampling/tests/test_adasyn.py",
    "content": "\"\"\"Test the module under sampler.\"\"\"\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Christos Aridas\n# License: MIT\n\nimport numpy as np\nfrom sklearn.neighbors import NearestNeighbors\nfrom sklearn.utils._testing import assert_allclose, assert_array_equal\n\nfrom imblearn.over_sampling import ADASYN\n\nRND_SEED = 0\nX = np.array(\n    [\n        [0.11622591, -0.0317206],\n        [0.77481731, 0.60935141],\n        [1.25192108, -0.22367336],\n        [0.53366841, -0.30312976],\n        [1.52091956, -0.49283504],\n        [-0.28162401, -2.10400981],\n        [0.83680821, 1.72827342],\n        [0.3084254, 0.33299982],\n        [0.70472253, -0.73309052],\n        [0.28893132, -0.38761769],\n        [1.15514042, 0.0129463],\n        [0.88407872, 0.35454207],\n        [1.31301027, -0.92648734],\n        [-1.11515198, -0.93689695],\n        [-0.18410027, -0.45194484],\n        [0.9281014, 0.53085498],\n        [-0.14374509, 0.27370049],\n        [-0.41635887, -0.38299653],\n        [0.08711622, 0.93259929],\n        [1.70580611, -0.11219234],\n    ]\n)\nY = np.array([0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0])\nR_TOL = 1e-4\n\n\ndef test_ada_init():\n    sampling_strategy = \"auto\"\n    ada = ADASYN(sampling_strategy=sampling_strategy, random_state=RND_SEED)\n    assert ada.random_state == RND_SEED\n\n\ndef test_ada_fit_resample():\n    ada = ADASYN(random_state=RND_SEED)\n    X_resampled, y_resampled = ada.fit_resample(X, Y)\n    X_gt = np.array(\n        [\n            [0.11622591, -0.0317206],\n            [0.77481731, 0.60935141],\n            [1.25192108, -0.22367336],\n            [0.53366841, -0.30312976],\n            [1.52091956, -0.49283504],\n            [-0.28162401, -2.10400981],\n            [0.83680821, 1.72827342],\n            [0.3084254, 0.33299982],\n            [0.70472253, -0.73309052],\n            [0.28893132, -0.38761769],\n            [1.15514042, 0.0129463],\n            [0.88407872, 
0.35454207],\n            [1.31301027, -0.92648734],\n            [-1.11515198, -0.93689695],\n            [-0.18410027, -0.45194484],\n            [0.9281014, 0.53085498],\n            [-0.14374509, 0.27370049],\n            [-0.41635887, -0.38299653],\n            [0.08711622, 0.93259929],\n            [1.70580611, -0.11219234],\n            [0.88161986, -0.2829741],\n            [0.35681689, -0.18814597],\n            [1.4148276, 0.05308106],\n            [0.3136591, -0.31327875],\n        ]\n    )\n    y_gt = np.array(\n        [0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0]\n    )\n    assert_allclose(X_resampled, X_gt, rtol=R_TOL)\n    assert_array_equal(y_resampled, y_gt)\n\n\ndef test_ada_fit_resample_nn_obj():\n    nn = NearestNeighbors(n_neighbors=6)\n    ada = ADASYN(random_state=RND_SEED, n_neighbors=nn)\n    X_resampled, y_resampled = ada.fit_resample(X, Y)\n    X_gt = np.array(\n        [\n            [0.11622591, -0.0317206],\n            [0.77481731, 0.60935141],\n            [1.25192108, -0.22367336],\n            [0.53366841, -0.30312976],\n            [1.52091956, -0.49283504],\n            [-0.28162401, -2.10400981],\n            [0.83680821, 1.72827342],\n            [0.3084254, 0.33299982],\n            [0.70472253, -0.73309052],\n            [0.28893132, -0.38761769],\n            [1.15514042, 0.0129463],\n            [0.88407872, 0.35454207],\n            [1.31301027, -0.92648734],\n            [-1.11515198, -0.93689695],\n            [-0.18410027, -0.45194484],\n            [0.9281014, 0.53085498],\n            [-0.14374509, 0.27370049],\n            [-0.41635887, -0.38299653],\n            [0.08711622, 0.93259929],\n            [1.70580611, -0.11219234],\n            [0.88161986, -0.2829741],\n            [0.35681689, -0.18814597],\n            [1.4148276, 0.05308106],\n            [0.3136591, -0.31327875],\n        ]\n    )\n    y_gt = np.array(\n        [0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 
0, 0, 0, 0]\n    )\n    assert_allclose(X_resampled, X_gt, rtol=R_TOL)\n    assert_array_equal(y_resampled, y_gt)\n"
  },
  {
    "path": "imblearn/over_sampling/tests/test_common.py",
    "content": "from collections import Counter\n\nimport numpy as np\nimport pytest\nfrom sklearn.cluster import MiniBatchKMeans\n\nfrom imblearn.over_sampling import (\n    ADASYN,\n    SMOTE,\n    SMOTEN,\n    SMOTENC,\n    SVMSMOTE,\n    BorderlineSMOTE,\n    KMeansSMOTE,\n)\nfrom imblearn.utils.testing import _CustomNearestNeighbors\n\n\n@pytest.fixture\ndef numerical_data():\n    rng = np.random.RandomState(0)\n    X = rng.randn(100, 2)\n    y = np.repeat([0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0], 5)\n\n    return X, y\n\n\n@pytest.fixture\ndef categorical_data():\n    rng = np.random.RandomState(0)\n\n    feature_1 = [\"A\"] * 10 + [\"B\"] * 20 + [\"C\"] * 30\n    feature_2 = [\"A\"] * 40 + [\"B\"] * 20\n    feature_3 = [\"A\"] * 20 + [\"B\"] * 20 + [\"C\"] * 10 + [\"D\"] * 10\n    X = np.array([feature_1, feature_2, feature_3], dtype=object).T\n    rng.shuffle(X)\n    y = np.array([0] * 20 + [1] * 40, dtype=np.int32)\n    y_labels = np.array([\"not apple\", \"apple\"], dtype=object)\n    y = y_labels[y]\n    return X, y\n\n\n@pytest.fixture\ndef heterogeneous_data():\n    rng = np.random.RandomState(42)\n    X = np.empty((30, 4), dtype=object)\n    X[:, :2] = rng.randn(30, 2)\n    X[:, 2] = rng.choice([\"a\", \"b\", \"c\"], size=30).astype(object)\n    X[:, 3] = rng.randint(3, size=30)\n    y = np.array([0] * 10 + [1] * 20)\n    return X, y, [2, 3]\n\n\n@pytest.mark.parametrize(\n    \"smote\", [BorderlineSMOTE(), SVMSMOTE()], ids=[\"borderline\", \"svm\"]\n)\ndef test_smote_m_neighbors(numerical_data, smote):\n    # check that m_neighbors is properly set. 
Regression test for:\n    # https://github.com/scikit-learn-contrib/imbalanced-learn/issues/568\n    X, y = numerical_data\n    _ = smote.fit_resample(X, y)\n    assert smote.nn_k_.n_neighbors == 6\n    assert smote.nn_m_.n_neighbors == 11\n\n\n@pytest.mark.parametrize(\n    \"smote, neighbor_estimator_name\",\n    [\n        (ADASYN(random_state=0), \"n_neighbors\"),\n        (BorderlineSMOTE(random_state=0), \"k_neighbors\"),\n        (\n            KMeansSMOTE(\n                kmeans_estimator=MiniBatchKMeans(n_init=1, random_state=0),\n                random_state=1,\n            ),\n            \"k_neighbors\",\n        ),\n        (SMOTE(random_state=0), \"k_neighbors\"),\n        (SVMSMOTE(random_state=0), \"k_neighbors\"),\n    ],\n    ids=[\"adasyn\", \"borderline\", \"kmeans\", \"smote\", \"svm\"],\n)\ndef test_numerical_smote_custom_nn(numerical_data, smote, neighbor_estimator_name):\n    X, y = numerical_data\n    params = {\n        neighbor_estimator_name: _CustomNearestNeighbors(n_neighbors=5),\n    }\n    smote.set_params(**params)\n    X_res, _ = smote.fit_resample(X, y)\n\n    assert X_res.shape[0] >= 120\n\n\ndef test_categorical_smote_k_custom_nn(categorical_data):\n    X, y = categorical_data\n    smote = SMOTEN(k_neighbors=_CustomNearestNeighbors(n_neighbors=5))\n    X_res, y_res = smote.fit_resample(X, y)\n\n    assert X_res.shape == (80, 3)\n    assert Counter(y_res) == {\"apple\": 40, \"not apple\": 40}\n\n\ndef test_heterogeneous_smote_k_custom_nn(heterogeneous_data):\n    X, y, categorical_features = heterogeneous_data\n    smote = SMOTENC(\n        categorical_features, k_neighbors=_CustomNearestNeighbors(n_neighbors=5)\n    )\n    X_res, y_res = smote.fit_resample(X, y)\n\n    assert X_res.shape == (40, 4)\n    assert Counter(y_res) == {0: 20, 1: 20}\n\n\n@pytest.mark.parametrize(\n    \"smote\",\n    [BorderlineSMOTE(random_state=0), SVMSMOTE(random_state=0)],\n    ids=[\"borderline\", \"svm\"],\n)\ndef 
test_numerical_smote_extra_custom_nn(numerical_data, smote):\n    X, y = numerical_data\n    smote.set_params(m_neighbors=_CustomNearestNeighbors(n_neighbors=5))\n    X_res, y_res = smote.fit_resample(X, y)\n\n    assert X_res.shape == (120, 2)\n    assert Counter(y_res) == {0: 60, 1: 60}\n"
  },
  {
    "path": "imblearn/over_sampling/tests/test_random_over_sampler.py",
    "content": "\"\"\"Test the module under sampler.\"\"\"\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Christos Aridas\n# License: MIT\n\nfrom collections import Counter\nfrom datetime import datetime\n\nimport numpy as np\nimport pytest\nfrom sklearn.datasets import make_classification\nfrom sklearn.utils._testing import (\n    _convert_container,\n    assert_allclose,\n    assert_array_equal,\n)\n\nfrom imblearn.over_sampling import RandomOverSampler\n\nRND_SEED = 0\n\n\n@pytest.fixture\ndef data():\n    X = np.array(\n        [\n            [0.04352327, -0.20515826],\n            [0.92923648, 0.76103773],\n            [0.20792588, 1.49407907],\n            [0.47104475, 0.44386323],\n            [0.22950086, 0.33367433],\n            [0.15490546, 0.3130677],\n            [0.09125309, -0.85409574],\n            [0.12372842, 0.6536186],\n            [0.13347175, 0.12167502],\n            [0.094035, -2.55298982],\n        ]\n    )\n    Y = np.array([1, 0, 1, 0, 1, 1, 1, 1, 0, 1])\n    return X, Y\n\n\ndef test_ros_init():\n    sampling_strategy = \"auto\"\n    ros = RandomOverSampler(sampling_strategy=sampling_strategy, random_state=RND_SEED)\n    assert ros.random_state == RND_SEED\n\n\n@pytest.mark.parametrize(\n    \"params\", [{\"shrinkage\": None}, {\"shrinkage\": 0}, {\"shrinkage\": {0: 0}}]\n)\n@pytest.mark.parametrize(\"X_type\", [\"array\", \"dataframe\"])\ndef test_ros_fit_resample(X_type, data, params):\n    X, Y = data\n    X_ = _convert_container(X, X_type)\n    ros = RandomOverSampler(**params, random_state=RND_SEED)\n    X_resampled, y_resampled = ros.fit_resample(X_, Y)\n    X_gt = np.array(\n        [\n            [0.04352327, -0.20515826],\n            [0.92923648, 0.76103773],\n            [0.20792588, 1.49407907],\n            [0.47104475, 0.44386323],\n            [0.22950086, 0.33367433],\n            [0.15490546, 0.3130677],\n            [0.09125309, -0.85409574],\n            [0.12372842, 0.6536186],\n            
[0.13347175, 0.12167502],\n            [0.094035, -2.55298982],\n            [0.92923648, 0.76103773],\n            [0.47104475, 0.44386323],\n            [0.92923648, 0.76103773],\n            [0.47104475, 0.44386323],\n        ]\n    )\n    y_gt = np.array([1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0])\n\n    if X_type == \"dataframe\":\n        assert hasattr(X_resampled, \"loc\")\n        X_resampled = X_resampled.to_numpy()\n\n    assert_allclose(X_resampled, X_gt)\n    assert_array_equal(y_resampled, y_gt)\n\n    if params[\"shrinkage\"] is None:\n        assert ros.shrinkage_ is None\n    else:\n        assert ros.shrinkage_ == {0: 0}\n\n\n@pytest.mark.parametrize(\"params\", [{\"shrinkage\": None}, {\"shrinkage\": 0}])\ndef test_ros_fit_resample_half(data, params):\n    X, Y = data\n    sampling_strategy = {0: 3, 1: 7}\n    ros = RandomOverSampler(\n        **params, sampling_strategy=sampling_strategy, random_state=RND_SEED\n    )\n    X_resampled, y_resampled = ros.fit_resample(X, Y)\n    X_gt = np.array(\n        [\n            [0.04352327, -0.20515826],\n            [0.92923648, 0.76103773],\n            [0.20792588, 1.49407907],\n            [0.47104475, 0.44386323],\n            [0.22950086, 0.33367433],\n            [0.15490546, 0.3130677],\n            [0.09125309, -0.85409574],\n            [0.12372842, 0.6536186],\n            [0.13347175, 0.12167502],\n            [0.094035, -2.55298982],\n        ]\n    )\n    y_gt = np.array([1, 0, 1, 0, 1, 1, 1, 1, 0, 1])\n    assert_allclose(X_resampled, X_gt)\n    assert_array_equal(y_resampled, y_gt)\n\n    if params[\"shrinkage\"] is None:\n        assert ros.shrinkage_ is None\n    else:\n        assert ros.shrinkage_ == {0: 0, 1: 0}\n\n\n@pytest.mark.parametrize(\"params\", [{\"shrinkage\": None}, {\"shrinkage\": 0}])\ndef test_multiclass_fit_resample(data, params):\n    # check the random over-sampling with a multiclass problem\n    X, Y = data\n    y = Y.copy()\n    y[5] = 2\n    y[6] = 2\n    ros = 
RandomOverSampler(**params, random_state=RND_SEED)\n    X_resampled, y_resampled = ros.fit_resample(X, y)\n    count_y_res = Counter(y_resampled)\n    assert count_y_res[0] == 5\n    assert count_y_res[1] == 5\n    assert count_y_res[2] == 5\n\n    if params[\"shrinkage\"] is None:\n        assert ros.shrinkage_ is None\n    else:\n        assert ros.shrinkage_ == {0: 0, 2: 0}\n\n\ndef test_random_over_sampling_heterogeneous_data():\n    # check that resampling with heterogeneous dtype is working with basic\n    # resampling\n    X_hetero = np.array(\n        [[\"xxx\", 1, 1.0], [\"yyy\", 2, 2.0], [\"zzz\", 3, 3.0]], dtype=object\n    )\n    y = np.array([0, 0, 1])\n    ros = RandomOverSampler(random_state=RND_SEED)\n    X_res, y_res = ros.fit_resample(X_hetero, y)\n\n    assert X_res.shape[0] == 4\n    assert y_res.shape[0] == 4\n    assert X_res.dtype == object\n    assert X_res[-1, 0] in X_hetero[:, 0]\n\n\ndef test_random_over_sampling_nan_inf(data):\n    # check that we can oversample even with missing or infinite data\n    # regression tests for #605\n    X, Y = data\n    rng = np.random.RandomState(42)\n    n_not_finite = X.shape[0] // 3\n    row_indices = rng.choice(np.arange(X.shape[0]), size=n_not_finite)\n    col_indices = rng.randint(0, X.shape[1], size=n_not_finite)\n    not_finite_values = rng.choice([np.nan, np.inf], size=n_not_finite)\n\n    X_ = X.copy()\n    X_[row_indices, col_indices] = not_finite_values\n\n    ros = RandomOverSampler(random_state=0)\n    X_res, y_res = ros.fit_resample(X_, Y)\n\n    assert y_res.shape == (14,)\n    assert X_res.shape == (14, 2)\n    assert np.any(~np.isfinite(X_res))\n\n\ndef test_random_over_sampling_heterogeneous_data_smoothed_bootstrap():\n    # check that we raise an error when heterogeneous dtype data are given\n    # and a smoothed bootstrap is requested\n    X_hetero = np.array(\n        [[\"xxx\", 1, 1.0], [\"yyy\", 2, 2.0], [\"zzz\", 3, 3.0]], dtype=object\n    )\n    y = np.array([0, 0, 1])\n    ros = 
RandomOverSampler(shrinkage=1, random_state=RND_SEED)\n    err_msg = \"When shrinkage is not None, X needs to contain only numerical\"\n    with pytest.raises(ValueError, match=err_msg):\n        ros.fit_resample(X_hetero, y)\n\n\n@pytest.mark.parametrize(\"X_type\", [\"dataframe\", \"array\", \"sparse_csr\", \"sparse_csc\"])\ndef test_random_over_sampler_smoothed_bootstrap(X_type, data):\n    # check that smoothed bootstrap is working for numerical array\n    X, y = data\n    sampler = RandomOverSampler(shrinkage=1)\n    X = _convert_container(X, X_type)\n    X_res, y_res = sampler.fit_resample(X, y)\n\n    assert y_res.shape == (14,)\n    assert X_res.shape == (14, 2)\n\n    if X_type == \"dataframe\":\n        assert hasattr(X_res, \"loc\")\n\n\ndef test_random_over_sampler_equivalence_shrinkage(data):\n    # check that a shrinkage factor of 0 is equivalent to not creating a smoothed\n    # bootstrap\n    X, y = data\n\n    ros_not_shrink = RandomOverSampler(shrinkage=0, random_state=0)\n    ros_hard_bootstrap = RandomOverSampler(shrinkage=None, random_state=0)\n\n    X_res_not_shrink, y_res_not_shrink = ros_not_shrink.fit_resample(X, y)\n    X_res, y_res = ros_hard_bootstrap.fit_resample(X, y)\n\n    assert_allclose(X_res_not_shrink, X_res)\n    assert_allclose(y_res_not_shrink, y_res)\n\n    assert y_res.shape == (14,)\n    assert X_res.shape == (14, 2)\n    assert y_res_not_shrink.shape == (14,)\n    assert X_res_not_shrink.shape == (14, 2)\n\n\ndef test_random_over_sampler_shrinkage_behaviour(data):\n    # check the behaviour of the shrinkage parameter\n    # the covariance of the data generated with the larger shrinkage factor\n    # should also be larger.\n    X, y = data\n\n    ros = RandomOverSampler(shrinkage=1, random_state=0)\n    X_res_shrink_1, y_res_shrink_1 = ros.fit_resample(X, y)\n\n    ros.set_params(shrinkage=5)\n    X_res_shrink_5, y_res_shrink_5 = ros.fit_resample(X, y)\n\n    dispersion_shrink_1 = 
np.linalg.det(np.cov(X_res_shrink_1[y_res_shrink_1 == 0].T))\n    dispersion_shrink_5 = np.linalg.det(np.cov(X_res_shrink_5[y_res_shrink_5 == 0].T))\n\n    assert dispersion_shrink_1 < dispersion_shrink_5\n\n\n@pytest.mark.parametrize(\n    \"shrinkage, err_msg\",\n    [\n        ({}, \"`shrinkage` should contain a shrinkage factor for each class\"),\n        ({0: -1}, \"The shrinkage factor needs to be >= 0\"),\n    ],\n)\ndef test_random_over_sampler_shrinkage_error(data, shrinkage, err_msg):\n    # check the validation of the shrinkage parameter\n    X, y = data\n    ros = RandomOverSampler(shrinkage=shrinkage)\n    with pytest.raises(ValueError, match=err_msg):\n        ros.fit_resample(X, y)\n\n\n@pytest.mark.parametrize(\n    \"sampling_strategy\", [\"auto\", \"minority\", \"not minority\", \"not majority\", \"all\"]\n)\ndef test_random_over_sampler_strings(sampling_strategy):\n    \"\"\"Check that we support all supported strings as `sampling_strategy` in\n    a sampler inheriting from `BaseOverSampler`.\"\"\"\n\n    X, y = make_classification(\n        n_samples=100,\n        n_clusters_per_class=1,\n        n_classes=3,\n        weights=[0.1, 0.3, 0.6],\n        random_state=0,\n    )\n    RandomOverSampler(sampling_strategy=sampling_strategy).fit_resample(X, y)\n\n\ndef test_random_over_sampling_datetime():\n    \"\"\"Check that we don't convert input data and only sample from it.\"\"\"\n    pd = pytest.importorskip(\"pandas\")\n    X = pd.DataFrame({\"label\": [0, 0, 0, 1], \"td\": [datetime.now()] * 4})\n    y = X[\"label\"]\n    ros = RandomOverSampler(random_state=0)\n    X_res, y_res = ros.fit_resample(X, y)\n\n    pd.testing.assert_series_equal(X_res.dtypes, X.dtypes)\n    pd.testing.assert_index_equal(X_res.index, y_res.index)\n    assert_array_equal(y_res.to_numpy(), np.array([0, 0, 0, 1, 1, 1]))\n\n\ndef test_random_over_sampler_full_nat():\n    \"\"\"Check that we can return timedelta columns full of NaT.\n\n    Non-regression test for:\n    
https://github.com/scikit-learn-contrib/imbalanced-learn/issues/1055\n    \"\"\"\n    pd = pytest.importorskip(\"pandas\")\n\n    X = pd.DataFrame(\n        {\n            \"col_str\": [\"abc\", \"def\", \"xyz\"],\n            \"col_timedelta\": pd.to_timedelta([np.nan, np.nan, np.nan]),\n        }\n    )\n    y = np.array([0, 0, 1])\n\n    X_res, y_res = RandomOverSampler().fit_resample(X, y)\n    assert X_res.shape == (4, 2)\n    assert y_res.shape == (4,)\n\n    assert X_res[\"col_timedelta\"].dtype == \"timedelta64[ns]\"\n"
  },
  {
    "path": "imblearn/pipeline.py",
    "content": "\"\"\"\nThe :mod:`imblearn.pipeline` module implements utilities to build a\ncomposite estimator, as a chain of transforms, samples and estimators.\n\"\"\"\n\n# Adapted from scikit-learn\n\n# Author: Edouard Duchesnay\n#         Gael Varoquaux\n#         Virgile Fritsch\n#         Alexandre Gramfort\n#         Lars Buitinck\n#         Christos Aridas\n#         Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# License: BSD\nimport warnings\nfrom contextlib import contextmanager\nfrom copy import deepcopy\n\nfrom sklearn import pipeline\nfrom sklearn.base import clone\nfrom sklearn.exceptions import NotFittedError\nfrom sklearn.utils import Bunch\nfrom sklearn.utils._param_validation import HasMethods\nfrom sklearn.utils.fixes import parse_version\nfrom sklearn.utils.metadata_routing import (\n    MetadataRouter,\n    MethodMapping,\n    _routing_enabled,\n    get_routing_for_object,\n)\nfrom sklearn.utils.metaestimators import available_if\nfrom sklearn.utils.validation import check_is_fitted, check_memory\nfrom sklearn_compat._sklearn_compat import sklearn_version\nfrom sklearn_compat.base import _fit_context\nfrom sklearn_compat.utils._param_validation import validate_params\nfrom sklearn_compat.utils._user_interface import _print_elapsed_time\nfrom sklearn_compat.utils.metadata_routing import _raise_for_params, process_routing\n\nfrom imblearn.base import METHODS\nfrom imblearn.utils._tags import get_tags\n\n__all__ = [\"Pipeline\", \"make_pipeline\"]\n\n\n@contextmanager\ndef _raise_or_warn_if_not_fitted(estimator):\n    \"\"\"A context manager to make sure a NotFittedError is raised, if a sub-estimator\n    raises the error.\n    Otherwise, we raise a warning if the pipeline is not fitted, with the deprecation.\n    TODO(0.15): remove this context manager and replace with check_is_fitted.\n    \"\"\"\n    try:\n        yield\n    except NotFittedError as exc:\n        raise NotFittedError(\"Pipeline is not fitted yet.\") from exc\n\n    # we only 
get here if the above didn't raise\n    try:\n        check_is_fitted(estimator)\n    except NotFittedError:\n        warnings.warn(\n            (\n                \"This Pipeline instance is not fitted yet. Call 'fit' with \"\n                \"appropriate arguments before using other methods such as transform, \"\n                \"predict, etc. This will raise an error in 0.15 instead of the current \"\n                \"warning.\"\n            ),\n            FutureWarning,\n        )\n\n\ndef _cached_transform(\n    sub_pipeline, *, cache, param_name, param_value, transform_params\n):\n    \"\"\"Transform a parameter value using a sub-pipeline and cache the result.\n\n    Parameters\n    ----------\n    sub_pipeline : Pipeline\n        The sub-pipeline to be used for transformation.\n    cache : dict\n        The cache dictionary to store the transformed values.\n    param_name : str\n        The name of the parameter to be transformed.\n    param_value : object\n        The value of the parameter to be transformed.\n    transform_params : dict\n        The metadata to be used for transformation. This is passed to the\n        `transform` method of the sub-pipeline.\n\n    Returns\n    -------\n    transformed_value : object\n        The transformed value of the parameter.\n    \"\"\"\n    if param_name not in cache:\n        # If the parameter is a tuple, transform each element of the\n        # tuple. 
This is needed to support the pattern present in\n        # `lightgbm` and `xgboost` where users can pass multiple\n        # validation sets.\n        if isinstance(param_value, tuple):\n            cache[param_name] = tuple(\n                sub_pipeline.transform(element, **transform_params)\n                for element in param_value\n            )\n        else:\n            cache[param_name] = sub_pipeline.transform(param_value, **transform_params)\n\n    return cache[param_name]\n\n\nclass Pipeline(pipeline.Pipeline):\n    \"\"\"Pipeline of transforms and resamples with a final estimator.\n\n    Sequentially apply a list of transforms, sampling, and a final estimator.\n    Intermediate steps of the pipeline must be transformers or resamplers,\n    that is, they must implement fit, transform and sample methods.\n    The samplers are only applied during fit.\n    The final estimator only needs to implement fit.\n    The transformers and samplers in the pipeline can be cached using\n    ``memory`` argument.\n\n    The purpose of the pipeline is to assemble several steps that can be\n    cross-validated together while setting different parameters.\n    For this, it enables setting parameters of the various steps using their\n    names and the parameter name separated by a '__', as in the example below.\n    A step's estimator may be replaced entirely by setting the parameter\n    with its name to another estimator, or a transformer removed by setting\n    it to 'passthrough' or ``None``.\n\n    Parameters\n    ----------\n    steps : list\n        List of (name, transform) tuples (implementing\n        fit/transform/fit_resample) that are chained, in the order in which\n        they are chained, with the last object an estimator.\n\n    transform_input : list of str, default=None\n        The names of the :term:`metadata` parameters that should be transformed by the\n        pipeline before passing it to the step consuming it.\n\n        This enables 
some input arguments to ``fit`` (other than ``X``)\n        to be transformed by the steps of the pipeline up to the step which requires\n        them. The requirement is defined via :ref:`metadata routing <metadata_routing>`.\n        For instance, this can be used to pass a validation set through the pipeline.\n\n        You can only set this if metadata routing is enabled, which you\n        can enable using ``sklearn.set_config(enable_metadata_routing=True)``.\n\n        .. versionadded:: 1.6\n\n    memory : Instance of joblib.Memory or str, default=None\n        Used to cache the fitted transformers of the pipeline. By default,\n        no caching is performed. If a string is given, it is the path to\n        the caching directory. Enabling caching triggers a clone of\n        the transformers before fitting. Therefore, the transformer\n        instance given to the pipeline cannot be inspected\n        directly. Use the attribute ``named_steps`` or ``steps`` to\n        inspect estimators within the pipeline. Caching the\n        transformers is advantageous when fitting is time consuming.\n\n    verbose : bool, default=False\n        If True, the time elapsed while fitting each step will be printed as it\n        is completed.\n\n    Attributes\n    ----------\n    named_steps : :class:`~sklearn.utils.Bunch`\n        Read-only attribute to access any step parameter by user given name.\n        Keys are step names and values are steps parameters.\n\n    classes_ : ndarray of shape (n_classes,)\n        The class labels.\n\n    n_features_in_ : int\n        Number of features seen during first step `fit` method.\n\n    feature_names_in_ : ndarray of shape (`n_features_in_`,)\n        Names of features seen during :term:`fit`. 
Only defined if the\n        underlying estimator exposes such an attribute when fit.\n\n    See Also\n    --------\n    make_pipeline : Helper function to make pipeline.\n\n    Notes\n    -----\n    See :ref:`sphx_glr_auto_examples_pipeline_plot_pipeline_classification.py`\n\n    .. warning::\n       A surprising behaviour of the `imbalanced-learn` pipeline is that it\n       breaks the `scikit-learn` contract where one expects\n       `estimator.fit_transform(X, y)` to be equivalent to\n       `estimator.fit(X, y).transform(X)`.\n\n       The semantics of `fit_resample` are to be applied only during the fit\n       stage. Therefore, resampling will happen when calling `fit_transform`\n       while it will only happen on the `fit` stage when calling `fit` and\n       `transform` separately. Practically, `fit_transform` will lead to a\n       resampled dataset while `fit` and `transform` will not.\n\n    Examples\n    --------\n    >>> from collections import Counter\n    >>> from sklearn.datasets import make_classification\n    >>> from sklearn.model_selection import train_test_split as tts\n    >>> from sklearn.decomposition import PCA\n    >>> from sklearn.neighbors import KNeighborsClassifier as KNN\n    >>> from sklearn.metrics import classification_report\n    >>> from imblearn.over_sampling import SMOTE\n    >>> from imblearn.pipeline import Pipeline\n    >>> X, y = make_classification(n_classes=2, class_sep=2,\n    ... weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,\n    ... 
n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)\n    >>> print(f'Original dataset shape {Counter(y)}')\n    Original dataset shape Counter({1: 900, 0: 100})\n    >>> pca = PCA()\n    >>> smt = SMOTE(random_state=42)\n    >>> knn = KNN()\n    >>> pipeline = Pipeline([('smt', smt), ('pca', pca), ('knn', knn)])\n    >>> X_train, X_test, y_train, y_test = tts(X, y, random_state=42)\n    >>> pipeline.fit(X_train, y_train)\n    Pipeline(...)\n    >>> y_hat = pipeline.predict(X_test)\n    >>> print(classification_report(y_test, y_hat))\n                  precision    recall  f1-score   support\n    <BLANKLINE>\n               0       0.87      1.00      0.93        26\n               1       1.00      0.98      0.99       224\n    <BLANKLINE>\n        accuracy                           0.98       250\n       macro avg       0.93      0.99      0.96       250\n    weighted avg       0.99      0.98      0.98       250\n    <BLANKLINE>\n    \"\"\"\n\n    _parameter_constraints: dict = {\n        \"steps\": \"no_validation\",  # validated in `_validate_steps`\n        \"transform_input\": [list, None],\n        \"memory\": [None, str, HasMethods([\"cache\"])],\n        \"verbose\": [\"boolean\"],\n    }\n\n    def __init__(self, steps, *, transform_input=None, memory=None, verbose=False):\n        self.steps = steps\n        self.transform_input = transform_input\n        self.memory = memory\n        self.verbose = verbose\n\n    # BaseEstimator interface\n\n    def _validate_steps(self):\n        names, estimators = zip(*self.steps)\n\n        # validate names\n        self._validate_names(names)\n\n        # validate estimators\n        transformers = estimators[:-1]\n        estimator = estimators[-1]\n\n        for t in transformers:\n            if t is None or t == \"passthrough\":\n                continue\n\n            is_transfomer = hasattr(t, \"fit\") and hasattr(t, \"transform\")\n            is_sampler = hasattr(t, \"fit_resample\")\n    
        is_not_transfomer_or_sampler = not (is_transfomer or is_sampler)\n\n            if is_not_transfomer_or_sampler:\n                raise TypeError(\n                    \"All intermediate steps of the chain should \"\n                    \"be estimators that implement fit and transform or \"\n                    \"fit_resample (but not both) or be a string 'passthrough' \"\n                    f\"'{t}' (type {type(t)}) doesn't)\"\n                )\n\n            if is_transfomer and is_sampler:\n                raise TypeError(\n                    \"All intermediate steps of the chain should \"\n                    \"be estimators that implement fit and transform or \"\n                    \"fit_resample.\"\n                    f\" '{t}' implements both)\"\n                )\n\n            if isinstance(t, pipeline.Pipeline):\n                raise TypeError(\n                    \"All intermediate steps of the chain should not be Pipelines\"\n                )\n\n        # We allow last estimator to be None as an identity transformation\n        if (\n            estimator is not None\n            and estimator != \"passthrough\"\n            and not hasattr(estimator, \"fit\")\n        ):\n            raise TypeError(\n                \"Last step of Pipeline should implement fit or be the string\"\n                f\" 'passthrough'. '{estimator}' (type {type(estimator)}) doesn't\"\n            )\n\n    def _iter(self, with_final=True, filter_passthrough=True, filter_resample=True):\n        \"\"\"Generate (idx, (name, trans)) tuples from self.steps.\n\n        When `filter_passthrough` is `True`, 'passthrough' and None\n        transformers are filtered out. 
When `filter_resample` is `True`,\n        estimator with a method `fit_resample` are filtered out.\n        \"\"\"\n        it = super()._iter(with_final, filter_passthrough)\n        if filter_resample:\n            return filter(lambda x: not hasattr(x[-1], \"fit_resample\"), it)\n        else:\n            return it\n\n    def _get_metadata_for_step(self, *, step_idx, step_params, all_params):\n        \"\"\"Get params (metadata) for step `name`.\n\n        This transforms the metadata up to this step if required, which is\n        indicated by the `transform_input` parameter.\n\n        If a param in `step_params` is included in the `transform_input` list,\n        it will be transformed.\n\n        Parameters\n        ----------\n        step_idx : int\n            Index of the step in the pipeline.\n\n        step_params : dict\n            Parameters specific to the step. These are routed parameters, e.g.\n            `routed_params[name]`. If a parameter name here is included in the\n            `pipeline.transform_input`, then it will be transformed. Note that\n            these parameters are *after* routing, so the aliases are already\n            resolved.\n\n        all_params : dict\n            All parameters passed by the user. Here this is used to call\n            `transform` on the slice of the pipeline itself.\n\n        Returns\n        -------\n        dict\n            Parameters to be passed to the step. 
The ones which should be\n            transformed are transformed.\n        \"\"\"\n        if (\n            self.transform_input is None\n            or not all_params\n            or not step_params\n            or step_idx == 0\n        ):\n            # we only need to process step_params if transform_input is set\n            # and metadata is given by the user.\n            return step_params\n\n        sub_pipeline = self[:step_idx]\n        sub_metadata_routing = get_routing_for_object(sub_pipeline)\n        # here we get the metadata required by sub_pipeline.transform\n        transform_params = {\n            key: value\n            for key, value in all_params.items()\n            if key\n            in sub_metadata_routing.consumes(\n                method=\"transform\", params=all_params.keys()\n            )\n        }\n        transformed_params = dict()  # this is to be returned\n        transformed_cache = dict()  # used to transform each param once\n        # `step_params` is the output of `process_routing`, so it has a dict for each\n        # method (e.g. fit, transform, predict), which are the args to be passed to\n        # those methods. We need to transform the parameters which are in the\n        # `transform_input`, before returning these dicts.\n        for method, method_params in step_params.items():\n            transformed_params[method] = Bunch()\n            for param_name, param_value in method_params.items():\n                # An example of `(param_name, param_value)` is\n                # `('sample_weight', array([0.5, 0.5, ...]))`\n                if param_name in self.transform_input:\n                    # This parameter now needs to be transformed by the sub_pipeline, to\n                    # this step. 
We cache these computations to avoid repeating them.\n                    transformed_params[method][param_name] = _cached_transform(\n                        sub_pipeline,\n                        cache=transformed_cache,\n                        param_name=param_name,\n                        param_value=param_value,\n                        transform_params=transform_params,\n                    )\n                else:\n                    transformed_params[method][param_name] = param_value\n        return transformed_params\n\n    # Estimator interface\n\n    # def _fit(self, X, y=None, **fit_params_steps):\n    def _fit(self, X, y=None, routed_params=None, raw_params=None):\n        self.steps = list(self.steps)\n        self._validate_steps()\n        # Setup the memory\n        memory = check_memory(self.memory)\n\n        fit_transform_one_cached = memory.cache(_fit_transform_one)\n        fit_resample_one_cached = memory.cache(_fit_resample_one)\n\n        for step_idx, name, transformer in self._iter(\n            with_final=False, filter_passthrough=False, filter_resample=False\n        ):\n            if transformer is None or transformer == \"passthrough\":\n                with _print_elapsed_time(\"Pipeline\", self._log_message(step_idx)):\n                    continue\n\n            if hasattr(memory, \"location\") and memory.location is None:\n                # we do not clone when caching is disabled to\n                # preserve backward compatibility\n                cloned_transformer = transformer\n            else:\n                cloned_transformer = clone(transformer)\n\n            # Fit or load from cache the current transformer\n            step_params = self._get_metadata_for_step(\n                step_idx=step_idx,\n                step_params=routed_params[name],\n                all_params=raw_params,\n            )\n            if hasattr(cloned_transformer, \"transform\") or hasattr(\n                cloned_transformer, 
\"fit_transform\"\n            ):\n                X, fitted_transformer = fit_transform_one_cached(\n                    cloned_transformer,\n                    X,\n                    y,\n                    weight=None,\n                    message_clsname=\"Pipeline\",\n                    message=self._log_message(step_idx),\n                    params=step_params,\n                )\n            elif hasattr(cloned_transformer, \"fit_resample\"):\n                X, y, fitted_transformer = fit_resample_one_cached(\n                    cloned_transformer,\n                    X,\n                    y,\n                    message_clsname=\"Pipeline\",\n                    message=self._log_message(step_idx),\n                    params=routed_params[name],\n                )\n            # Replace the transformer of the step with the fitted\n            # transformer. This is necessary when loading the transformer\n            # from the cache.\n            self.steps[step_idx] = (name, fitted_transformer)\n        return X, y\n\n    # The `fit_*` methods need to be overridden to support the samplers.\n    @_fit_context(\n        # estimators in Pipeline.steps are not validated yet\n        prefer_skip_nested_validation=False\n    )\n    def fit(self, X, y=None, **params):\n        \"\"\"Fit the model.\n\n        Fit all the transforms/samplers one after the other and\n        transform/sample the data, then fit the transformed/sampled\n        data using the final estimator.\n\n        Parameters\n        ----------\n        X : iterable\n            Training data. Must fulfill input requirements of first step of the\n            pipeline.\n\n        y : iterable, default=None\n            Training targets. 
Must fulfill label requirements for all steps of\n            the pipeline.\n\n        **params : dict of str -> object\n            - If `enable_metadata_routing=False` (default):\n\n                Parameters passed to the ``fit`` method of each step, where\n                each parameter name is prefixed such that parameter ``p`` for step\n                ``s`` has key ``s__p``.\n\n            - If `enable_metadata_routing=True`:\n\n                Parameters requested and accepted by steps. Each step must have\n                requested certain metadata for these parameters to be forwarded to\n                them.\n\n            .. versionchanged:: 1.4\n                Parameters are now passed to the ``transform`` method of the\n                intermediate steps as well, if requested, and if\n                `enable_metadata_routing=True` is set via\n                :func:`~sklearn.set_config`.\n\n            See :ref:`Metadata Routing User Guide <metadata_routing>` for more\n            details.\n\n        Returns\n        -------\n        self : Pipeline\n            This estimator.\n        \"\"\"\n        if not _routing_enabled() and self.transform_input is not None:\n            raise ValueError(\n                \"The `transform_input` parameter can only be set if metadata \"\n                \"routing is enabled. You can enable metadata routing using \"\n                \"`sklearn.set_config(enable_metadata_routing=True)`.\"\n            )\n\n        if sklearn_version < parse_version(\"1.4\") and self.transform_input is not None:\n            raise ValueError(\n                \"The `transform_input` parameter is not supported in scikit-learn \"\n                \"versions prior to 1.4. 
Please upgrade to scikit-learn 1.4 or \"\n                \"later.\"\n            )\n\n        routed_params = self._check_method_params(method=\"fit\", props=params)\n        Xt, yt = self._fit(X, y, routed_params, raw_params=params)\n        with _print_elapsed_time(\"Pipeline\", self._log_message(len(self.steps) - 1)):\n            if self._final_estimator != \"passthrough\":\n                last_step_params = self._get_metadata_for_step(\n                    step_idx=len(self) - 1,\n                    step_params=routed_params[self.steps[-1][0]],\n                    all_params=params,\n                )\n                self._final_estimator.fit(Xt, yt, **last_step_params[\"fit\"])\n        return self\n\n    def _can_fit_transform(self):\n        return (\n            self._final_estimator == \"passthrough\"\n            or hasattr(self._final_estimator, \"transform\")\n            or hasattr(self._final_estimator, \"fit_transform\")\n        )\n\n    @available_if(_can_fit_transform)\n    @_fit_context(\n        # estimators in Pipeline.steps are not validated yet\n        prefer_skip_nested_validation=False\n    )\n    def fit_transform(self, X, y=None, **params):\n        \"\"\"Fit the model and transform with the final estimator.\n\n        Fits all the transformers/samplers one after the other and\n        transform/sample the data, then uses fit_transform on\n        transformed data with the final estimator.\n\n        Parameters\n        ----------\n        X : iterable\n            Training data. Must fulfill input requirements of first step of the\n            pipeline.\n\n        y : iterable, default=None\n            Training targets. 
Must fulfill label requirements for all steps of\n            the pipeline.\n\n        **params : dict of str -> object\n            - If `enable_metadata_routing=False` (default):\n\n                Parameters passed to the ``fit`` method of each step, where\n                each parameter name is prefixed such that parameter ``p`` for step\n                ``s`` has key ``s__p``.\n\n            - If `enable_metadata_routing=True`:\n\n                Parameters requested and accepted by steps. Each step must have\n                requested certain metadata for these parameters to be forwarded to\n                them.\n\n            .. versionchanged:: 1.4\n                Parameters are now passed to the ``transform`` method of the\n                intermediate steps as well, if requested, and if\n                `enable_metadata_routing=True`.\n\n            See :ref:`Metadata Routing User Guide <metadata_routing>` for more\n            details.\n\n        Returns\n        -------\n        Xt : array-like of shape (n_samples, n_transformed_features)\n            Transformed samples.\n        \"\"\"\n        routed_params = self._check_method_params(method=\"fit_transform\", props=params)\n        Xt, yt = self._fit(X, y, routed_params)\n\n        last_step = self._final_estimator\n        with _print_elapsed_time(\"Pipeline\", self._log_message(len(self.steps) - 1)):\n            if last_step == \"passthrough\":\n                return Xt\n            last_step_params = self._get_metadata_for_step(\n                step_idx=len(self) - 1,\n                step_params=routed_params[self.steps[-1][0]],\n                all_params=params,\n            )\n            if hasattr(last_step, \"fit_transform\"):\n                return last_step.fit_transform(\n                    Xt, yt, **last_step_params[\"fit_transform\"]\n                )\n            else:\n                # Use the resampled target `yt`, which matches `Xt` in length.\n                return last_step.fit(Xt, yt, **last_step_params[\"fit\"]).transform(\n                    
Xt, **last_step_params[\"transform\"]\n                )\n\n    @available_if(pipeline._final_estimator_has(\"predict\"))\n    def predict(self, X, **params):\n        \"\"\"Transform the data, and apply `predict` with the final estimator.\n\n        Call `transform` of each transformer in the pipeline. The transformed\n        data are finally passed to the final estimator that calls `predict`\n        method. Only valid if the final estimator implements `predict`.\n\n        Parameters\n        ----------\n        X : iterable\n            Data to predict on. Must fulfill input requirements of first step\n            of the pipeline.\n\n        **params : dict of str -> object\n            - If `enable_metadata_routing=False` (default):\n\n                Parameters to the ``predict`` called at the end of all\n                transformations in the pipeline.\n\n            - If `enable_metadata_routing=True`:\n\n                Parameters requested and accepted by steps. Each step must have\n                requested certain metadata for these parameters to be forwarded to\n                them.\n\n            .. versionadded:: 0.20\n\n            .. 
versionchanged:: 1.4\n                Parameters are now passed to the ``transform`` method of the\n                intermediate steps as well, if requested, and if\n                `enable_metadata_routing=True` is set via\n                :func:`~sklearn.set_config`.\n\n            See :ref:`Metadata Routing User Guide <metadata_routing>` for more\n            details.\n\n            Note that while this may be used to return uncertainties from some\n            models with ``return_std`` or ``return_cov``, uncertainties that are\n            generated by the transformations in the pipeline are not propagated\n            to the final estimator.\n\n        Returns\n        -------\n        y_pred : ndarray\n            Result of calling `predict` on the final estimator.\n        \"\"\"\n        # TODO(0.15): Remove the context manager and use check_is_fitted(self)\n        with _raise_or_warn_if_not_fitted(self):\n            Xt = X\n\n            if not _routing_enabled():\n                for _, name, transform in self._iter(with_final=False):\n                    Xt = transform.transform(Xt)\n                return self.steps[-1][1].predict(Xt, **params)\n\n            # metadata routing enabled\n            routed_params = process_routing(self, \"predict\", **params)\n            for _, name, transform in self._iter(with_final=False):\n                Xt = transform.transform(Xt, **routed_params[name].transform)\n            return self.steps[-1][1].predict(\n                Xt, **routed_params[self.steps[-1][0]].predict\n            )\n\n    def _can_fit_resample(self):\n        return self._final_estimator == \"passthrough\" or hasattr(\n            self._final_estimator, \"fit_resample\"\n        )\n\n    @available_if(_can_fit_resample)\n    @_fit_context(\n        # estimators in Pipeline.steps are not validated yet\n        prefer_skip_nested_validation=False\n    )\n    def fit_resample(self, X, y=None, **params):\n        \"\"\"Fit the model and 
sample with the final estimator.\n\n        Fits all the transformers/samplers one after the other and\n        transform/sample the data, then uses fit_resample on transformed\n        data with the final estimator.\n\n        Parameters\n        ----------\n        X : iterable\n            Training data. Must fulfill input requirements of first step of the\n            pipeline.\n\n        y : iterable, default=None\n            Training targets. Must fulfill label requirements for all steps of\n            the pipeline.\n\n        **params : dict of str -> object\n            - If `enable_metadata_routing=False` (default):\n\n                Parameters passed to the ``fit`` method of each step, where\n                each parameter name is prefixed such that parameter ``p`` for step\n                ``s`` has key ``s__p``.\n\n            - If `enable_metadata_routing=True`:\n\n                Parameters requested and accepted by steps. Each step must have\n                requested certain metadata for these parameters to be forwarded to\n                them.\n\n            .. 
versionchanged:: 1.4\n                Parameters are now passed to the ``transform`` method of the\n                intermediate steps as well, if requested, and if\n                `enable_metadata_routing=True`.\n\n            See :ref:`Metadata Routing User Guide <metadata_routing>` for more\n            details.\n\n        Returns\n        -------\n        Xt : array-like of shape (n_samples, n_transformed_features)\n            Transformed samples.\n\n        yt : array-like of shape (n_samples,)\n            Transformed target.\n        \"\"\"\n        routed_params = self._check_method_params(method=\"fit_resample\", props=params)\n        Xt, yt = self._fit(X, y, routed_params)\n        last_step = self._final_estimator\n        with _print_elapsed_time(\"Pipeline\", self._log_message(len(self.steps) - 1)):\n            if last_step == \"passthrough\":\n                return Xt\n            last_step_params = routed_params[self.steps[-1][0]]\n            if hasattr(last_step, \"fit_resample\"):\n                return last_step.fit_resample(\n                    Xt, yt, **last_step_params[\"fit_resample\"]\n                )\n\n    @available_if(pipeline._final_estimator_has(\"fit_predict\"))\n    @_fit_context(\n        # estimators in Pipeline.steps are not validated yet\n        prefer_skip_nested_validation=False\n    )\n    def fit_predict(self, X, y=None, **params):\n        \"\"\"Apply `fit_predict` of last step in pipeline after transforms.\n\n        Applies fit_transforms of a pipeline to the data, followed by the\n        fit_predict method of the final estimator in the pipeline. Valid\n        only if the final estimator implements fit_predict.\n\n        Parameters\n        ----------\n        X : iterable\n            Training data. Must fulfill input requirements of first step of\n            the pipeline.\n\n        y : iterable, default=None\n            Training targets. 
Must fulfill label requirements for all steps\n            of the pipeline.\n\n        **params : dict of str -> object\n            - If `enable_metadata_routing=False` (default):\n\n                Parameters to the ``predict`` called at the end of all\n                transformations in the pipeline.\n\n            - If `enable_metadata_routing=True`:\n\n                Parameters requested and accepted by steps. Each step must have\n                requested certain metadata for these parameters to be forwarded to\n                them.\n\n            .. versionadded:: 0.20\n\n            .. versionchanged:: 1.4\n                Parameters are now passed to the ``transform`` method of the\n                intermediate steps as well, if requested, and if\n                `enable_metadata_routing=True`.\n\n            See :ref:`Metadata Routing User Guide <metadata_routing>` for more\n            details.\n\n            Note that while this may be used to return uncertainties from some\n            models with ``return_std`` or ``return_cov``, uncertainties that are\n            generated by the transformations in the pipeline are not propagated\n            to the final estimator.\n\n        Returns\n        -------\n        y_pred : ndarray of shape (n_samples,)\n            The predicted target.\n        \"\"\"\n        routed_params = self._check_method_params(method=\"fit_predict\", props=params)\n        Xt, yt = self._fit(X, y, routed_params)\n\n        params_last_step = routed_params[self.steps[-1][0]]\n        with _print_elapsed_time(\"Pipeline\", self._log_message(len(self.steps) - 1)):\n            y_pred = self.steps[-1][-1].fit_predict(\n                Xt, yt, **params_last_step.get(\"fit_predict\", {})\n            )\n        return y_pred\n\n    # TODO: remove the following methods when the minimum scikit-learn >= 1.4\n    # They do not depend on resampling but we need to redefine them for the\n    # compatibility with the metadata routing 
framework.\n    @available_if(pipeline._final_estimator_has(\"predict_proba\"))\n    def predict_proba(self, X, **params):\n        \"\"\"Transform the data, and apply `predict_proba` with the final estimator.\n\n        Call `transform` of each transformer in the pipeline. The transformed\n        data are finally passed to the final estimator that calls\n        `predict_proba` method. Only valid if the final estimator implements\n        `predict_proba`.\n\n        Parameters\n        ----------\n        X : iterable\n            Data to predict on. Must fulfill input requirements of first step\n            of the pipeline.\n\n        **params : dict of str -> object\n            - If `enable_metadata_routing=False` (default):\n\n                Parameters to the `predict_proba` called at the end of all\n                transformations in the pipeline.\n\n            - If `enable_metadata_routing=True`:\n\n                Parameters requested and accepted by steps. Each step must have\n                requested certain metadata for these parameters to be forwarded to\n                them.\n\n            .. versionadded:: 0.20\n\n            .. 
versionchanged:: 1.4\n                Parameters are now passed to the ``transform`` method of the\n                intermediate steps as well, if requested, and if\n                `enable_metadata_routing=True`.\n\n            See :ref:`Metadata Routing User Guide <metadata_routing>` for more\n            details.\n\n        Returns\n        -------\n        y_proba : ndarray of shape (n_samples, n_classes)\n            Result of calling `predict_proba` on the final estimator.\n        \"\"\"\n        # TODO(0.15): Remove the context manager and use check_is_fitted(self)\n        with _raise_or_warn_if_not_fitted(self):\n            Xt = X\n\n            if not _routing_enabled():\n                for _, name, transform in self._iter(with_final=False):\n                    Xt = transform.transform(Xt)\n                return self.steps[-1][1].predict_proba(Xt, **params)\n\n            # metadata routing enabled\n            routed_params = process_routing(self, \"predict_proba\", **params)\n            for _, name, transform in self._iter(with_final=False):\n                Xt = transform.transform(Xt, **routed_params[name].transform)\n            return self.steps[-1][1].predict_proba(\n                Xt, **routed_params[self.steps[-1][0]].predict_proba\n            )\n\n    @available_if(pipeline._final_estimator_has(\"decision_function\"))\n    def decision_function(self, X, **params):\n        \"\"\"Transform the data, and apply `decision_function` with the final estimator.\n\n        Call `transform` of each transformer in the pipeline. The transformed\n        data are finally passed to the final estimator that calls\n        `decision_function` method. Only valid if the final estimator\n        implements `decision_function`.\n\n        Parameters\n        ----------\n        X : iterable\n            Data to predict on. 
Must fulfill input requirements of first step\n            of the pipeline.\n\n        **params : dict of string -> object\n            Parameters requested and accepted by steps. Each step must have\n            requested certain metadata for these parameters to be forwarded to\n            them.\n\n            .. versionadded:: 1.4\n                Only available if `enable_metadata_routing=True`. See\n                :ref:`Metadata Routing User Guide <metadata_routing>` for more\n                details.\n\n        Returns\n        -------\n        y_score : ndarray of shape (n_samples, n_classes)\n            Result of calling `decision_function` on the final estimator.\n        \"\"\"\n        # TODO(0.15): Remove the context manager and use check_is_fitted(self)\n        with _raise_or_warn_if_not_fitted(self):\n            _raise_for_params(params, self, \"decision_function\")\n\n            # not branching here since params is only available if\n            # enable_metadata_routing=True\n            routed_params = process_routing(self, \"decision_function\", **params)\n\n            Xt = X\n            for _, name, transform in self._iter(with_final=False):\n                Xt = transform.transform(\n                    Xt, **routed_params.get(name, {}).get(\"transform\", {})\n                )\n            return self.steps[-1][1].decision_function(\n                Xt,\n                **routed_params.get(self.steps[-1][0], {}).get(\"decision_function\", {}),\n            )\n\n    @available_if(pipeline._final_estimator_has(\"score_samples\"))\n    def score_samples(self, X):\n        \"\"\"Transform the data, and apply `score_samples` with the final estimator.\n\n        Call `transform` of each transformer in the pipeline. The transformed\n        data are finally passed to the final estimator that calls\n        `score_samples` method. 
Only valid if the final estimator implements\n        `score_samples`.\n\n        Parameters\n        ----------\n        X : iterable\n            Data to predict on. Must fulfill input requirements of first step\n            of the pipeline.\n\n        Returns\n        -------\n        y_score : ndarray of shape (n_samples,)\n            Result of calling `score_samples` on the final estimator.\n        \"\"\"\n        # TODO(0.15): Remove the context manager and use check_is_fitted(self)\n        with _raise_or_warn_if_not_fitted(self):\n            Xt = X\n            for _, _, transformer in self._iter(with_final=False):\n                Xt = transformer.transform(Xt)\n            return self.steps[-1][1].score_samples(Xt)\n\n    @available_if(pipeline._final_estimator_has(\"predict_log_proba\"))\n    def predict_log_proba(self, X, **params):\n        \"\"\"Transform the data, and apply `predict_log_proba` with the final estimator.\n\n        Call `transform` of each transformer in the pipeline. The transformed\n        data are finally passed to the final estimator that calls\n        `predict_log_proba` method. Only valid if the final estimator\n        implements `predict_log_proba`.\n\n        Parameters\n        ----------\n        X : iterable\n            Data to predict on. Must fulfill input requirements of first step\n            of the pipeline.\n\n        **params : dict of str -> object\n            - If `enable_metadata_routing=False` (default):\n\n                Parameters to the `predict_log_proba` called at the end of all\n                transformations in the pipeline.\n\n            - If `enable_metadata_routing=True`:\n\n                Parameters requested and accepted by steps. Each step must have\n                requested certain metadata for these parameters to be forwarded to\n                them.\n\n            .. versionadded:: 0.20\n\n            .. 
versionchanged:: 1.4\n                Parameters are now passed to the ``transform`` method of the\n                intermediate steps as well, if requested, and if\n                `enable_metadata_routing=True`.\n\n            See :ref:`Metadata Routing User Guide <metadata_routing>` for more\n            details.\n\n        Returns\n        -------\n        y_log_proba : ndarray of shape (n_samples, n_classes)\n            Result of calling `predict_log_proba` on the final estimator.\n        \"\"\"\n        # TODO(0.15): Remove the context manager and use check_is_fitted(self)\n        with _raise_or_warn_if_not_fitted(self):\n            Xt = X\n\n            if not _routing_enabled():\n                for _, name, transform in self._iter(with_final=False):\n                    Xt = transform.transform(Xt)\n                return self.steps[-1][1].predict_log_proba(Xt, **params)\n\n            # metadata routing enabled\n            routed_params = process_routing(self, \"predict_log_proba\", **params)\n            for _, name, transform in self._iter(with_final=False):\n                Xt = transform.transform(Xt, **routed_params[name].transform)\n            return self.steps[-1][1].predict_log_proba(\n                Xt, **routed_params[self.steps[-1][0]].predict_log_proba\n            )\n\n    def _can_transform(self):\n        return self._final_estimator == \"passthrough\" or hasattr(\n            self._final_estimator, \"transform\"\n        )\n\n    @available_if(_can_transform)\n    def transform(self, X, **params):\n        \"\"\"Transform the data, and apply `transform` with the final estimator.\n\n        Call `transform` of each transformer in the pipeline. The transformed\n        data are finally passed to the final estimator that calls\n        `transform` method. 
Only valid if the final estimator\n        implements `transform`.\n\n        This also works where final estimator is `None` in which case all prior\n        transformations are applied.\n\n        Parameters\n        ----------\n        X : iterable\n            Data to transform. Must fulfill input requirements of first step\n            of the pipeline.\n\n        **params : dict of str -> object\n            Parameters requested and accepted by steps. Each step must have\n            requested certain metadata for these parameters to be forwarded to\n            them.\n\n            .. versionadded:: 1.4\n                Only available if `enable_metadata_routing=True`. See\n                :ref:`Metadata Routing User Guide <metadata_routing>` for more\n                details.\n\n        Returns\n        -------\n        Xt : ndarray of shape (n_samples, n_transformed_features)\n            Transformed data.\n        \"\"\"\n        # TODO(0.15): Remove the context manager and use check_is_fitted(self)\n        with _raise_or_warn_if_not_fitted(self):\n            _raise_for_params(params, self, \"transform\")\n\n            # not branching here since params is only available if\n            # enable_metadata_routing=True\n            routed_params = process_routing(self, \"transform\", **params)\n            Xt = X\n            for _, name, transform in self._iter():\n                Xt = transform.transform(Xt, **routed_params[name].transform)\n            return Xt\n\n    def _can_inverse_transform(self):\n        return all(hasattr(t, \"inverse_transform\") for _, _, t in self._iter())\n\n    @available_if(_can_inverse_transform)\n    def inverse_transform(self, Xt, **params):\n        \"\"\"Apply `inverse_transform` for each step in a reverse order.\n\n        All estimators in the pipeline must support `inverse_transform`.\n\n        Parameters\n        ----------\n        Xt : array-like of shape (n_samples, n_transformed_features)\n            Data 
samples, where ``n_samples`` is the number of samples and\n            ``n_features`` is the number of features. Must fulfill\n            input requirements of last step of pipeline's\n            ``inverse_transform`` method.\n\n        **params : dict of str -> object\n            Parameters requested and accepted by steps. Each step must have\n            requested certain metadata for these parameters to be forwarded to\n            them.\n\n            .. versionadded:: 1.4\n                Only available if `enable_metadata_routing=True`. See\n                :ref:`Metadata Routing User Guide <metadata_routing>` for more\n                details.\n\n        Returns\n        -------\n        Xt : ndarray of shape (n_samples, n_features)\n            Inverse transformed data, that is, data in the original feature\n            space.\n        \"\"\"\n        # TODO(0.15): Remove the context manager and use check_is_fitted(self)\n        with _raise_or_warn_if_not_fitted(self):\n            _raise_for_params(params, self, \"inverse_transform\")\n\n            # we don't have to branch here, since params is only non-empty if\n            # enable_metadata_routing=True.\n            routed_params = process_routing(self, \"inverse_transform\", **params)\n            reverse_iter = reversed(list(self._iter()))\n            for _, name, transform in reverse_iter:\n                Xt = transform.inverse_transform(\n                    Xt, **routed_params[name].inverse_transform\n                )\n            return Xt\n\n    @available_if(pipeline._final_estimator_has(\"score\"))\n    def score(self, X, y=None, sample_weight=None, **params):\n        \"\"\"Transform the data, and apply `score` with the final estimator.\n\n        Call `transform` of each transformer in the pipeline. The transformed\n        data are finally passed to the final estimator that calls\n        `score` method. 
Only valid if the final estimator implements `score`.\n\n        Parameters\n        ----------\n        X : iterable\n            Data to predict on. Must fulfill input requirements of first step\n            of the pipeline.\n\n        y : iterable, default=None\n            Targets used for scoring. Must fulfill label requirements for all\n            steps of the pipeline.\n\n        sample_weight : array-like, default=None\n            If not None, this argument is passed as ``sample_weight`` keyword\n            argument to the ``score`` method of the final estimator.\n\n        **params : dict of str -> object\n            Parameters requested and accepted by steps. Each step must have\n            requested certain metadata for these parameters to be forwarded to\n            them.\n\n            .. versionadded:: 1.4\n                Only available if `enable_metadata_routing=True`. See\n                :ref:`Metadata Routing User Guide <metadata_routing>` for more\n                details.\n\n        Returns\n        -------\n        score : float\n            Result of calling `score` on the final estimator.\n        \"\"\"\n        # TODO(0.15): Remove the context manager and use check_is_fitted(self)\n        with _raise_or_warn_if_not_fitted(self):\n            Xt = X\n            if not _routing_enabled():\n                for _, name, transform in self._iter(with_final=False):\n                    Xt = transform.transform(Xt)\n                score_params = {}\n                if sample_weight is not None:\n                    score_params[\"sample_weight\"] = sample_weight\n                return self.steps[-1][1].score(Xt, y, **score_params)\n\n            # metadata routing is enabled.\n            routed_params = process_routing(\n                self, \"score\", sample_weight=sample_weight, **params\n            )\n\n            Xt = X\n            for _, name, transform in self._iter(with_final=False):\n                Xt = 
transform.transform(Xt, **routed_params[name].transform)\n            return self.steps[-1][1].score(\n                Xt, y, **routed_params[self.steps[-1][0]].score\n            )\n\n    # TODO: once scikit-learn >= 1.4, the following function should be simplified by\n    # calling `super().get_metadata_routing()`\n    def get_metadata_routing(self):\n        \"\"\"Get metadata routing of this object.\n\n        Please check :ref:`User Guide <metadata_routing>` on how the routing\n        mechanism works.\n\n        Returns\n        -------\n        routing : MetadataRouter\n            A :class:`~utils.metadata_routing.MetadataRouter` encapsulating\n            routing information.\n        \"\"\"\n        router = MetadataRouter(owner=self.__class__.__name__)\n\n        # first we add all steps except the last one\n        for _, name, trans in self._iter(\n            with_final=False, filter_passthrough=True, filter_resample=False\n        ):\n            method_mapping = MethodMapping()\n            # fit, fit_predict, and fit_transform call fit_transform if it\n            # exists, or else fit and transform\n            if hasattr(trans, \"fit_transform\"):\n                (\n                    method_mapping.add(caller=\"fit\", callee=\"fit_transform\")\n                    .add(caller=\"fit_transform\", callee=\"fit_transform\")\n                    .add(caller=\"fit_predict\", callee=\"fit_transform\")\n                )\n            else:\n                (\n                    method_mapping.add(caller=\"fit\", callee=\"fit\")\n                    .add(caller=\"fit\", callee=\"transform\")\n                    .add(caller=\"fit_transform\", callee=\"fit\")\n                    .add(caller=\"fit_transform\", callee=\"transform\")\n                    .add(caller=\"fit_predict\", callee=\"fit\")\n                    .add(caller=\"fit_predict\", callee=\"transform\")\n                )\n\n            (\n                # handling sampler if the fit_* 
stage\n                method_mapping.add(caller=\"fit\", callee=\"fit_resample\")\n                .add(caller=\"fit_transform\", callee=\"fit_resample\")\n                .add(caller=\"fit_predict\", callee=\"fit_resample\")\n            )\n            (\n                method_mapping.add(caller=\"predict\", callee=\"transform\")\n                .add(caller=\"predict_proba\", callee=\"transform\")\n                .add(caller=\"decision_function\", callee=\"transform\")\n                .add(caller=\"predict_log_proba\", callee=\"transform\")\n                .add(caller=\"transform\", callee=\"transform\")\n                .add(caller=\"inverse_transform\", callee=\"inverse_transform\")\n                .add(caller=\"score\", callee=\"transform\")\n                .add(caller=\"fit_resample\", callee=\"transform\")\n            )\n\n            router.add(method_mapping=method_mapping, **{name: trans})\n\n        final_name, final_est = self.steps[-1]\n        if final_est is None or final_est == \"passthrough\":\n            return router\n\n        # then we add the last step\n        method_mapping = MethodMapping()\n        if hasattr(final_est, \"fit_transform\"):\n            method_mapping.add(caller=\"fit_transform\", callee=\"fit_transform\")\n        else:\n            (\n                method_mapping.add(caller=\"fit\", callee=\"fit\").add(\n                    caller=\"fit\", callee=\"transform\"\n                )\n            )\n        (\n            method_mapping.add(caller=\"fit\", callee=\"fit\")\n            .add(caller=\"predict\", callee=\"predict\")\n            .add(caller=\"fit_predict\", callee=\"fit_predict\")\n            .add(caller=\"predict_proba\", callee=\"predict_proba\")\n            .add(caller=\"decision_function\", callee=\"decision_function\")\n            .add(caller=\"predict_log_proba\", callee=\"predict_log_proba\")\n            
.add(caller=\"transform\", callee=\"transform\")\n            .add(caller=\"inverse_transform\", callee=\"inverse_transform\")\n            .add(caller=\"score\", callee=\"score\")\n            .add(caller=\"fit_resample\", callee=\"fit_resample\")\n        )\n\n        router.add(method_mapping=method_mapping, **{final_name: final_est})\n        return router\n\n    def _check_method_params(self, method, props, **kwargs):\n        if _routing_enabled():\n            routed_params = process_routing(self, method, **props, **kwargs)\n            return routed_params\n        else:\n            fit_params_steps = Bunch(\n                **{\n                    name: Bunch(**{method: {} for method in METHODS})\n                    for name, step in self.steps\n                    if step is not None\n                }\n            )\n            for pname, pval in props.items():\n                if \"__\" not in pname:\n                    raise ValueError(\n                        f\"Pipeline.fit does not accept the {pname} parameter. \"\n                        \"You can pass parameters to specific steps of your \"\n                        \"pipeline using the stepname__parameter format, e.g. 
\"\n                        \"`Pipeline.fit(X, y, logisticregression__sample_weight\"\n                        \"=sample_weight)`.\"\n                    )\n                step, param = pname.split(\"__\", 1)\n                fit_params_steps[step][\"fit\"][param] = pval\n                # without metadata routing, fit_transform and fit_predict\n                # get all the same params and pass it to the last fit.\n                fit_params_steps[step][\"fit_transform\"][param] = pval\n                fit_params_steps[step][\"fit_predict\"][param] = pval\n            return fit_params_steps\n\n    def __sklearn_is_fitted__(self):\n        \"\"\"Indicate whether pipeline has been fit.\n\n        This is done by checking whether the last non-`passthrough` step of the\n        pipeline is fitted.\n\n        An empty pipeline is considered fitted.\n        \"\"\"\n\n        # First find the last step that is not 'passthrough'\n        last_step = None\n        for _, estimator in reversed(self.steps):\n            if estimator != \"passthrough\":\n                last_step = estimator\n                break\n\n        if last_step is None:\n            # All steps are 'passthrough', so the pipeline is considered fitted\n            return True\n\n        try:\n            # check if the last step of the pipeline is fitted\n            # we only check the last step since if the last step is fit, it\n            # means the previous steps should also be fit. 
This is faster than\n            # checking if every step of the pipeline is fit.\n            check_is_fitted(last_step)\n            return True\n        except NotFittedError:\n            return False\n\n    def __sklearn_tags__(self):\n        tags = super().__sklearn_tags__()\n\n        if not self.steps:\n            return tags\n\n        try:\n            if self.steps[0][1] is not None and self.steps[0][1] != \"passthrough\":\n                tags.input_tags.pairwise = get_tags(\n                    self.steps[0][1]\n                ).input_tags.pairwise\n        except (ValueError, AttributeError, TypeError):\n            # This happens when the `steps` is not a list of (name, estimator)\n            # tuples and `fit` is not called yet to validate the steps.\n            pass\n\n        try:\n            if self.steps[-1][1] is not None and self.steps[-1][1] != \"passthrough\":\n                last_step_tags = get_tags(self.steps[-1][1])\n                tags.estimator_type = last_step_tags.estimator_type\n                tags.target_tags.multi_output = last_step_tags.target_tags.multi_output\n                tags.classifier_tags = deepcopy(last_step_tags.classifier_tags)\n                tags.regressor_tags = deepcopy(last_step_tags.regressor_tags)\n                tags.transformer_tags = deepcopy(last_step_tags.transformer_tags)\n        except (ValueError, AttributeError, TypeError):\n            # This happens when the `steps` is not a list of (name, estimator)\n            # tuples and `fit` is not called yet to validate the steps.\n            pass\n\n        return tags\n\n\ndef _fit_resample_one(sampler, X, y, message_clsname=\"\", message=None, params=None):\n    with _print_elapsed_time(message_clsname, message):\n        X_res, y_res = sampler.fit_resample(X, y, **params.get(\"fit_resample\", {}))\n\n        return X_res, y_res, sampler\n\n\ndef _transform_one(transformer, X, y, weight, params=None):\n    \"\"\"Call transform and apply 
weight to output.\n\n    Parameters\n    ----------\n    transformer : estimator\n        Estimator to be used for transformation.\n\n    X : {array-like, sparse matrix} of shape (n_samples, n_features)\n        Input data to be transformed.\n\n    y : ndarray of shape (n_samples,)\n        Ignored.\n\n    weight : float\n        Weight to be applied to the output of the transformation.\n\n    params : dict\n        Parameters to be passed to the transformer's ``transform`` method.\n\n        This should be of the form ``process_routing()[\"step_name\"]``.\n    \"\"\"\n    res = transformer.transform(X, **params.transform)\n    # if we have a weight for this transformer, multiply output\n    if weight is None:\n        return res\n    return res * weight\n\n\ndef _fit_transform_one(\n    transformer, X, y, weight, message_clsname=\"\", message=None, params=None\n):\n    \"\"\"\n    Fits ``transformer`` to ``X`` and ``y``. The transformed result is returned\n    with the fitted transformer. 
If ``weight`` is not ``None``, the result will\n    be multiplied by ``weight``.\n\n    ``params`` needs to be of the form ``process_routing()[\"step_name\"]``.\n    \"\"\"\n    params = params or {}\n    with _print_elapsed_time(message_clsname, message):\n        if hasattr(transformer, \"fit_transform\"):\n            res = transformer.fit_transform(X, y, **params.get(\"fit_transform\", {}))\n        else:\n            res = transformer.fit(X, y, **params.get(\"fit\", {})).transform(\n                X, **params.get(\"transform\", {})\n            )\n\n    if weight is None:\n        return res, transformer\n    return res * weight, transformer\n\n\n@validate_params(\n    {\n        \"memory\": [None, str, HasMethods([\"cache\"])],\n        \"transform_input\": [None, list],\n        \"verbose\": [\"boolean\"],\n    },\n    prefer_skip_nested_validation=True,\n)\ndef make_pipeline(*steps, memory=None, transform_input=None, verbose=False):\n    \"\"\"Construct a Pipeline from the given estimators.\n\n    This is a shorthand for the Pipeline constructor; it does not require, and\n    does not permit, naming the estimators. Instead, their names will be set\n    to the lowercase of their types automatically.\n\n    Parameters\n    ----------\n    *steps : list of estimators\n        A list of estimators.\n\n    memory : None, str or object with the joblib.Memory interface, default=None\n        Used to cache the fitted transformers of the pipeline. By default,\n        no caching is performed. If a string is given, it is the path to\n        the caching directory. Enabling caching triggers a clone of\n        the transformers before fitting. Therefore, the transformer\n        instance given to the pipeline cannot be inspected\n        directly. Use the attribute ``named_steps`` or ``steps`` to\n        inspect estimators within the pipeline. 
Caching the\n        transformers is advantageous when fitting is time consuming.\n\n    transform_input : list of str, default=None\n        This enables some input arguments to ``fit`` (other than ``X``) to be\n        transformed by the steps of the pipeline up to the step which requires them.\n        The requirement is defined via :ref:`metadata routing <metadata_routing>`.\n        This can be used to pass a validation set through the pipeline, for instance.\n\n        You can only set this if metadata routing is enabled, which you\n        can enable using ``sklearn.set_config(enable_metadata_routing=True)``.\n\n        .. versionadded:: 1.6\n\n    verbose : bool, default=False\n        If True, the time elapsed while fitting each step will be printed as it\n        is completed.\n\n    Returns\n    -------\n    p : Pipeline\n        Returns an imbalanced-learn `Pipeline` instance that handles samplers.\n\n    See Also\n    --------\n    imblearn.pipeline.Pipeline : Class for creating a pipeline of\n        transforms with a final estimator.\n\n    Examples\n    --------\n    >>> from sklearn.naive_bayes import GaussianNB\n    >>> from sklearn.preprocessing import StandardScaler\n    >>> make_pipeline(StandardScaler(), GaussianNB(priors=None))\n    Pipeline(steps=[('standardscaler', StandardScaler()),\n                    ('gaussiannb', GaussianNB())])\n    \"\"\"\n    return Pipeline(\n        pipeline._name_estimators(steps),\n        memory=memory,\n        transform_input=transform_input,\n        verbose=verbose,\n    )\n"
  },
  {
    "path": "imblearn/tensorflow/__init__.py",
    "content": "\"\"\"The :mod:`imblearn.tensorflow` provides utilities to deal with imbalanced\ndataset in tensorflow.\"\"\"\n\nfrom imblearn.tensorflow._generator import balanced_batch_generator\n\n__all__ = [\"balanced_batch_generator\"]\n"
  },
  {
    "path": "imblearn/tensorflow/_generator.py",
    "content": "\"\"\"Implement generators for ``tensorflow`` which will balance the data.\"\"\"\n\nfrom scipy.sparse import issparse\nfrom sklearn.base import clone\nfrom sklearn.utils import _safe_indexing, check_random_state\n\nfrom imblearn.under_sampling import RandomUnderSampler\nfrom imblearn.utils import Substitution\nfrom imblearn.utils._docstring import _random_state_docstring\n\n\n@Substitution(random_state=_random_state_docstring)\ndef balanced_batch_generator(\n    X,\n    y,\n    *,\n    sample_weight=None,\n    sampler=None,\n    batch_size=32,\n    keep_sparse=False,\n    random_state=None,\n):\n    \"\"\"Create a balanced batch generator to train tensorflow model.\n\n    Returns a generator --- as well as the number of step per epoch --- to\n    iterate to get the mini-batches. The sampler defines the sampling strategy\n    used to balance the dataset ahead of creating the batch. The sampler should\n    have an attribute ``sample_indices_``.\n\n    .. versionadded:: 0.4\n\n    Parameters\n    ----------\n    X : ndarray of shape (n_samples, n_features)\n        Original imbalanced dataset.\n\n    y : ndarray of shape (n_samples,) or (n_samples, n_classes)\n        Associated targets.\n\n    sample_weight : ndarray of shape (n_samples,), default=None\n        Sample weight.\n\n    sampler : sampler object, default=None\n        A sampler instance which has an attribute ``sample_indices_``.\n        By default, the sampler used is a\n        :class:`~imblearn.under_sampling.RandomUnderSampler`.\n\n    batch_size : int, default=32\n        Number of samples per gradient update.\n\n    keep_sparse : bool, default=False\n        Either or not to conserve or not the sparsity of the input ``X``. By\n        default, the returned batches will be dense.\n\n    {random_state}\n\n    Returns\n    -------\n    generator : generator of tuple\n        Generate batch of data. 
The generated tuples are either (X_batch,\n        y_batch) or (X_batch, y_batch, sample_weight_batch).\n\n    steps_per_epoch : int\n        The number of steps (batches) per epoch.\n    \"\"\"\n\n    random_state = check_random_state(random_state)\n    if sampler is None:\n        sampler_ = RandomUnderSampler(random_state=random_state)\n    else:\n        sampler_ = clone(sampler)\n    sampler_.fit_resample(X, y)\n    if not hasattr(sampler_, \"sample_indices_\"):\n        raise ValueError(\"'sampler' needs to have an attribute 'sample_indices_'.\")\n    indices = sampler_.sample_indices_\n    # shuffle the indices since the samplers pack them by class\n    random_state.shuffle(indices)\n\n    def generator(X, y, sample_weight, indices, batch_size):\n        while True:\n            for index in range(0, len(indices), batch_size):\n                X_res = _safe_indexing(X, indices[index : index + batch_size])\n                y_res = _safe_indexing(y, indices[index : index + batch_size])\n                if issparse(X_res) and not keep_sparse:\n                    X_res = X_res.toarray()\n                if sample_weight is None:\n                    yield X_res, y_res\n                else:\n                    sw_res = _safe_indexing(\n                        sample_weight, indices[index : index + batch_size]\n                    )\n                    yield X_res, y_res, sw_res\n\n    return (\n        generator(X, y, sample_weight, indices, batch_size),\n        int(indices.size // batch_size),\n    )\n"
  },
  {
    "path": "imblearn/tensorflow/tests/__init__.py",
    "content": ""
  },
  {
    "path": "imblearn/tensorflow/tests/test_generator.py",
    "content": "import numpy as np\nimport pytest\nfrom scipy import sparse\nfrom sklearn.datasets import load_iris\nfrom sklearn.utils.fixes import parse_version\n\nfrom imblearn.datasets import make_imbalance\nfrom imblearn.over_sampling import RandomOverSampler\nfrom imblearn.tensorflow import balanced_batch_generator\nfrom imblearn.under_sampling import NearMiss\n\ntf = pytest.importorskip(\"tensorflow\")\n\n\n@pytest.fixture\ndef data():\n    X, y = load_iris(return_X_y=True)\n    X, y = make_imbalance(X, y, sampling_strategy={0: 30, 1: 50, 2: 40})\n    X = X.astype(np.float32)\n    return X, y\n\n\ndef check_balanced_batch_generator_tf_1_X_X(dataset, sampler):\n    X, y = dataset\n    batch_size = 10\n    training_generator, steps_per_epoch = balanced_batch_generator(\n        X,\n        y,\n        sample_weight=None,\n        sampler=sampler,\n        batch_size=batch_size,\n        random_state=42,\n    )\n\n    learning_rate = 0.01\n    epochs = 10\n    input_size = X.shape[1]\n    output_size = 3\n\n    # helper functions\n    def init_weights(shape):\n        return tf.Variable(tf.random_normal(shape, stddev=0.01))\n\n    def accuracy(y_true, y_pred):\n        return np.mean(np.argmax(y_pred, axis=1) == y_true)\n\n    # input and output\n    data = tf.placeholder(\"float32\", shape=[None, input_size])\n    targets = tf.placeholder(\"int32\", shape=[None])\n\n    # build the model and weights\n    W = init_weights([input_size, output_size])\n    b = init_weights([output_size])\n    out_act = tf.nn.sigmoid(tf.matmul(data, W) + b)\n\n    # build the loss, predict, and train operator\n    cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(\n        logits=out_act, labels=targets\n    )\n    loss = tf.reduce_sum(cross_entropy)\n    optimizer = tf.train.GradientDescentOptimizer(learning_rate)\n    train_op = optimizer.minimize(loss)\n    predict = tf.nn.softmax(out_act)\n\n    # Initialization of all variables in the graph\n    init = 
tf.global_variables_initializer()\n\n    with tf.Session() as sess:\n        sess.run(init)\n\n        for e in range(epochs):\n            for i in range(steps_per_epoch):\n                X_batch, y_batch = next(training_generator)\n                sess.run(\n                    [train_op, loss],\n                    feed_dict={data: X_batch, targets: y_batch},\n                )\n\n            # For each epoch, run accuracy on train and test\n            predicts_train = sess.run(predict, feed_dict={data: X})\n            print(f\"epoch: {e} train accuracy: {accuracy(y, predicts_train):.3f}\")\n\n\ndef check_balanced_batch_generator_tf_2_X_X_compat_1_X_X(dataset, sampler):\n    tf.compat.v1.disable_eager_execution()\n\n    X, y = dataset\n    batch_size = 10\n    training_generator, steps_per_epoch = balanced_batch_generator(\n        X,\n        y,\n        sample_weight=None,\n        sampler=sampler,\n        batch_size=batch_size,\n        random_state=42,\n    )\n\n    learning_rate = 0.01\n    epochs = 10\n    input_size = X.shape[1]\n    output_size = 3\n\n    # helper functions\n    def init_weights(shape):\n        return tf.Variable(tf.random.normal(shape, stddev=0.01))\n\n    def accuracy(y_true, y_pred):\n        return np.mean(np.argmax(y_pred, axis=1) == y_true)\n\n    # input and output\n    data = tf.compat.v1.placeholder(\"float32\", shape=[None, input_size])\n    targets = tf.compat.v1.placeholder(\"int32\", shape=[None])\n\n    # build the model and weights\n    W = init_weights([input_size, output_size])\n    b = init_weights([output_size])\n    out_act = tf.nn.sigmoid(tf.matmul(data, W) + b)\n\n    # build the loss, predict, and train operator\n    cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(\n        logits=out_act, labels=targets\n    )\n    loss = tf.reduce_sum(input_tensor=cross_entropy)\n    optimizer = tf.compat.v1.train.GradientDescentOptimizer(learning_rate)\n    train_op = optimizer.minimize(loss)\n    predict = 
tf.nn.softmax(out_act)\n\n    # Initialization of all variables in the graph\n    init = tf.compat.v1.global_variables_initializer()\n\n    with tf.compat.v1.Session() as sess:\n        sess.run(init)\n\n        for e in range(epochs):\n            for i in range(steps_per_epoch):\n                X_batch, y_batch = next(training_generator)\n                sess.run(\n                    [train_op, loss],\n                    feed_dict={data: X_batch, targets: y_batch},\n                )\n\n            # For each epoch, run accuracy on train and test\n            predicts_train = sess.run(predict, feed_dict={data: X})\n            print(f\"epoch: {e} train accuracy: {accuracy(y, predicts_train):.3f}\")\n\n\n@pytest.mark.parametrize(\"sampler\", [None, NearMiss(), RandomOverSampler()])\ndef test_balanced_batch_generator(data, sampler):\n    if parse_version(tf.__version__) < parse_version(\"2.0.0\"):\n        check_balanced_batch_generator_tf_1_X_X(data, sampler)\n    else:\n        check_balanced_batch_generator_tf_2_X_X_compat_1_X_X(data, sampler)\n\n\n@pytest.mark.parametrize(\"keep_sparse\", [True, False])\ndef test_balanced_batch_generator_function_sparse(data, keep_sparse):\n    X, y = data\n\n    training_generator, steps_per_epoch = balanced_batch_generator(\n        sparse.csr_matrix(X),\n        y,\n        keep_sparse=keep_sparse,\n        batch_size=10,\n        random_state=42,\n    )\n    for idx in range(steps_per_epoch):\n        X_batch, y_batch = next(training_generator)\n        if keep_sparse:\n            assert sparse.issparse(X_batch)\n        else:\n            assert not sparse.issparse(X_batch)\n"
  },
  {
    "path": "imblearn/tests/__init__.py",
    "content": ""
  },
  {
    "path": "imblearn/tests/test_base.py",
    "content": "\"\"\"Test for miscellaneous samplers objects.\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# License: MIT\n\nimport numpy as np\nimport pytest\nfrom scipy import sparse\nfrom sklearn.datasets import load_iris, make_regression\nfrom sklearn.linear_model import LinearRegression\nfrom sklearn.utils import _safe_indexing\nfrom sklearn.utils._testing import assert_allclose_dense_sparse, assert_array_equal\nfrom sklearn.utils.multiclass import type_of_target\n\nfrom imblearn import FunctionSampler\nfrom imblearn.datasets import make_imbalance\nfrom imblearn.pipeline import make_pipeline\nfrom imblearn.under_sampling import RandomUnderSampler\n\niris = load_iris()\nX, y = make_imbalance(\n    iris.data, iris.target, sampling_strategy={0: 10, 1: 25}, random_state=0\n)\n\n\ndef test_function_sampler_reject_sparse():\n    X_sparse = sparse.csr_matrix(X)\n    sampler = FunctionSampler(accept_sparse=False)\n    err_msg = \"dense data is required\"\n    with pytest.raises(\n        TypeError,\n        match=err_msg,\n    ):\n        sampler.fit_resample(X_sparse, y)\n\n\n@pytest.mark.parametrize(\n    \"X, y\", [(X, y), (sparse.csr_matrix(X), y), (sparse.csc_matrix(X), y)]\n)\ndef test_function_sampler_identity(X, y):\n    sampler = FunctionSampler()\n    X_res, y_res = sampler.fit_resample(X, y)\n    assert_allclose_dense_sparse(X_res, X)\n    assert_array_equal(y_res, y)\n\n\n@pytest.mark.parametrize(\n    \"X, y\", [(X, y), (sparse.csr_matrix(X), y), (sparse.csc_matrix(X), y)]\n)\ndef test_function_sampler_func(X, y):\n    def func(X, y):\n        return X[:10], y[:10]\n\n    sampler = FunctionSampler(func=func)\n    X_res, y_res = sampler.fit_resample(X, y)\n    assert_allclose_dense_sparse(X_res, X[:10])\n    assert_array_equal(y_res, y[:10])\n\n\n@pytest.mark.parametrize(\n    \"X, y\", [(X, y), (sparse.csr_matrix(X), y), (sparse.csc_matrix(X), y)]\n)\ndef test_function_sampler_func_kwargs(X, y):\n    def func(X, y, sampling_strategy, 
random_state):\n        rus = RandomUnderSampler(\n            sampling_strategy=sampling_strategy, random_state=random_state\n        )\n        return rus.fit_resample(X, y)\n\n    sampler = FunctionSampler(\n        func=func, kw_args={\"sampling_strategy\": \"auto\", \"random_state\": 0}\n    )\n    X_res, y_res = sampler.fit_resample(X, y)\n    X_res_2, y_res_2 = RandomUnderSampler(random_state=0).fit_resample(X, y)\n    assert_allclose_dense_sparse(X_res, X_res_2)\n    assert_array_equal(y_res, y_res_2)\n\n\ndef test_function_sampler_validate():\n    # check that we can pass a regression target by turning off the\n    # validation\n    X, y = make_regression()\n\n    def dummy_sampler(X, y):\n        indices = np.random.choice(np.arange(X.shape[0]), size=100)\n        return _safe_indexing(X, indices), _safe_indexing(y, indices)\n\n    sampler = FunctionSampler(func=dummy_sampler, validate=False)\n    pipeline = make_pipeline(sampler, LinearRegression())\n    y_pred = pipeline.fit(X, y).predict(X)\n\n    assert type_of_target(y_pred) == \"continuous\"\n\n\ndef test_function_resampler_fit():\n    # Check that the validation is bypassed when calling `fit`\n    # Non-regression test for:\n    # https://github.com/scikit-learn-contrib/imbalanced-learn/issues/782\n    X = np.array([[1, np.nan], [2, 3], [np.inf, 4]])\n    y = np.array([0, 1, 1])\n\n    def func(X, y):\n        return X[:1], y[:1]\n\n    sampler = FunctionSampler(func=func, validate=False)\n    sampler.fit(X, y)\n    sampler.fit_resample(X, y)\n"
  },
  {
    "path": "imblearn/tests/test_common.py",
    "content": "\"\"\"Common tests\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Christos Aridas\n# License: MIT\n\nimport warnings\nfrom collections import OrderedDict\n\nimport numpy as np\nimport pytest\nfrom sklearn.exceptions import ConvergenceWarning\nfrom sklearn.utils._testing import ignore_warnings\nfrom sklearn_compat.utils.estimator_checks import (\n    parametrize_with_checks as parametrize_with_checks_sklearn,\n)\n\nfrom imblearn.over_sampling import RandomOverSampler\nfrom imblearn.under_sampling import RandomUnderSampler\nfrom imblearn.utils._test_common.instance_generator import (\n    _get_check_estimator_ids,\n    _get_expected_failed_checks,\n    _tested_estimators,\n)\nfrom imblearn.utils.estimator_checks import (\n    _set_checking_parameters,\n    check_dataframe_column_names_consistency,\n    check_param_validation,\n    parametrize_with_checks,\n)\nfrom imblearn.utils.testing import all_estimators\n\n\n@pytest.mark.parametrize(\"name, Estimator\", all_estimators())\ndef test_all_estimator_no_base_class(name, Estimator):\n    # test that all_estimators doesn't find abstract classes.\n    msg = f\"Base estimators such as {name} should not be included in all_estimators\"\n    assert not name.lower().startswith(\"base\"), msg\n\n\n@parametrize_with_checks_sklearn(\n    list(_tested_estimators()), expected_failed_checks=_get_expected_failed_checks\n)\ndef test_estimators_compatibility_sklearn(estimator, check, request):\n    _set_checking_parameters(estimator)\n    check(estimator)\n\n\n@parametrize_with_checks(\n    list(_tested_estimators()), expected_failed_checks=_get_expected_failed_checks\n)\ndef test_estimators_imblearn(estimator, check, request):\n    # Common tests for estimator instances\n    with ignore_warnings(\n        category=(\n            FutureWarning,\n            ConvergenceWarning,\n            UserWarning,\n            FutureWarning,\n        )\n    ):\n        
_set_checking_parameters(estimator)\n        check(estimator)\n\n\n@pytest.mark.parametrize(\n    \"estimator\", _tested_estimators(), ids=_get_check_estimator_ids\n)\ndef test_check_param_validation(estimator):\n    name = estimator.__class__.__name__\n    _set_checking_parameters(estimator)\n    check_param_validation(name, estimator)\n\n\n@pytest.mark.parametrize(\"Sampler\", [RandomOverSampler, RandomUnderSampler])\ndef test_strategy_as_ordered_dict(Sampler):\n    \"\"\"Check that it is possible to pass an `OrderedDict` as strategy.\"\"\"\n    rng = np.random.RandomState(42)\n    X, y = rng.randn(30, 2), np.array([0] * 10 + [1] * 20)\n    sampler = Sampler(random_state=42)\n    if isinstance(sampler, RandomOverSampler):\n        strategy = OrderedDict({0: 20, 1: 20})\n    else:\n        strategy = OrderedDict({0: 10, 1: 10})\n    sampler.set_params(sampling_strategy=strategy)\n    X_res, y_res = sampler.fit_resample(X, y)\n    assert X_res.shape[0] == sum(strategy.values())\n    assert y_res.shape[0] == sum(strategy.values())\n\n\n@pytest.mark.parametrize(\n    \"estimator\", _tested_estimators(), ids=_get_check_estimator_ids\n)\ndef test_pandas_column_name_consistency(estimator):\n    _set_checking_parameters(estimator)\n    with ignore_warnings(category=(FutureWarning)):\n        with warnings.catch_warnings(record=True) as record:\n            check_dataframe_column_names_consistency(\n                estimator.__class__.__name__, estimator\n            )\n        for warning in record:\n            assert \"was fitted without feature names\" not in str(warning.message)\n"
  },
  {
    "path": "imblearn/tests/test_docstring_parameters.py",
    "content": "# Authors: Alexandre Gramfort <alexandre.gramfort@inria.fr>\n#          Raghav RV <rvraghav93@gmail.com>\n# License: BSD 3 clause\n\nimport importlib\nimport inspect\nimport warnings\nfrom inspect import signature\nfrom pkgutil import walk_packages\n\nimport pytest\nfrom sklearn.datasets import make_classification\nfrom sklearn.utils._testing import (\n    _get_func_name,\n    check_docstring_parameters,\n    ignore_warnings,\n)\nfrom sklearn.utils.deprecation import _is_deprecated\nfrom sklearn.utils.estimator_checks import (\n    _enforce_estimator_tags_X,\n    _enforce_estimator_tags_y,\n)\n\nimport imblearn\nfrom imblearn.base import is_sampler\nfrom imblearn.under_sampling import NearMiss\nfrom imblearn.utils._test_common.instance_generator import _tested_estimators\nfrom imblearn.utils.estimator_checks import _set_checking_parameters\n\n# walk_packages() ignores DeprecationWarnings, now we need to ignore\n# FutureWarnings\nwith warnings.catch_warnings():\n    warnings.simplefilter(\"ignore\", FutureWarning)\n    # mypy error: Module has no attribute \"__path__\"\n    imblearn_path = imblearn.__path__  # type: ignore  # mypy issue #1422\n    PUBLIC_MODULES = set(\n        [\n            pckg[1]\n            for pckg in walk_packages(prefix=\"imblearn.\", path=imblearn_path)\n            if not (\"._\" in pckg[1] or \".tests.\" in pckg[1])\n        ]\n    )\n\n# functions to ignore args / docstring of\n_DOCSTRING_IGNORES = [\"ValueDifferenceMetric\"]\n_IGNORE_ATTRIBUTES = {\n    NearMiss: [\"nn_ver3_\"],\n}\n\n# Methods where y param should be ignored if y=None by default\n_METHODS_IGNORE_NONE_Y = [\n    \"fit\",\n    \"score\",\n    \"fit_predict\",\n    \"fit_transform\",\n    \"partial_fit\",\n    \"predict\",\n]\n\n\n# numpydoc 0.8.0's docscrape tool raises because of collections.abc under\n# Python 3.7\n@pytest.mark.filterwarnings(\"ignore::FutureWarning\")\n@pytest.mark.filterwarnings(\"ignore::DeprecationWarning\")\ndef 
test_docstring_parameters():\n    # Test module docstring formatting\n\n    # Skip test if numpydoc is not found\n    pytest.importorskip(\n        \"numpydoc\", reason=\"numpydoc is required to test the docstrings\"\n    )\n\n    # XXX unreached code as of v0.22\n    from numpydoc import docscrape\n\n    incorrect = []\n    for name in PUBLIC_MODULES:\n        if name.endswith(\".conftest\"):\n            # pytest tooling, not part of the scikit-learn API\n            continue\n        with warnings.catch_warnings(record=True):\n            module = importlib.import_module(name)\n        classes = inspect.getmembers(module, inspect.isclass)\n        # Exclude non-scikit-learn classes\n        classes = [cls for cls in classes if cls[1].__module__.startswith(\"imblearn\")]\n        for cname, cls in classes:\n            this_incorrect = []\n            if cname in _DOCSTRING_IGNORES or cname.startswith(\"_\"):\n                continue\n            if inspect.isabstract(cls):\n                continue\n            with warnings.catch_warnings(record=True) as w:\n                cdoc = docscrape.ClassDoc(cls)\n            if len(w):\n                raise RuntimeError(f\"Error for __init__ of {cls} in {name}:\\n{w[0]}\")\n\n            cls_init = getattr(cls, \"__init__\", None)\n\n            if _is_deprecated(cls_init):\n                continue\n            elif cls_init is not None:\n                this_incorrect += check_docstring_parameters(cls.__init__, cdoc)\n\n            for method_name in cdoc.methods:\n                method = getattr(cls, method_name)\n                if _is_deprecated(method):\n                    continue\n                param_ignore = None\n                # Now skip docstring test for y when y is None\n                # by default for API reason\n                if method_name in _METHODS_IGNORE_NONE_Y:\n                    sig = signature(method)\n                    if \"y\" in sig.parameters and sig.parameters[\"y\"].default 
is None:\n                        param_ignore = [\"y\"]  # ignore y for fit and score\n                result = check_docstring_parameters(method, ignore=param_ignore)\n                this_incorrect += result\n\n            incorrect += this_incorrect\n\n        functions = inspect.getmembers(module, inspect.isfunction)\n        # Exclude imported functions\n        functions = [fn for fn in functions if fn[1].__module__ == name]\n        for fname, func in functions:\n            # Don't test private methods / functions\n            if fname.startswith(\"_\"):\n                continue\n            if fname == \"configuration\" and name.endswith(\"setup\"):\n                continue\n            name_ = _get_func_name(func)\n            if not any(d in name_ for d in _DOCSTRING_IGNORES) and not _is_deprecated(\n                func\n            ):\n                incorrect += check_docstring_parameters(func)\n\n    msg = \"\\n\".join(incorrect)\n    if len(incorrect) > 0:\n        raise AssertionError(\"Docstring Error:\\n\" + msg)\n\n\n@ignore_warnings(category=FutureWarning)\ndef test_tabs():\n    # Test that there are no tabs in our source files\n    for importer, modname, ispkg in walk_packages(\n        imblearn.__path__, prefix=\"imblearn.\"\n    ):\n        # because we don't import\n        mod = importlib.import_module(modname)\n\n        try:\n            source = inspect.getsource(mod)\n        except OSError:  # user probably should have run \"make clean\"\n            continue\n        assert \"\\t\" not in source, (\n            f'\"{modname}\" has tabs, please remove them or add it to the ignore list',\n        )\n\n\n@pytest.mark.parametrize(\"estimator\", list(_tested_estimators()))\ndef test_fit_docstring_attributes(estimator):\n    pytest.importorskip(\"numpydoc\")\n    from numpydoc import docscrape\n\n    Estimator = estimator.__class__\n    if Estimator.__name__ in _DOCSTRING_IGNORES:\n        return\n\n    doc = 
docscrape.ClassDoc(Estimator)\n    attributes = doc[\"Attributes\"]\n\n    _set_checking_parameters(estimator)\n\n    X, y = make_classification(\n        n_samples=20,\n        n_features=3,\n        n_redundant=0,\n        n_classes=2,\n        random_state=2,\n    )\n\n    y = _enforce_estimator_tags_y(estimator, y)\n    X = _enforce_estimator_tags_X(estimator, X)\n\n    if \"oob_score\" in estimator.get_params():\n        estimator.set_params(bootstrap=True, oob_score=True)\n\n    if is_sampler(estimator):\n        estimator.fit_resample(X, y)\n    else:\n        estimator.fit(X, y)\n\n    skipped_attributes = set(\n        [\n            \"base_estimator_\",  # this attribute exist with old version of sklearn\n        ]\n    )\n\n    for attr in attributes:\n        if attr.name in skipped_attributes:\n            continue\n        desc = \" \".join(attr.desc).lower()\n        # As certain attributes are present \"only\" if a certain parameter is\n        # provided, this checks if the word \"only\" is present in the attribute\n        # description, and if not the attribute is required to be present.\n        if \"only \" in desc:\n            continue\n        # ignore deprecation warnings\n        with ignore_warnings(category=FutureWarning):\n            if attr.name in _IGNORE_ATTRIBUTES.get(Estimator, []):\n                continue\n            assert hasattr(estimator, attr.name)\n\n    fit_attr = _get_all_fitted_attributes(estimator)\n    fit_attr_names = [attr.name for attr in attributes]\n    undocumented_attrs = set(fit_attr).difference(fit_attr_names)\n    undocumented_attrs = set(undocumented_attrs).difference(skipped_attributes)\n    if undocumented_attrs:\n        raise AssertionError(\n            f\"Undocumented attributes for {Estimator.__name__}: {undocumented_attrs}\"\n        )\n\n\ndef _get_all_fitted_attributes(estimator):\n    \"Get all the fitted attributes of an estimator including properties\"\n    # attributes\n    fit_attr = 
list(estimator.__dict__.keys())\n\n    # properties\n    with warnings.catch_warnings():\n        warnings.filterwarnings(\"error\", category=FutureWarning)\n\n        for name in dir(estimator.__class__):\n            obj = getattr(estimator.__class__, name)\n            if not isinstance(obj, property):\n                continue\n\n            # ignore properties that raise an AttributeError and deprecated\n            # properties\n            try:\n                getattr(estimator, name)\n            except (AttributeError, FutureWarning):\n                continue\n            fit_attr.append(name)\n\n    return [k for k in fit_attr if k.endswith(\"_\") and not k.startswith(\"_\")]\n"
  },
  {
    "path": "imblearn/tests/test_exceptions.py",
    "content": "\"\"\"Test for the exceptions modules\"\"\"\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Christos Aridas\n# License: MIT\n\nfrom pytest import raises\n\nfrom imblearn.exceptions import raise_isinstance_error\n\n\ndef test_raise_isinstance_error():\n    var = 10.0\n    with raises(ValueError, match=\"has to be one of\"):\n        raise_isinstance_error(\"var\", [int], var)\n"
  },
  {
    "path": "imblearn/tests/test_pipeline.py",
    "content": "\"\"\"\nTest the pipeline module.\n\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Christos Aridas\n# License: MIT\n\nimport itertools\nimport re\nimport shutil\nimport time\nfrom tempfile import mkdtemp\n\nimport numpy as np\nimport pytest\nfrom joblib import Memory\nfrom pytest import raises\nfrom sklearn import config_context\nfrom sklearn.base import BaseEstimator, ClassifierMixin, TransformerMixin, clone\nfrom sklearn.cluster import KMeans\nfrom sklearn.datasets import load_iris, make_classification\nfrom sklearn.decomposition import PCA\nfrom sklearn.feature_selection import SelectKBest, f_classif\nfrom sklearn.linear_model import LinearRegression, LogisticRegression\nfrom sklearn.neighbors import LocalOutlierFactor\nfrom sklearn.pipeline import FeatureUnion\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.svm import SVC\nfrom sklearn.utils._testing import (\n    assert_allclose,\n    assert_array_almost_equal,\n    assert_array_equal,\n)\nfrom sklearn.utils.fixes import parse_version\nfrom sklearn_compat._sklearn_compat import sklearn_version\nfrom sklearn_compat.utils._tags import Tags\n\nfrom imblearn.base import BaseSampler\nfrom imblearn.datasets import make_imbalance\nfrom imblearn.pipeline import Pipeline, make_pipeline\nfrom imblearn.under_sampling import EditedNearestNeighbours as ENN\nfrom imblearn.under_sampling import RandomUnderSampler\nfrom imblearn.utils.estimator_checks import check_param_validation\n\nJUNK_FOOD_DOCS = (\n    \"the pizza pizza beer copyright\",\n    \"the pizza burger beer copyright\",\n    \"the the pizza beer beer copyright\",\n    \"the burger beer beer copyright\",\n    \"the coke burger coke copyright\",\n    \"the coke burger burger\",\n)\n\nR_TOL = 1e-4\n\n\nclass NoFit:\n    \"\"\"Small class to test parameter dispatching.\"\"\"\n\n    def __init__(self, a=None, b=None):\n        self.a = a\n        self.b = b\n\n    def __sklearn_tags__(self):\n        return 
Tags()\n\n\nclass NoTrans(NoFit):\n    def fit(self, X, y):\n        return self\n\n    def get_params(self, deep=False):\n        return {\"a\": self.a, \"b\": self.b}\n\n    def set_params(self, **params):\n        self.a = params[\"a\"]\n        return self\n\n\nclass NoInvTransf(NoTrans):\n    def transform(self, X, y=None):\n        return X\n\n\nclass Transf(NoInvTransf):\n    def transform(self, X, y=None):\n        return X\n\n    def inverse_transform(self, X):\n        return X\n\n\nclass TransfFitParams(Transf):\n    def fit(self, X, y, **fit_params):\n        self.fit_params = fit_params\n        return self\n\n\nclass Mult(BaseEstimator):\n    def __init__(self, mult=1):\n        self.mult = mult\n\n    def __sklearn_is_fitted__(self):\n        return True\n\n    def fit(self, X, y):\n        return self\n\n    def transform(self, X):\n        return np.asarray(X) * self.mult\n\n    def inverse_transform(self, X):\n        return np.asarray(X) / self.mult\n\n    def predict(self, X):\n        return (np.asarray(X) * self.mult).sum(axis=1)\n\n    predict_proba = predict_log_proba = decision_function = predict\n\n    def score(self, X, y=None):\n        return np.sum(X)\n\n\nclass FitParamT(BaseEstimator):\n    \"\"\"Mock classifier\"\"\"\n\n    def __init__(self):\n        self.successful = False\n\n    def fit(self, X, y, should_succeed=False):\n        self.fitted_ = True\n        self.successful = should_succeed\n        return self\n\n    def predict(self, X):\n        return self.successful\n\n    def fit_predict(self, X, y, should_succeed=False):\n        self.fit(X, y, should_succeed=should_succeed)\n        return self.predict(X)\n\n    def score(self, X, y=None, sample_weight=None):\n        if sample_weight is not None:\n            X = X * sample_weight\n        return np.sum(X)\n\n\nclass DummyTransf(Transf):\n    \"\"\"Transformer which store the column means\"\"\"\n\n    def fit(self, X, y):\n        self.means_ = np.mean(X, axis=0)\n      
  # store timestamp to figure out whether the result of 'fit' has been\n        # cached or not\n        self.timestamp_ = time.time()\n        return self\n\n\nclass DummyEstimatorParams(BaseEstimator):\n    \"\"\"Mock classifier that takes params on predict\"\"\"\n\n    def __sklearn_is_fitted__(self):\n        return True\n\n    def fit(self, X, y):\n        return self\n\n    def predict(self, X, got_attribute=False):\n        self.got_attribute = got_attribute\n        return self\n\n\nclass DummySampler(NoTrans):\n    \"\"\"Samplers which returns a balanced number of samples\"\"\"\n\n    def fit_resample(self, X, y):\n        self.means_ = np.mean(X, axis=0)\n        # store timestamp to figure out whether the result of 'fit' has been\n        # cached or not\n        self.timestamp_ = time.time()\n        return X, y\n\n\nclass FitTransformSample(NoTrans):\n    \"\"\"Estimator implementing both transform and sample\"\"\"\n\n    def __sklearn_is_fitted__(self):\n        return True\n\n    def fit(self, X, y, should_succeed=False):\n        pass\n\n    def fit_resample(self, X, y=None):\n        return X, y\n\n    def fit_transform(self, X, y=None):\n        return self.fit(X, y).transform(X)\n\n    def transform(self, X, y=None):\n        return X\n\n\ndef test_pipeline_init_tuple():\n    # Pipeline accepts steps as tuple\n    X = np.array([[1, 2]])\n    pipe = Pipeline(((\"transf\", Transf()), (\"clf\", FitParamT())))\n    pipe.fit(X, y=None)\n    pipe.score(X)\n    pipe.set_params(transf=\"passthrough\")\n    pipe.fit(X, y=None)\n    pipe.score(X)\n\n\ndef test_pipeline_init():\n    # Test the various init parameters of the pipeline.\n    with raises(TypeError):\n        Pipeline()\n    # Check that we can't instantiate pipelines with objects without fit\n    # method\n    X, y = load_iris(return_X_y=True)\n    error_regex = (\n        \"Last step of Pipeline should implement fit or be the string 'passthrough'\"\n    )\n    with raises(TypeError, 
match=error_regex):\n        model = Pipeline([(\"clf\", NoFit())])\n        model.fit(X, y)\n    # Smoke test with only an estimator\n    clf = NoTrans()\n    pipe = Pipeline([(\"svc\", clf)])\n    expected = dict(svc__a=None, svc__b=None, svc=clf, **pipe.get_params(deep=False))\n    assert pipe.get_params(deep=True) == expected\n\n    # Check that params are set\n    pipe.set_params(svc__a=0.1)\n    assert clf.a == 0.1\n    assert clf.b is None\n    # Smoke test the repr:\n    repr(pipe)\n\n    # Test with two objects\n    clf = SVC(gamma=\"scale\")\n    filter1 = SelectKBest(f_classif)\n    pipe = Pipeline([(\"anova\", filter1), (\"svc\", clf)])\n\n    # Check that we can't instantiate with non-transformers on the way\n    # Note that NoTrans implements fit, but not transform\n    error_regex = \"implement fit and transform or fit_resample\"\n    with raises(TypeError, match=error_regex):\n        model = Pipeline([(\"t\", NoTrans()), (\"svc\", clf)])\n        model.fit(X, y)\n\n    # Check that params are set\n    pipe.set_params(svc__C=0.1)\n    assert clf.C == 0.1\n    # Smoke test the repr:\n    repr(pipe)\n\n    # Check that params are not set when naming them wrong\n    with raises(ValueError):\n        pipe.set_params(anova__C=0.1)\n\n    # Test clone\n    pipe2 = clone(pipe)\n    assert pipe.named_steps[\"svc\"] is not pipe2.named_steps[\"svc\"]\n\n    # Check that apart from estimators, the parameters are the same\n    params = pipe.get_params(deep=True)\n    params2 = pipe2.get_params(deep=True)\n\n    for x in pipe.get_params(deep=False):\n        params.pop(x)\n\n    for x in pipe2.get_params(deep=False):\n        params2.pop(x)\n\n    # Remove estimators that where copied\n    params.pop(\"svc\")\n    params.pop(\"anova\")\n    params2.pop(\"svc\")\n    params2.pop(\"anova\")\n    assert params == params2\n\n\ndef test_pipeline_methods_anova():\n    # Test the various methods of the pipeline (anova).\n    iris = load_iris()\n    X = iris.data\n    y 
= iris.target\n    # Test with Anova + LogisticRegression\n    clf = LogisticRegression()\n    filter1 = SelectKBest(f_classif, k=2)\n    pipe = Pipeline([(\"anova\", filter1), (\"logistic\", clf)])\n    pipe.fit(X, y)\n    pipe.predict(X)\n    pipe.predict_proba(X)\n    pipe.predict_log_proba(X)\n    pipe.score(X, y)\n\n\ndef test_pipeline_fit_params():\n    # Test that the pipeline can take fit parameters\n    pipe = Pipeline([(\"transf\", Transf()), (\"clf\", FitParamT())])\n    pipe.fit(X=None, y=None, clf__should_succeed=True)\n    # classifier should return True\n    assert pipe.predict(None)\n    # and transformer params should not be changed\n    assert pipe.named_steps[\"transf\"].a is None\n    assert pipe.named_steps[\"transf\"].b is None\n    # invalid parameters should raise an error message\n    with raises(TypeError, match=\"unexpected keyword argument\"):\n        pipe.fit(None, None, clf__bad=True)\n\n\ndef test_pipeline_sample_weight_supported():\n    # Pipeline should pass sample_weight\n    X = np.array([[1, 2]])\n    pipe = Pipeline([(\"transf\", Transf()), (\"clf\", FitParamT())])\n    pipe.fit(X, y=None)\n    assert pipe.score(X) == 3\n    assert pipe.score(X, y=None) == 3\n    assert pipe.score(X, y=None, sample_weight=None) == 3\n    assert pipe.score(X, sample_weight=np.array([2, 3])) == 8\n\n\ndef test_pipeline_sample_weight_unsupported():\n    # When sample_weight is None it shouldn't be passed\n    X = np.array([[1, 2]])\n    pipe = Pipeline([(\"transf\", Transf()), (\"clf\", Mult())])\n    pipe.fit(X, y=None)\n    assert pipe.score(X) == 3\n    assert pipe.score(X, sample_weight=None) == 3\n    with raises(TypeError, match=\"unexpected keyword argument\"):\n        pipe.score(X, sample_weight=np.array([2, 3]))\n\n\ndef test_pipeline_raise_set_params_error():\n    # Test pipeline raises set params error message for nested models.\n    pipe = Pipeline([(\"cls\", LinearRegression())])\n    with raises(ValueError, match=\"Invalid 
parameter\"):\n        pipe.set_params(fake=\"nope\")\n\n    # nested model check\n    with raises(ValueError, match=\"Invalid parameter\"):\n        pipe.set_params(fake__estimator=\"nope\")\n\n\ndef test_pipeline_methods_pca_svm():\n    # Test the various methods of the pipeline (pca + svm).\n    iris = load_iris()\n    X = iris.data\n    y = iris.target\n    # Test with PCA + SVC\n    clf = SVC(gamma=\"scale\", probability=True, random_state=0)\n    pca = PCA(svd_solver=\"full\", n_components=\"mle\", whiten=True)\n    pipe = Pipeline([(\"pca\", pca), (\"svc\", clf)])\n    pipe.fit(X, y)\n    pipe.predict(X)\n    pipe.predict_proba(X)\n    pipe.predict_log_proba(X)\n    pipe.score(X, y)\n\n\ndef test_pipeline_methods_preprocessing_svm():\n    # Test the various methods of the pipeline (preprocessing + svm).\n    iris = load_iris()\n    X = iris.data\n    y = iris.target\n    n_samples = X.shape[0]\n    n_classes = len(np.unique(y))\n    scaler = StandardScaler()\n    pca = PCA(n_components=2, svd_solver=\"randomized\", whiten=True)\n    clf = SVC(\n        gamma=\"scale\",\n        probability=True,\n        random_state=0,\n        decision_function_shape=\"ovr\",\n    )\n\n    for preprocessing in [scaler, pca]:\n        pipe = Pipeline([(\"preprocess\", preprocessing), (\"svc\", clf)])\n        pipe.fit(X, y)\n\n        # check shapes of various prediction functions\n        predict = pipe.predict(X)\n        assert predict.shape == (n_samples,)\n\n        proba = pipe.predict_proba(X)\n        assert proba.shape == (n_samples, n_classes)\n\n        log_proba = pipe.predict_log_proba(X)\n        assert log_proba.shape == (n_samples, n_classes)\n\n        decision_function = pipe.decision_function(X)\n        assert decision_function.shape == (n_samples, n_classes)\n\n        pipe.score(X, y)\n\n\ndef test_fit_predict_on_pipeline():\n    # test that the fit_predict method is implemented on a pipeline\n    # test that the fit_predict on pipeline yields same 
results as applying\n    # transform and clustering steps separately\n    iris = load_iris()\n    scaler = StandardScaler()\n    km = KMeans(random_state=0, n_init=10)\n    # As pipeline doesn't clone estimators on construction,\n    # it must have its own estimators\n    scaler_for_pipeline = StandardScaler()\n    km_for_pipeline = KMeans(random_state=0, n_init=10)\n\n    # first compute the transform and clustering step separately\n    scaled = scaler.fit_transform(iris.data)\n    separate_pred = km.fit_predict(scaled)\n\n    # use a pipeline to do the transform and clustering in one step\n    pipe = Pipeline([(\"scaler\", scaler_for_pipeline), (\"Kmeans\", km_for_pipeline)])\n    pipeline_pred = pipe.fit_predict(iris.data)\n\n    assert_array_almost_equal(pipeline_pred, separate_pred)\n\n\ndef test_fit_predict_on_pipeline_without_fit_predict():\n    # tests that a pipeline does not have fit_predict method when final\n    # step of pipeline does not have fit_predict defined\n    scaler = StandardScaler()\n    pca = PCA(svd_solver=\"full\")\n    pipe = Pipeline([(\"scaler\", scaler), (\"pca\", pca)])\n    error_regex = \"has no attribute 'fit_predict'\"\n    with raises(AttributeError, match=error_regex):\n        getattr(pipe, \"fit_predict\")\n\n\ndef test_fit_predict_with_intermediate_fit_params():\n    # tests that Pipeline passes fit_params to intermediate steps\n    # when fit_predict is invoked\n    pipe = Pipeline([(\"transf\", TransfFitParams()), (\"clf\", FitParamT())])\n    pipe.fit_predict(\n        X=None, y=None, transf__should_get_this=True, clf__should_succeed=True\n    )\n    assert pipe.named_steps[\"transf\"].fit_params[\"should_get_this\"]\n    assert pipe.named_steps[\"clf\"].successful\n    assert \"should_succeed\" not in pipe.named_steps[\"transf\"].fit_params\n\n\ndef test_pipeline_transform():\n    # Test whether pipeline works with a transformer at the end.\n    # Also test pipeline.transform and pipeline.inverse_transform\n    iris = 
load_iris()\n    X = iris.data\n    pca = PCA(n_components=2, svd_solver=\"full\")\n    pipeline = Pipeline([(\"pca\", pca)])\n\n    # test transform and fit_transform:\n    X_trans = pipeline.fit(X).transform(X)\n    X_trans2 = pipeline.fit_transform(X)\n    X_trans3 = pca.fit_transform(X)\n    assert_array_almost_equal(X_trans, X_trans2)\n    assert_array_almost_equal(X_trans, X_trans3)\n\n    X_back = pipeline.inverse_transform(X_trans)\n    X_back2 = pca.inverse_transform(X_trans)\n    assert_array_almost_equal(X_back, X_back2)\n\n\ndef test_pipeline_fit_transform():\n    # Test whether pipeline works with a transformer missing fit_transform\n    iris = load_iris()\n    X = iris.data\n    y = iris.target\n    transf = Transf()\n    pipeline = Pipeline([(\"mock\", transf)])\n\n    # test fit_transform:\n    X_trans = pipeline.fit_transform(X, y)\n    X_trans2 = transf.fit(X, y).transform(X)\n    assert_array_almost_equal(X_trans, X_trans2)\n\n\ndef test_set_pipeline_steps():\n    transf1 = Transf()\n    transf2 = Transf()\n    pipeline = Pipeline([(\"mock\", transf1)])\n    assert pipeline.named_steps[\"mock\"] is transf1\n\n    # Directly setting attr\n    pipeline.steps = [(\"mock2\", transf2)]\n    assert \"mock\" not in pipeline.named_steps\n    assert pipeline.named_steps[\"mock2\"] is transf2\n    assert [(\"mock2\", transf2)] == pipeline.steps\n\n    # Using set_params\n    pipeline.set_params(steps=[(\"mock\", transf1)])\n    assert [(\"mock\", transf1)] == pipeline.steps\n\n    # Using set_params to replace single step\n    pipeline.set_params(mock=transf2)\n    assert [(\"mock\", transf2)] == pipeline.steps\n\n    # With invalid data\n    pipeline.set_params(steps=[(\"junk\", ())])\n    with raises(TypeError):\n        pipeline.fit([[1]], [1])\n    with raises(AttributeError):\n        pipeline.fit_transform([[1]], [1])\n\n\n@pytest.mark.parametrize(\"passthrough\", [None, \"passthrough\"])\ndef test_pipeline_correctly_adjusts_steps(passthrough):\n    
X = np.array([[1]])\n    y = np.array([1])\n    mult2 = Mult(mult=2)\n    mult3 = Mult(mult=3)\n    mult5 = Mult(mult=5)\n    pipeline = Pipeline(\n        [(\"m2\", mult2), (\"bad\", passthrough), (\"m3\", mult3), (\"m5\", mult5)]\n    )\n    pipeline.fit(X, y)\n    expected_names = [\"m2\", \"bad\", \"m3\", \"m5\"]\n    actual_names = [name for name, _ in pipeline.steps]\n    assert expected_names == actual_names\n\n\n@pytest.mark.parametrize(\"passthrough\", [None, \"passthrough\"])\ndef test_set_pipeline_step_passthrough(passthrough):\n    # Test setting Pipeline steps to None\n    X = np.array([[1]])\n    y = np.array([1])\n    mult2 = Mult(mult=2)\n    mult3 = Mult(mult=3)\n    mult5 = Mult(mult=5)\n\n    def make():\n        return Pipeline([(\"m2\", mult2), (\"m3\", mult3), (\"last\", mult5)])\n\n    pipeline = make()\n\n    exp = 2 * 3 * 5\n    assert_array_equal([[exp]], pipeline.fit_transform(X, y))\n    assert_array_equal([exp], pipeline.fit(X).predict(X))\n    assert_array_equal(X, pipeline.inverse_transform([[exp]]))\n\n    pipeline.set_params(m3=passthrough)\n    exp = 2 * 5\n    assert_array_equal([[exp]], pipeline.fit_transform(X, y))\n    assert_array_equal([exp], pipeline.fit(X).predict(X))\n    assert_array_equal(X, pipeline.inverse_transform([[exp]]))\n    expected_params = {\n        \"steps\": pipeline.steps,\n        \"m2\": mult2,\n        \"m3\": passthrough,\n        \"last\": mult5,\n        \"memory\": None,\n        \"m2__mult\": 2,\n        \"last__mult\": 5,\n        \"verbose\": False,\n        \"transform_input\": None,\n    }\n    assert pipeline.get_params(deep=True) == expected_params\n\n    pipeline.set_params(m2=passthrough)\n    exp = 5\n    assert_array_equal([[exp]], pipeline.fit_transform(X, y))\n    assert_array_equal([exp], pipeline.fit(X).predict(X))\n    assert_array_equal(X, pipeline.inverse_transform([[exp]]))\n\n    # for other methods, ensure no AttributeErrors on None:\n    other_methods = [\n        
\"predict_proba\",\n        \"predict_log_proba\",\n        \"decision_function\",\n        \"transform\",\n        \"score\",\n    ]\n    for method in other_methods:\n        getattr(pipeline, method)(X)\n\n    pipeline.set_params(m2=mult2)\n    exp = 2 * 5\n    assert_array_equal([[exp]], pipeline.fit_transform(X, y))\n    assert_array_equal([exp], pipeline.fit(X).predict(X))\n    assert_array_equal(X, pipeline.inverse_transform([[exp]]))\n\n    pipeline = make()\n    pipeline.set_params(last=passthrough)\n    # mult2 and mult3 are active\n    exp = 6\n    pipeline.fit(X, y)\n    pipeline.transform(X)\n    assert_array_equal([[exp]], pipeline.fit(X, y).transform(X))\n    assert_array_equal([[exp]], pipeline.fit_transform(X, y))\n    assert_array_equal(X, pipeline.inverse_transform([[exp]]))\n    with raises(AttributeError, match=\"has no attribute 'predict'\"):\n        getattr(pipeline, \"predict\")\n\n    # Check 'passthrough' step at construction time\n    exp = 2 * 5\n    pipeline = Pipeline([(\"m2\", mult2), (\"m3\", passthrough), (\"last\", mult5)])\n    assert_array_equal([[exp]], pipeline.fit_transform(X, y))\n    assert_array_equal([exp], pipeline.fit(X).predict(X))\n    assert_array_equal(X, pipeline.inverse_transform([[exp]]))\n\n\ndef test_pipeline_ducktyping():\n    pipeline = make_pipeline(Mult(5))\n    pipeline.predict\n    pipeline.transform\n    pipeline.inverse_transform\n\n    pipeline = make_pipeline(Transf())\n    assert not hasattr(pipeline, \"predict\")\n    pipeline.transform\n    pipeline.inverse_transform\n\n    pipeline = make_pipeline(\"passthrough\")\n    assert pipeline.steps[0] == (\"passthrough\", \"passthrough\")\n    assert not hasattr(pipeline, \"predict\")\n    pipeline.transform\n    pipeline.inverse_transform\n\n    pipeline = make_pipeline(Transf(), NoInvTransf())\n    assert not hasattr(pipeline, \"predict\")\n    pipeline.transform\n    assert not hasattr(pipeline, \"inverse_transform\")\n\n    pipeline = 
make_pipeline(NoInvTransf(), Transf())\n    assert not hasattr(pipeline, \"predict\")\n    pipeline.transform\n    assert not hasattr(pipeline, \"inverse_transform\")\n\n\ndef test_make_pipeline():\n    t1 = Transf()\n    t2 = Transf()\n    pipe = make_pipeline(t1, t2)\n    assert isinstance(pipe, Pipeline)\n    assert pipe.steps[0][0] == \"transf-1\"\n    assert pipe.steps[1][0] == \"transf-2\"\n\n    pipe = make_pipeline(t1, t2, FitParamT())\n    assert isinstance(pipe, Pipeline)\n    assert pipe.steps[0][0] == \"transf-1\"\n    assert pipe.steps[1][0] == \"transf-2\"\n    assert pipe.steps[2][0] == \"fitparamt\"\n\n\ndef test_classes_property():\n    iris = load_iris()\n    X = iris.data\n    y = iris.target\n\n    reg = make_pipeline(SelectKBest(k=1), LinearRegression())\n    reg.fit(X, y)\n    with raises(AttributeError):\n        getattr(reg, \"classes_\")\n\n    clf = make_pipeline(\n        SelectKBest(k=1),\n        LogisticRegression(),\n    )\n    with raises(AttributeError):\n        getattr(clf, \"classes_\")\n    clf.fit(X, y)\n    assert_array_equal(clf.classes_, np.unique(y))\n\n\ndef test_pipeline_memory_transformer():\n    iris = load_iris()\n    X = iris.data\n    y = iris.target\n    cachedir = mkdtemp()\n    try:\n        memory = Memory(cachedir, verbose=10)\n        # Test with Transformer + SVC\n        clf = SVC(gamma=\"scale\", probability=True, random_state=0)\n        transf = DummyTransf()\n        pipe = Pipeline([(\"transf\", clone(transf)), (\"svc\", clf)])\n        cached_pipe = Pipeline([(\"transf\", transf), (\"svc\", clf)], memory=memory)\n\n        # Memoize the transformer at the first fit\n        cached_pipe.fit(X, y)\n        pipe.fit(X, y)\n        # Get the timestamp of the transformer in the cached pipeline\n        expected_ts = cached_pipe.named_steps[\"transf\"].timestamp_\n        # Check that cached_pipe and pipe yield identical results\n        assert_array_equal(pipe.predict(X), cached_pipe.predict(X))\n        
assert_array_equal(pipe.predict_proba(X), cached_pipe.predict_proba(X))\n        assert_array_equal(pipe.predict_log_proba(X), cached_pipe.predict_log_proba(X))\n        assert_array_equal(pipe.score(X, y), cached_pipe.score(X, y))\n        assert_array_equal(\n            pipe.named_steps[\"transf\"].means_,\n            cached_pipe.named_steps[\"transf\"].means_,\n        )\n        assert not hasattr(transf, \"means_\")\n        # Check that we are reading the cache while fitting\n        # a second time\n        cached_pipe.fit(X, y)\n        # Check that cached_pipe and pipe yield identical results\n        assert_array_equal(pipe.predict(X), cached_pipe.predict(X))\n        assert_array_equal(pipe.predict_proba(X), cached_pipe.predict_proba(X))\n        assert_array_equal(pipe.predict_log_proba(X), cached_pipe.predict_log_proba(X))\n        assert_array_equal(pipe.score(X, y), cached_pipe.score(X, y))\n        assert_array_equal(\n            pipe.named_steps[\"transf\"].means_,\n            cached_pipe.named_steps[\"transf\"].means_,\n        )\n        assert cached_pipe.named_steps[\"transf\"].timestamp_ == expected_ts\n        # Create a new pipeline with cloned estimators\n        # Check that even changing the name step does not affect the cache hit\n        clf_2 = SVC(gamma=\"scale\", probability=True, random_state=0)\n        transf_2 = DummyTransf()\n        cached_pipe_2 = Pipeline(\n            [(\"transf_2\", transf_2), (\"svc\", clf_2)], memory=memory\n        )\n        cached_pipe_2.fit(X, y)\n\n        # Check that cached_pipe and pipe yield identical results\n        assert_array_equal(pipe.predict(X), cached_pipe_2.predict(X))\n        assert_array_equal(pipe.predict_proba(X), cached_pipe_2.predict_proba(X))\n        assert_array_equal(\n            pipe.predict_log_proba(X), cached_pipe_2.predict_log_proba(X)\n        )\n        assert_array_equal(pipe.score(X, y), cached_pipe_2.score(X, y))\n        assert_array_equal(\n            
pipe.named_steps[\"transf\"].means_,\n            cached_pipe_2.named_steps[\"transf_2\"].means_,\n        )\n        assert cached_pipe_2.named_steps[\"transf_2\"].timestamp_ == expected_ts\n    finally:\n        shutil.rmtree(cachedir)\n\n\ndef test_pipeline_memory_sampler():\n    X, y = make_classification(\n        n_classes=2,\n        class_sep=2,\n        weights=[0.1, 0.9],\n        n_informative=3,\n        n_redundant=1,\n        flip_y=0,\n        n_features=20,\n        n_clusters_per_class=1,\n        n_samples=5000,\n        random_state=0,\n    )\n    cachedir = mkdtemp()\n    try:\n        memory = Memory(cachedir, verbose=10)\n        # Test with Sampler + SVC\n        clf = SVC(gamma=\"scale\", probability=True, random_state=0)\n        transf = DummySampler()\n        pipe = Pipeline([(\"transf\", clone(transf)), (\"svc\", clf)])\n        cached_pipe = Pipeline([(\"transf\", transf), (\"svc\", clf)], memory=memory)\n\n        # Memoize the sampler at the first fit\n        cached_pipe.fit(X, y)\n        pipe.fit(X, y)\n        # Get the timestamp of the sampler in the cached pipeline\n        expected_ts = cached_pipe.named_steps[\"transf\"].timestamp_\n        # Check that cached_pipe and pipe yield identical results\n        assert_array_equal(pipe.predict(X), cached_pipe.predict(X))\n        assert_array_equal(pipe.predict_proba(X), cached_pipe.predict_proba(X))\n        assert_array_equal(pipe.predict_log_proba(X), cached_pipe.predict_log_proba(X))\n        assert_array_equal(pipe.score(X, y), cached_pipe.score(X, y))\n        assert_array_equal(\n            pipe.named_steps[\"transf\"].means_,\n            cached_pipe.named_steps[\"transf\"].means_,\n        )\n        assert not hasattr(transf, \"means_\")\n        # Check that we are reading the cache while fitting\n        # a second time\n        cached_pipe.fit(X, y)\n        # Check that cached_pipe and pipe yield identical results\n        
assert_array_equal(pipe.predict(X), cached_pipe.predict(X))\n        assert_array_equal(pipe.predict_proba(X), cached_pipe.predict_proba(X))\n        assert_array_equal(pipe.predict_log_proba(X), cached_pipe.predict_log_proba(X))\n        assert_array_equal(pipe.score(X, y), cached_pipe.score(X, y))\n        assert_array_equal(\n            pipe.named_steps[\"transf\"].means_,\n            cached_pipe.named_steps[\"transf\"].means_,\n        )\n        assert cached_pipe.named_steps[\"transf\"].timestamp_ == expected_ts\n        # Create a new pipeline with cloned estimators\n        # Check that even changing the name step does not affect the cache hit\n        clf_2 = SVC(gamma=\"scale\", probability=True, random_state=0)\n        transf_2 = DummySampler()\n        cached_pipe_2 = Pipeline(\n            [(\"transf_2\", transf_2), (\"svc\", clf_2)], memory=memory\n        )\n        cached_pipe_2.fit(X, y)\n\n        # Check that cached_pipe and pipe yield identical results\n        assert_array_equal(pipe.predict(X), cached_pipe_2.predict(X))\n        assert_array_equal(pipe.predict_proba(X), cached_pipe_2.predict_proba(X))\n        assert_array_equal(\n            pipe.predict_log_proba(X), cached_pipe_2.predict_log_proba(X)\n        )\n        assert_array_equal(pipe.score(X, y), cached_pipe_2.score(X, y))\n        assert_array_equal(\n            pipe.named_steps[\"transf\"].means_,\n            cached_pipe_2.named_steps[\"transf_2\"].means_,\n        )\n        assert cached_pipe_2.named_steps[\"transf_2\"].timestamp_ == expected_ts\n    finally:\n        shutil.rmtree(cachedir)\n\n\ndef test_pipeline_methods_pca_rus_svm():\n    # Test the various methods of the pipeline (pca + svm).\n    X, y = make_classification(\n        n_classes=2,\n        class_sep=2,\n        weights=[0.1, 0.9],\n        n_informative=3,\n        n_redundant=1,\n        flip_y=0,\n        n_features=20,\n        n_clusters_per_class=1,\n        n_samples=5000,\n        
random_state=0,\n    )\n\n    # Test with PCA + SVC\n    clf = SVC(gamma=\"scale\", probability=True, random_state=0)\n    pca = PCA()\n    rus = RandomUnderSampler(random_state=0)\n    pipe = Pipeline([(\"pca\", pca), (\"rus\", rus), (\"svc\", clf)])\n    pipe.fit(X, y)\n    pipe.predict(X)\n    pipe.predict_proba(X)\n    pipe.predict_log_proba(X)\n    pipe.score(X, y)\n\n\ndef test_pipeline_methods_rus_pca_svm():\n    # Test the various methods of the pipeline (pca + svm).\n    X, y = make_classification(\n        n_classes=2,\n        class_sep=2,\n        weights=[0.1, 0.9],\n        n_informative=3,\n        n_redundant=1,\n        flip_y=0,\n        n_features=20,\n        n_clusters_per_class=1,\n        n_samples=5000,\n        random_state=0,\n    )\n\n    # Test with PCA + SVC\n    clf = SVC(gamma=\"scale\", probability=True, random_state=0)\n    pca = PCA()\n    rus = RandomUnderSampler(random_state=0)\n    pipe = Pipeline([(\"rus\", rus), (\"pca\", pca), (\"svc\", clf)])\n    pipe.fit(X, y)\n    pipe.predict(X)\n    pipe.predict_proba(X)\n    pipe.predict_log_proba(X)\n    pipe.score(X, y)\n\n\ndef test_pipeline_sample():\n    # Test whether pipeline works with a sampler at the end.\n    # Also test pipeline.sampler\n    X, y = make_classification(\n        n_classes=2,\n        class_sep=2,\n        weights=[0.1, 0.9],\n        n_informative=3,\n        n_redundant=1,\n        flip_y=0,\n        n_features=20,\n        n_clusters_per_class=1,\n        n_samples=5000,\n        random_state=0,\n    )\n\n    rus = RandomUnderSampler(random_state=0)\n    pipeline = Pipeline([(\"rus\", rus)])\n\n    # test transform and fit_transform:\n    X_trans, y_trans = pipeline.fit_resample(X, y)\n    X_trans2, y_trans2 = rus.fit_resample(X, y)\n    assert_allclose(X_trans, X_trans2, rtol=R_TOL)\n    assert_allclose(y_trans, y_trans2, rtol=R_TOL)\n\n    pca = PCA()\n    pipeline = Pipeline([(\"pca\", PCA()), (\"rus\", rus)])\n\n    X_trans, y_trans = 
pipeline.fit_resample(X, y)\n    X_pca = pca.fit_transform(X)\n    X_trans2, y_trans2 = rus.fit_resample(X_pca, y)\n    # We round the value near to zero. It seems that PCA has some issue\n    # with that\n    X_trans[np.bitwise_and(X_trans < R_TOL, X_trans > -R_TOL)] = 0\n    X_trans2[np.bitwise_and(X_trans2 < R_TOL, X_trans2 > -R_TOL)] = 0\n    assert_allclose(X_trans, X_trans2, rtol=R_TOL)\n    assert_allclose(y_trans, y_trans2, rtol=R_TOL)\n\n\ndef test_pipeline_sample_transform():\n    # Test whether pipeline works with a sampler at the end.\n    # Also test pipeline.sampler\n    X, y = make_classification(\n        n_classes=2,\n        class_sep=2,\n        weights=[0.1, 0.9],\n        n_informative=3,\n        n_redundant=1,\n        flip_y=0,\n        n_features=20,\n        n_clusters_per_class=1,\n        n_samples=5000,\n        random_state=0,\n    )\n\n    rus = RandomUnderSampler(random_state=0)\n    pca = PCA()\n    pca2 = PCA()\n    pipeline = Pipeline([(\"pca\", pca), (\"rus\", rus), (\"pca2\", pca2)])\n\n    pipeline.fit(X, y).transform(X)\n\n\ndef test_pipeline_none_classifier():\n    # Test pipeline using None as preprocessing step and a classifier\n    X, y = make_classification(\n        n_classes=2,\n        class_sep=2,\n        weights=[0.1, 0.9],\n        n_informative=3,\n        n_redundant=1,\n        flip_y=0,\n        n_features=20,\n        n_clusters_per_class=1,\n        n_samples=5000,\n        random_state=0,\n    )\n    clf = LogisticRegression(solver=\"lbfgs\", random_state=0)\n    pipe = make_pipeline(None, clf)\n    pipe.fit(X, y)\n    pipe.predict(X)\n    pipe.predict_proba(X)\n    pipe.decision_function(X)\n    pipe.score(X, y)\n\n\ndef test_pipeline_none_sampler_classifier():\n    # Test pipeline using None, RUS and a classifier\n    X, y = make_classification(\n        n_classes=2,\n        class_sep=2,\n        weights=[0.1, 0.9],\n        n_informative=3,\n        n_redundant=1,\n        flip_y=0,\n        
n_features=20,\n        n_clusters_per_class=1,\n        n_samples=5000,\n        random_state=0,\n    )\n    clf = LogisticRegression(solver=\"lbfgs\", random_state=0)\n    rus = RandomUnderSampler(random_state=0)\n    pipe = make_pipeline(None, rus, clf)\n    pipe.fit(X, y)\n    pipe.predict(X)\n    pipe.predict_proba(X)\n    pipe.decision_function(X)\n    pipe.score(X, y)\n\n\ndef test_pipeline_sampler_none_classifier():\n    # Test pipeline using RUS, None and a classifier\n    X, y = make_classification(\n        n_classes=2,\n        class_sep=2,\n        weights=[0.1, 0.9],\n        n_informative=3,\n        n_redundant=1,\n        flip_y=0,\n        n_features=20,\n        n_clusters_per_class=1,\n        n_samples=5000,\n        random_state=0,\n    )\n    clf = LogisticRegression(solver=\"lbfgs\", random_state=0)\n    rus = RandomUnderSampler(random_state=0)\n    pipe = make_pipeline(rus, None, clf)\n    pipe.fit(X, y)\n    pipe.predict(X)\n    pipe.predict_proba(X)\n    pipe.decision_function(X)\n    pipe.score(X, y)\n\n\ndef test_pipeline_none_sampler_sample():\n    # Test pipeline using None step and a sampler\n    X, y = make_classification(\n        n_classes=2,\n        class_sep=2,\n        weights=[0.1, 0.9],\n        n_informative=3,\n        n_redundant=1,\n        flip_y=0,\n        n_features=20,\n        n_clusters_per_class=1,\n        n_samples=5000,\n        random_state=0,\n    )\n\n    rus = RandomUnderSampler(random_state=0)\n    pipe = make_pipeline(None, rus)\n    pipe.fit_resample(X, y)\n\n\ndef test_pipeline_none_transformer():\n    # Test pipeline using None and a transformer that implements transform and\n    # inverse_transform\n    X, y = make_classification(\n        n_classes=2,\n        class_sep=2,\n        weights=[0.1, 0.9],\n        n_informative=3,\n        n_redundant=1,\n        flip_y=0,\n        n_features=20,\n        n_clusters_per_class=1,\n        n_samples=5000,\n        random_state=0,\n    )\n\n    pca = 
PCA(whiten=True)\n    pipe = make_pipeline(None, pca)\n    pipe.fit(X, y)\n    X_trans = pipe.transform(X)\n    X_inversed = pipe.inverse_transform(X_trans)\n    assert_array_almost_equal(X, X_inversed)\n\n\ndef test_pipeline_methods_anova_rus():\n    # Test the various methods of the pipeline (anova).\n    X, y = make_classification(\n        n_classes=2,\n        class_sep=2,\n        weights=[0.1, 0.9],\n        n_informative=3,\n        n_redundant=1,\n        flip_y=0,\n        n_features=20,\n        n_clusters_per_class=1,\n        n_samples=5000,\n        random_state=0,\n    )\n    # Test with RandomUnderSampling + Anova + LogisticRegression\n    clf = LogisticRegression(solver=\"lbfgs\")\n    rus = RandomUnderSampler(random_state=0)\n    filter1 = SelectKBest(f_classif, k=2)\n    pipe = Pipeline([(\"rus\", rus), (\"anova\", filter1), (\"logistic\", clf)])\n    pipe.fit(X, y)\n    pipe.predict(X)\n    pipe.predict_proba(X)\n    pipe.predict_log_proba(X)\n    pipe.score(X, y)\n\n\ndef test_pipeline_with_step_that_implements_both_sample_and_transform():\n    # Test the various methods of the pipeline (anova).\n    X, y = make_classification(\n        n_classes=2,\n        class_sep=2,\n        weights=[0.1, 0.9],\n        n_informative=3,\n        n_redundant=1,\n        flip_y=0,\n        n_features=20,\n        n_clusters_per_class=1,\n        n_samples=5000,\n        random_state=0,\n    )\n\n    clf = LogisticRegression(solver=\"lbfgs\")\n    with raises(TypeError):\n        pipeline = Pipeline([(\"step\", FitTransformSample()), (\"logistic\", clf)])\n        pipeline.fit(X, y)\n\n\ndef test_pipeline_with_step_that_it_is_pipeline():\n    # Test the various methods of the pipeline (anova).\n    X, y = make_classification(\n        n_classes=2,\n        class_sep=2,\n        weights=[0.1, 0.9],\n        n_informative=3,\n        n_redundant=1,\n        flip_y=0,\n        n_features=20,\n        n_clusters_per_class=1,\n        n_samples=5000,\n        
random_state=0,\n    )\n    # Test with RandomUnderSampling + Anova + LogisticRegression\n    clf = LogisticRegression(solver=\"lbfgs\")\n    rus = RandomUnderSampler(random_state=0)\n    filter1 = SelectKBest(f_classif, k=2)\n    pipe1 = Pipeline([(\"rus\", rus), (\"anova\", filter1)])\n    with raises(TypeError):\n        pipe2 = Pipeline([(\"pipe1\", pipe1), (\"logistic\", clf)])\n        pipe2.fit(X, y)\n\n\ndef test_pipeline_fit_then_sample_with_sampler_last_estimator():\n    X, y = make_classification(\n        n_classes=2,\n        class_sep=2,\n        weights=[0.1, 0.9],\n        n_informative=3,\n        n_redundant=1,\n        flip_y=0,\n        n_features=20,\n        n_clusters_per_class=1,\n        n_samples=50000,\n        random_state=0,\n    )\n\n    rus = RandomUnderSampler(random_state=42)\n    enn = ENN()\n    pipeline = make_pipeline(rus, enn)\n    X_fit_resample_resampled, y_fit_resample_resampled = pipeline.fit_resample(X, y)\n    pipeline = make_pipeline(rus, enn)\n    pipeline.fit(X, y)\n    X_fit_then_sample_res, y_fit_then_sample_res = pipeline.fit_resample(X, y)\n    assert_array_equal(X_fit_resample_resampled, X_fit_then_sample_res)\n    assert_array_equal(y_fit_resample_resampled, y_fit_then_sample_res)\n\n\ndef test_pipeline_fit_then_sample_3_samplers_with_sampler_last_estimator():\n    X, y = make_classification(\n        n_classes=2,\n        class_sep=2,\n        weights=[0.1, 0.9],\n        n_informative=3,\n        n_redundant=1,\n        flip_y=0,\n        n_features=20,\n        n_clusters_per_class=1,\n        n_samples=50000,\n        random_state=0,\n    )\n\n    rus = RandomUnderSampler(random_state=42)\n    enn = ENN()\n    pipeline = make_pipeline(rus, enn, rus)\n    X_fit_resample_resampled, y_fit_resample_resampled = pipeline.fit_resample(X, y)\n    pipeline = make_pipeline(rus, enn, rus)\n    pipeline.fit(X, y)\n    X_fit_then_sample_res, y_fit_then_sample_res = pipeline.fit_resample(X, y)\n    
assert_array_equal(X_fit_resample_resampled, X_fit_then_sample_res)\n    assert_array_equal(y_fit_resample_resampled, y_fit_then_sample_res)\n\n\ndef test_make_pipeline_memory():\n    cachedir = mkdtemp()\n    try:\n        memory = Memory(cachedir, verbose=10)\n        pipeline = make_pipeline(DummyTransf(), SVC(gamma=\"scale\"), memory=memory)\n        assert pipeline.memory is memory\n        pipeline = make_pipeline(DummyTransf(), SVC(gamma=\"scale\"))\n        assert pipeline.memory is None\n    finally:\n        shutil.rmtree(cachedir)\n\n\ndef test_predict_with_predict_params():\n    # tests that Pipeline passes predict_params to the final estimator\n    # when predict is invoked\n    pipe = Pipeline([(\"transf\", Transf()), (\"clf\", DummyEstimatorParams())])\n    pipe.fit(None, None)\n    pipe.predict(X=None, got_attribute=True)\n    assert pipe.named_steps[\"clf\"].got_attribute\n\n\ndef test_resampler_last_stage_passthrough():\n    X, y = make_classification(\n        n_classes=2,\n        class_sep=2,\n        weights=[0.1, 0.9],\n        n_informative=3,\n        n_redundant=1,\n        flip_y=0,\n        n_features=20,\n        n_clusters_per_class=1,\n        n_samples=50000,\n        random_state=0,\n    )\n\n    rus = RandomUnderSampler(random_state=42)\n    pipe = make_pipeline(rus, None)\n    pipe.fit_resample(X, y)\n\n\ndef test_pipeline_score_samples_pca_lof_binary():\n    X, y = make_classification(\n        n_classes=2,\n        class_sep=2,\n        weights=[0.3, 0.7],\n        n_informative=3,\n        n_redundant=1,\n        flip_y=0,\n        n_features=20,\n        n_clusters_per_class=1,\n        n_samples=500,\n        random_state=0,\n    )\n    # Test that the score_samples method is implemented on a pipeline.\n    # Test that the score_samples method on pipeline yields same results as\n    # applying transform and score_samples steps separately.\n    rus = RandomUnderSampler(random_state=42)\n    pca = PCA(svd_solver=\"full\", 
n_components=\"mle\", whiten=True)\n    lof = LocalOutlierFactor(novelty=True)\n    pipe = Pipeline([(\"rus\", rus), (\"pca\", pca), (\"lof\", lof)])\n    pipe.fit(X, y)\n    # Check the shapes\n    assert pipe.score_samples(X).shape == (X.shape[0],)\n    # Check the values\n    X_res, _ = rus.fit_resample(X, y)\n    lof.fit(pca.fit_transform(X_res))\n    assert_allclose(pipe.score_samples(X), lof.score_samples(pca.transform(X)))\n\n\ndef test_score_samples_on_pipeline_without_score_samples():\n    X = np.array([[1], [2]])\n    y = np.array([1, 2])\n    # Test that a pipeline does not have score_samples method when the final\n    # step of the pipeline does not have score_samples defined.\n    pipe = make_pipeline(LogisticRegression())\n    pipe.fit(X, y)\n    with pytest.raises(\n        AttributeError,\n        match=\"has no attribute 'score_samples'\",\n    ):\n        pipe.score_samples(X)\n\n\ndef test_pipeline_param_error():\n    clf = make_pipeline(LogisticRegression())\n    with pytest.raises(\n        ValueError,\n        match=\"Pipeline.fit does not accept the sample_weight parameter\",\n    ):\n        clf.fit([[0], [0]], [0, 1], sample_weight=[1, 1])\n\n\nparameter_grid_test_verbose = (\n    (est, pattern, method)\n    for (est, pattern), method in itertools.product(\n        [\n            (\n                Pipeline([(\"transf\", Transf()), (\"clf\", FitParamT())]),\n                r\"\\[Pipeline\\].*\\(step 1 of 2\\) Processing transf.* total=.*\\n\"\n                r\"\\[Pipeline\\].*\\(step 2 of 2\\) Processing clf.* total=.*\\n$\",\n            ),\n            (\n                Pipeline([(\"transf\", Transf()), (\"noop\", None), (\"clf\", FitParamT())]),\n                r\"\\[Pipeline\\].*\\(step 1 of 3\\) Processing transf.* total=.*\\n\"\n                r\"\\[Pipeline\\].*\\(step 2 of 3\\) Processing noop.* total=.*\\n\"\n                r\"\\[Pipeline\\].*\\(step 3 of 3\\) Processing clf.* total=.*\\n$\",\n            ),\n            (\n 
               Pipeline(\n                    [\n                        (\"transf\", Transf()),\n                        (\"noop\", \"passthrough\"),\n                        (\"clf\", FitParamT()),\n                    ]\n                ),\n                r\"\\[Pipeline\\].*\\(step 1 of 3\\) Processing transf.* total=.*\\n\"\n                r\"\\[Pipeline\\].*\\(step 2 of 3\\) Processing noop.* total=.*\\n\"\n                r\"\\[Pipeline\\].*\\(step 3 of 3\\) Processing clf.* total=.*\\n$\",\n            ),\n            (\n                Pipeline([(\"transf\", Transf()), (\"clf\", None)]),\n                r\"\\[Pipeline\\].*\\(step 1 of 2\\) Processing transf.* total=.*\\n\"\n                r\"\\[Pipeline\\].*\\(step 2 of 2\\) Processing clf.* total=.*\\n$\",\n            ),\n            (\n                Pipeline([(\"transf\", None), (\"mult\", Mult())]),\n                r\"\\[Pipeline\\].*\\(step 1 of 2\\) Processing transf.* total=.*\\n\"\n                r\"\\[Pipeline\\].*\\(step 2 of 2\\) Processing mult.* total=.*\\n$\",\n            ),\n            (\n                Pipeline([(\"transf\", \"passthrough\"), (\"mult\", Mult())]),\n                r\"\\[Pipeline\\].*\\(step 1 of 2\\) Processing transf.* total=.*\\n\"\n                r\"\\[Pipeline\\].*\\(step 2 of 2\\) Processing mult.* total=.*\\n$\",\n            ),\n            (\n                FeatureUnion([(\"mult1\", Mult()), (\"mult2\", Mult())]),\n                r\"\\[FeatureUnion\\].*\\(step 1 of 2\\) Processing mult1.* total=.*\\n\"\n                r\"\\[FeatureUnion\\].*\\(step 2 of 2\\) Processing mult2.* total=.*\\n$\",\n            ),\n            (\n                FeatureUnion([(\"mult1\", \"drop\"), (\"mult2\", Mult()), (\"mult3\", \"drop\")]),\n                r\"\\[FeatureUnion\\].*\\(step 1 of 1\\) Processing mult2.* total=.*\\n$\",\n            ),\n        ],\n        [\"fit\", \"fit_transform\", \"fit_predict\"],\n    )\n    if hasattr(est, method)\n    and not (\n       
 method == \"fit_transform\"\n        and hasattr(est, \"steps\")\n        and isinstance(est.steps[-1][1], FitParamT)\n    )\n)\n\n\n@pytest.mark.parametrize(\"est, pattern, method\", parameter_grid_test_verbose)\ndef test_verbose(est, method, pattern, capsys):\n    func = getattr(est, method)\n\n    X = [[1, 2, 3], [4, 5, 6]]\n    y = [[7], [8]]\n\n    est.set_params(verbose=False)\n    func(X, y)\n    assert not capsys.readouterr().out, \"Got output for verbose=False\"\n\n    est.set_params(verbose=True)\n    func(X, y)\n    assert re.match(pattern, capsys.readouterr().out)\n\n\ndef test_pipeline_score_samples_pca_lof_multiclass():\n    X, y = load_iris(return_X_y=True)\n    sampling_strategy = {0: 50, 1: 30, 2: 20}\n    X, y = make_imbalance(X, y, sampling_strategy=sampling_strategy)\n    # Test that the score_samples method is implemented on a pipeline.\n    # Test that the score_samples method on pipeline yields same results as\n    # applying transform and score_samples steps separately.\n    rus = RandomUnderSampler()\n    pca = PCA(svd_solver=\"full\", n_components=\"mle\", whiten=True)\n    lof = LocalOutlierFactor(novelty=True)\n    pipe = Pipeline([(\"rus\", rus), (\"pca\", pca), (\"lof\", lof)])\n    pipe.fit(X, y)\n    # Check the shapes\n    assert pipe.score_samples(X).shape == (X.shape[0],)\n    # Check the values\n    lof.fit(pca.fit_transform(X))\n    assert_allclose(pipe.score_samples(X), lof.score_samples(pca.transform(X)))\n\n\ndef test_pipeline_param_validation():\n    model = Pipeline(\n        [(\"sampler\", RandomUnderSampler()), (\"classifier\", LogisticRegression())]\n    )\n    check_param_validation(\"Pipeline\", model)\n\n\ndef test_pipeline_with_set_output():\n    pd = pytest.importorskip(\"pandas\")\n    X, y = load_iris(return_X_y=True, as_frame=True)\n    pipeline = make_pipeline(\n        StandardScaler(), RandomUnderSampler(), LogisticRegression()\n    ).set_output(transform=\"default\")\n    pipeline.fit(X, y)\n\n    X_res, 
y_res = pipeline[:-1].fit_resample(X, y)\n    assert isinstance(X_res, np.ndarray)\n    # transformer will not change `y` and sampler will always preserve the type of `y`\n    assert isinstance(y_res, type(y))\n\n    pipeline.set_output(transform=\"pandas\")\n    X_res, y_res = pipeline[:-1].fit_resample(X, y)\n\n    assert isinstance(X_res, pd.DataFrame)\n    # transformer will not change `y` and sampler will always preserve the type of `y`\n    assert isinstance(y_res, type(y))\n\n\n# TODO(0.15): change warning to checking for NotFittedError\n@pytest.mark.parametrize(\n    \"method\",\n    [\n        \"predict\",\n        \"predict_proba\",\n        \"predict_log_proba\",\n        \"decision_function\",\n        \"score\",\n        \"score_samples\",\n        \"transform\",\n        \"inverse_transform\",\n    ],\n)\ndef test_pipeline_warns_not_fitted(method):\n    class StatelessEstimator(BaseEstimator):\n        \"\"\"Stateless estimator that doesn't check if it's fitted.\n        Stateless estimators that don't require fit, should properly set the\n        `requires_fit` flag and implement a `__sklearn_check_is_fitted__` returning\n        `True`.\n        \"\"\"\n\n        def fit(self, X, y):\n            return self  # pragma: no cover\n\n        def transform(self, X):\n            return X\n\n        def predict(self, X):\n            return np.ones(len(X))\n\n        def predict_proba(self, X):\n            return np.ones(len(X))\n\n        def predict_log_proba(self, X):\n            return np.zeros(len(X))\n\n        def decision_function(self, X):\n            return np.ones(len(X))\n\n        def score(self, X, y):\n            return 1\n\n        def score_samples(self, X):\n            return np.ones(len(X))\n\n        def inverse_transform(self, X):\n            return X\n\n    pipe = Pipeline([(\"estimator\", StatelessEstimator())])\n    with pytest.warns(FutureWarning, match=\"This Pipeline instance is not fitted yet.\"):\n        getattr(pipe, 
method)([[1]])\n\n\n# transform_input tests\n# =====================\n\n\n@pytest.mark.skipif(\n    sklearn_version < parse_version(\"1.4\"),\n    reason=\"scikit-learn < 1.4 does not support transform_input\",\n)\n@config_context(enable_metadata_routing=True)\ndef test_transform_input_explicit_value_check():\n    \"\"\"Test that the right transformed values are passed to `fit`.\"\"\"\n\n    class Transformer(TransformerMixin, BaseEstimator):\n        def fit(self, X, y):\n            self.fitted_ = True\n            return self\n\n        def transform(self, X):\n            return X + 1\n\n    class Estimator(ClassifierMixin, BaseEstimator):\n        def fit(self, X, y, X_val=None, y_val=None):\n            assert_array_equal(X, np.array([[1, 2]]))\n            assert_array_equal(y, np.array([0, 1]))\n            assert_array_equal(X_val, np.array([[2, 3]]))\n            assert_array_equal(y_val, np.array([0, 1]))\n            return self\n\n    X = np.array([[0, 1]])\n    y = np.array([0, 1])\n    X_val = np.array([[1, 2]])\n    y_val = np.array([0, 1])\n    pipe = Pipeline(\n        [\n            (\"transformer\", Transformer()),\n            (\"estimator\", Estimator().set_fit_request(X_val=True, y_val=True)),\n        ],\n        transform_input=[\"X_val\"],\n    )\n    pipe.fit(X, y, X_val=X_val, y_val=y_val)\n\n\ndef test_transform_input_no_slep6():\n    \"\"\"Make sure the right error is raised if slep6 is not enabled.\"\"\"\n    X = np.array([[1, 2], [3, 4]])\n    y = np.array([0, 1])\n    msg = \"The `transform_input` parameter can only be set if metadata\"\n    with pytest.raises(ValueError, match=msg):\n        make_pipeline(DummyTransf(), transform_input=[\"blah\"]).fit(X, y)\n\n\n@pytest.mark.skipif(\n    sklearn_version >= parse_version(\"1.4\"),\n    reason=\"scikit-learn >= 1.4 supports transform_input\",\n)\n@config_context(enable_metadata_routing=True)\ndef test_transform_input_sklearn_version():\n    \"\"\"Test that transform_input raises 
error with sklearn < 1.4.\"\"\"\n    X = np.array([[1, 2], [3, 4]])\n    y = np.array([0, 1])\n    msg = (\n        \"The `transform_input` parameter is not supported in scikit-learn versions \"\n        \"prior to 1.4\"\n    )\n    with pytest.raises(ValueError, match=msg):\n        make_pipeline(DummyTransf(), transform_input=[\"blah\"]).fit(X, y)\n\n\n# end of transform_input tests\n# =============================\n\n\ndef test_metadata_routing_with_sampler():\n    \"\"\"Check that we can use a sampler with metadata routing.\"\"\"\n    X, y = make_classification()\n    cost_matrix = np.random.rand(X.shape[0], 2, 2)\n\n    class CostSensitiveSampler(BaseSampler):\n        def fit_resample(self, X, y, cost_matrix=None):\n            return self._fit_resample(X, y, cost_matrix=cost_matrix)\n\n        def _fit_resample(self, X, y, cost_matrix=None):\n            self.cost_matrix_ = cost_matrix\n            return X, y\n\n    with config_context(enable_metadata_routing=True):\n        sampler = CostSensitiveSampler().set_fit_resample_request(cost_matrix=True)\n        pipeline = Pipeline([(\"sampler\", sampler), (\"model\", LogisticRegression())])\n        pipeline.fit(X, y, cost_matrix=cost_matrix)\n\n        assert_allclose(pipeline[0].cost_matrix_, cost_matrix)\n"
  },
  {
    "path": "imblearn/tests/test_public_functions.py",
    "content": "\"\"\"This is a copy of sklearn/tests/test_public_functions.py. It can be\nremoved when we support scikit-learn >= 1.2.\n\"\"\"\nfrom importlib import import_module\nfrom inspect import signature\n\nimport pytest\nfrom sklearn.utils._param_validation import (\n    generate_invalid_param_val,\n    generate_valid_param,\n    make_constraint,\n)\n\nPARAM_VALIDATION_FUNCTION_LIST = [\n    \"imblearn.datasets.fetch_datasets\",\n    \"imblearn.datasets.make_imbalance\",\n    \"imblearn.metrics.classification_report_imbalanced\",\n    \"imblearn.metrics.geometric_mean_score\",\n    \"imblearn.metrics.macro_averaged_mean_absolute_error\",\n    \"imblearn.metrics.make_index_balanced_accuracy\",\n    \"imblearn.metrics.sensitivity_specificity_support\",\n    \"imblearn.metrics.sensitivity_score\",\n    \"imblearn.metrics.specificity_score\",\n    \"imblearn.pipeline.make_pipeline\",\n]\n\n\n@pytest.mark.parametrize(\"func_module\", PARAM_VALIDATION_FUNCTION_LIST)\ndef test_function_param_validation(func_module):\n    \"\"\"Check that an informative error is raised when the value of a parameter does not\n    have an appropriate type or value.\n    \"\"\"\n    module_name, func_name = func_module.rsplit(\".\", 1)\n    module = import_module(module_name)\n    func = getattr(module, func_name)\n\n    func_sig = signature(func)\n    func_params = [\n        p.name\n        for p in func_sig.parameters.values()\n        if p.kind not in (p.VAR_POSITIONAL, p.VAR_KEYWORD)\n    ]\n    parameter_constraints = getattr(func, \"_skl_parameter_constraints\")\n\n    # Generate valid values for the required parameters\n    # The parameters `*args` and `**kwargs` are ignored since we cannot generate\n    # constraints.\n    required_params = [\n        p.name\n        for p in func_sig.parameters.values()\n        if p.default is p.empty and p.kind not in (p.VAR_POSITIONAL, p.VAR_KEYWORD)\n    ]\n    valid_required_params = {}\n    for param_name in required_params:\n        
if parameter_constraints[param_name] == \"no_validation\":\n            valid_required_params[param_name] = 1\n        else:\n            valid_required_params[param_name] = generate_valid_param(\n                make_constraint(parameter_constraints[param_name][0])\n            )\n\n    # check that there is a constraint for each parameter\n    if func_params:\n        validation_params = parameter_constraints.keys()\n        unexpected_params = set(validation_params) - set(func_params)\n        missing_params = set(func_params) - set(validation_params)\n        err_msg = (\n            \"Mismatch between _parameter_constraints and the parameters of\"\n            f\" {func_name}.\\nConsider the unexpected parameters {unexpected_params} and\"\n            f\" expected but missing parameters {missing_params}\\n\"\n        )\n        assert set(validation_params) == set(func_params), err_msg\n\n    # this object does not have a valid type for sure for all params\n    param_with_bad_type = type(\"BadType\", (), {})()\n\n    for param_name in func_params:\n        constraints = parameter_constraints[param_name]\n\n        if constraints == \"no_validation\":\n            # This parameter is not validated\n            continue\n\n        match = (\n            rf\"The '{param_name}' parameter of {func_name} must be .* Got .* instead.\"\n        )\n\n        # First, check that the error is raised if param doesn't match any valid type.\n        with pytest.raises(ValueError, match=match):\n            func(**{**valid_required_params, param_name: param_with_bad_type})\n\n        # Then, for constraints that are more than a type constraint, check that the\n        # error is raised if param does match a valid type but does not match any valid\n        # value for this type.\n        constraints = [make_constraint(constraint) for constraint in constraints]\n\n        for constraint in constraints:\n            try:\n                bad_value = 
generate_invalid_param_val(constraint)\n            except NotImplementedError:\n                continue\n\n            with pytest.raises(ValueError, match=match):\n                func(**{**valid_required_params, param_name: bad_value})\n"
  },
  {
    "path": "imblearn/under_sampling/__init__.py",
    "content": "\"\"\"\nThe :mod:`imblearn.under_sampling` provides methods to under-sample\na dataset.\n\"\"\"\n\nfrom imblearn.under_sampling._prototype_generation import ClusterCentroids\nfrom imblearn.under_sampling._prototype_selection import (\n    AllKNN,\n    CondensedNearestNeighbour,\n    EditedNearestNeighbours,\n    InstanceHardnessThreshold,\n    NearMiss,\n    NeighbourhoodCleaningRule,\n    OneSidedSelection,\n    RandomUnderSampler,\n    RepeatedEditedNearestNeighbours,\n    TomekLinks,\n)\n\n__all__ = [\n    \"ClusterCentroids\",\n    \"RandomUnderSampler\",\n    \"InstanceHardnessThreshold\",\n    \"NearMiss\",\n    \"TomekLinks\",\n    \"EditedNearestNeighbours\",\n    \"RepeatedEditedNearestNeighbours\",\n    \"AllKNN\",\n    \"OneSidedSelection\",\n    \"CondensedNearestNeighbour\",\n    \"NeighbourhoodCleaningRule\",\n]\n"
  },
  {
    "path": "imblearn/under_sampling/_prototype_generation/__init__.py",
    "content": "\"\"\"\nThe :mod:`imblearn.under_sampling.prototype_generation` submodule contains\nmethods that generate new samples in order to balance the dataset.\n\"\"\"\n\nfrom imblearn.under_sampling._prototype_generation._cluster_centroids import (\n    ClusterCentroids,\n)\n\n__all__ = [\"ClusterCentroids\"]\n"
  },
  {
    "path": "imblearn/under_sampling/_prototype_generation/_cluster_centroids.py",
    "content": "\"\"\"Class to perform under-sampling by generating centroids based on\nclustering.\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Fernando Nogueira\n#          Christos Aridas\n# License: MIT\n\nimport numpy as np\nfrom scipy import sparse\nfrom sklearn.base import clone\nfrom sklearn.cluster import KMeans\nfrom sklearn.neighbors import NearestNeighbors\nfrom sklearn.utils import _safe_indexing\nfrom sklearn.utils._param_validation import HasMethods, StrOptions\n\nfrom imblearn.under_sampling.base import BaseUnderSampler\nfrom imblearn.utils import Substitution\nfrom imblearn.utils._docstring import _random_state_docstring\n\nVOTING_KIND = (\"auto\", \"hard\", \"soft\")\n\n\n@Substitution(\n    sampling_strategy=BaseUnderSampler._sampling_strategy_docstring,\n    random_state=_random_state_docstring,\n)\nclass ClusterCentroids(BaseUnderSampler):\n    \"\"\"Undersample by generating centroids based on clustering methods.\n\n    Method that under samples the majority class by replacing a\n    cluster of majority samples by the cluster centroid of a KMeans\n    algorithm.  This algorithm keeps N majority samples by fitting the\n    KMeans algorithm with N cluster to the majority class and using\n    the coordinates of the N cluster centroids as the new majority\n    samples.\n\n    Read more in the :ref:`User Guide <cluster_centroids>`.\n\n    Parameters\n    ----------\n    {sampling_strategy}\n\n    {random_state}\n\n    estimator : estimator object, default=None\n        A scikit-learn compatible clustering method that exposes a `n_clusters`\n        parameter and a `cluster_centers_` fitted attribute. 
By default, it will\n        be a default :class:`~sklearn.cluster.KMeans` estimator.\n\n    voting : {{\"hard\", \"soft\", \"auto\"}}, default='auto'\n        Voting strategy to generate the new samples:\n\n        - If ``'hard'``, the nearest-neighbors of the centroids found using the\n          clustering algorithm will be used.\n        - If ``'soft'``, the centroids found by the clustering algorithm will\n          be used.\n        - If ``'auto'``, if the input is sparse, it will default on ``'hard'``\n          otherwise, ``'soft'`` will be used.\n\n        .. versionadded:: 0.3.0\n\n    Attributes\n    ----------\n    sampling_strategy_ : dict\n        Dictionary containing the information to sample the dataset. The keys\n        corresponds to the class labels from which to sample and the values\n        are the number of samples to sample.\n\n    estimator_ : estimator object\n        The validated estimator created from the `estimator` parameter.\n\n    voting_ : str\n        The validated voting strategy.\n\n    n_features_in_ : int\n        Number of features in the input dataset.\n\n        .. versionadded:: 0.9\n\n    feature_names_in_ : ndarray of shape (`n_features_in_`,)\n        Names of features seen during `fit`. Defined only when `X` has feature\n        names that are all strings.\n\n        .. versionadded:: 0.10\n\n    See Also\n    --------\n    EditedNearestNeighbours : Under-sampling by editing samples.\n\n    CondensedNearestNeighbour: Under-sampling by condensing samples.\n\n    Notes\n    -----\n    Supports multi-class resampling by sampling each class independently.\n\n    Examples\n    --------\n\n    >>> from collections import Counter\n    >>> from sklearn.datasets import make_classification\n    >>> from sklearn.cluster import MiniBatchKMeans\n    >>> from imblearn.under_sampling import ClusterCentroids\n    >>> X, y = make_classification(n_classes=2, class_sep=2,\n    ... 
weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,\n    ... n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)\n    >>> print('Original dataset shape %s' % Counter(y))\n    Original dataset shape Counter({{1: 900, 0: 100}})\n    >>> cc = ClusterCentroids(\n    ...     estimator=MiniBatchKMeans(n_init=1, random_state=0), random_state=42\n    ... )\n    >>> X_res, y_res = cc.fit_resample(X, y)\n    >>> print('Resampled dataset shape %s' % Counter(y_res))\n    Resampled dataset shape Counter({{...}})\n    \"\"\"\n\n    _parameter_constraints: dict = {\n        **BaseUnderSampler._parameter_constraints,\n        \"estimator\": [HasMethods([\"fit\", \"predict\"]), None],\n        \"voting\": [StrOptions({\"auto\", \"hard\", \"soft\"})],\n        \"random_state\": [\"random_state\"],\n    }\n\n    def __init__(\n        self,\n        *,\n        sampling_strategy=\"auto\",\n        random_state=None,\n        estimator=None,\n        voting=\"auto\",\n    ):\n        super().__init__(sampling_strategy=sampling_strategy)\n        self.random_state = random_state\n        self.estimator = estimator\n        self.voting = voting\n\n    def _validate_estimator(self):\n        \"\"\"Private function to create the KMeans estimator\"\"\"\n        if self.estimator is None:\n            self.estimator_ = KMeans(random_state=self.random_state)\n        else:\n            self.estimator_ = clone(self.estimator)\n            if \"n_clusters\" not in self.estimator_.get_params():\n                raise ValueError(\n                    \"`estimator` should be a clustering estimator exposing a parameter\"\n                    \" `n_clusters` and a fitted parameter `cluster_centers_`.\"\n                )\n\n    def _generate_sample(self, X, y, centroids, target_class):\n        if self.voting_ == \"hard\":\n            nearest_neighbors = NearestNeighbors(n_neighbors=1)\n            nearest_neighbors.fit(X, y)\n            indices = 
nearest_neighbors.kneighbors(centroids, return_distance=False)\n            X_new = _safe_indexing(X, np.squeeze(indices))\n        else:\n            if sparse.issparse(X):\n                X_new = sparse.csr_matrix(centroids, dtype=X.dtype)\n            else:\n                X_new = centroids\n        y_new = np.array([target_class] * centroids.shape[0], dtype=y.dtype)\n\n        return X_new, y_new\n\n    def _fit_resample(self, X, y):\n        self._validate_estimator()\n\n        if self.voting == \"auto\":\n            self.voting_ = \"hard\" if sparse.issparse(X) else \"soft\"\n        else:\n            self.voting_ = self.voting\n\n        X_resampled, y_resampled = [], []\n        for target_class in np.unique(y):\n            target_class_indices = np.flatnonzero(y == target_class)\n            if target_class in self.sampling_strategy_.keys():\n                n_samples = self.sampling_strategy_[target_class]\n                self.estimator_.set_params(**{\"n_clusters\": n_samples})\n                self.estimator_.fit(_safe_indexing(X, target_class_indices))\n                if not hasattr(self.estimator_, \"cluster_centers_\"):\n                    raise RuntimeError(\n                        \"`estimator` should be a clustering estimator exposing a \"\n                        \"fitted parameter `cluster_centers_`.\"\n                    )\n                X_new, y_new = self._generate_sample(\n                    _safe_indexing(X, target_class_indices),\n                    _safe_indexing(y, target_class_indices),\n                    self.estimator_.cluster_centers_,\n                    target_class,\n                )\n                X_resampled.append(X_new)\n                y_resampled.append(y_new)\n            else:\n                X_resampled.append(_safe_indexing(X, target_class_indices))\n                y_resampled.append(_safe_indexing(y, target_class_indices))\n\n        if sparse.issparse(X):\n            X_resampled = 
sparse.vstack(X_resampled)\n        else:\n            X_resampled = np.vstack(X_resampled)\n        y_resampled = np.hstack(y_resampled)\n\n        return X_resampled, np.array(y_resampled, dtype=y.dtype)\n\n    def _more_tags(self):\n        return {\"sample_indices\": False}\n\n    def __sklearn_tags__(self):\n        tags = super().__sklearn_tags__()\n        tags.sampler_tags.sample_indices = False\n        return tags\n"
  },
  {
    "path": "imblearn/under_sampling/_prototype_generation/tests/__init__.py",
    "content": ""
  },
  {
    "path": "imblearn/under_sampling/_prototype_generation/tests/test_cluster_centroids.py",
    "content": "\"\"\"Test the module cluster centroids.\"\"\"\nfrom collections import Counter\n\nimport numpy as np\nimport pytest\nfrom scipy import sparse\nfrom sklearn.cluster import KMeans\nfrom sklearn.datasets import make_classification\nfrom sklearn.linear_model import LogisticRegression\n\nfrom imblearn.under_sampling import ClusterCentroids\nfrom imblearn.utils.testing import _CustomClusterer\n\nRND_SEED = 0\nX = np.array(\n    [\n        [0.04352327, -0.20515826],\n        [0.92923648, 0.76103773],\n        [0.20792588, 1.49407907],\n        [0.47104475, 0.44386323],\n        [0.22950086, 0.33367433],\n        [0.15490546, 0.3130677],\n        [0.09125309, -0.85409574],\n        [0.12372842, 0.6536186],\n        [0.13347175, 0.12167502],\n        [0.094035, -2.55298982],\n    ]\n)\nY = np.array([1, 0, 1, 0, 1, 1, 1, 1, 0, 1])\nR_TOL = 1e-4\n\n\n@pytest.mark.parametrize(\n    \"X, expected_voting\", [(X, \"soft\"), (sparse.csr_matrix(X), \"hard\")]\n)\n@pytest.mark.filterwarnings(\"ignore:The default value of `n_init` will change\")\ndef test_fit_resample_check_voting(X, expected_voting):\n    cc = ClusterCentroids(random_state=RND_SEED)\n    cc.fit_resample(X, Y)\n    assert cc.voting_ == expected_voting\n\n\n@pytest.mark.filterwarnings(\"ignore:The default value of `n_init` will change\")\ndef test_fit_resample_auto():\n    sampling_strategy = \"auto\"\n    cc = ClusterCentroids(sampling_strategy=sampling_strategy, random_state=RND_SEED)\n    X_resampled, y_resampled = cc.fit_resample(X, Y)\n    assert X_resampled.shape == (6, 2)\n    assert y_resampled.shape == (6,)\n\n\n@pytest.mark.filterwarnings(\"ignore:The default value of `n_init` will change\")\ndef test_fit_resample_half():\n    sampling_strategy = {0: 3, 1: 6}\n    cc = ClusterCentroids(sampling_strategy=sampling_strategy, random_state=RND_SEED)\n    X_resampled, y_resampled = cc.fit_resample(X, Y)\n    assert X_resampled.shape == (9, 2)\n    assert y_resampled.shape == 
(9,)\n\n\n@pytest.mark.filterwarnings(\"ignore:The default value of `n_init` will change\")\ndef test_multiclass_fit_resample():\n    y = Y.copy()\n    y[5] = 2\n    y[6] = 2\n    cc = ClusterCentroids(random_state=RND_SEED)\n    _, y_resampled = cc.fit_resample(X, y)\n    count_y_res = Counter(y_resampled)\n    assert count_y_res[0] == 2\n    assert count_y_res[1] == 2\n    assert count_y_res[2] == 2\n\n\ndef test_fit_resample_object():\n    sampling_strategy = \"auto\"\n    cluster = KMeans(random_state=RND_SEED, n_init=1)\n    cc = ClusterCentroids(\n        sampling_strategy=sampling_strategy,\n        random_state=RND_SEED,\n        estimator=cluster,\n    )\n\n    X_resampled, y_resampled = cc.fit_resample(X, Y)\n    assert X_resampled.shape == (6, 2)\n    assert y_resampled.shape == (6,)\n\n\ndef test_fit_hard_voting():\n    sampling_strategy = \"auto\"\n    voting = \"hard\"\n    cluster = KMeans(random_state=RND_SEED, n_init=1)\n    cc = ClusterCentroids(\n        sampling_strategy=sampling_strategy,\n        random_state=RND_SEED,\n        estimator=cluster,\n        voting=voting,\n    )\n\n    X_resampled, y_resampled = cc.fit_resample(X, Y)\n    assert X_resampled.shape == (6, 2)\n    assert y_resampled.shape == (6,)\n    for x in X_resampled:\n        assert np.any(np.all(x == X, axis=1))\n\n\n@pytest.mark.filterwarnings(\"ignore:The default value of `n_init` will change\")\ndef test_cluster_centroids_hard_target_class():\n    # check that the samples selected by the hard voting correspond to the\n    # targeted class\n    # non-regression test for:\n    # https://github.com/scikit-learn-contrib/imbalanced-learn/issues/738\n    X, y = make_classification(\n        n_samples=1000,\n        n_features=2,\n        n_informative=1,\n        n_redundant=0,\n        n_repeated=0,\n        n_clusters_per_class=1,\n        weights=[0.3, 0.7],\n        class_sep=0.01,\n        random_state=0,\n    )\n\n    cc = ClusterCentroids(voting=\"hard\", 
random_state=0)\n    X_res, y_res = cc.fit_resample(X, y)\n\n    minority_class_indices = np.flatnonzero(y == 0)\n    X_minority_class = X[minority_class_indices]\n\n    resampled_majority_class_indices = np.flatnonzero(y_res == 1)\n    X_res_majority = X_res[resampled_majority_class_indices]\n\n    sample_from_minority_in_majority = [\n        np.all(np.isclose(selected_sample, minority_sample))\n        for selected_sample in X_res_majority\n        for minority_sample in X_minority_class\n    ]\n    assert sum(sample_from_minority_in_majority) == 0\n\n\ndef test_cluster_centroids_custom_clusterer():\n    clusterer = _CustomClusterer()\n    cc = ClusterCentroids(estimator=clusterer, random_state=RND_SEED)\n    cc.fit_resample(X, Y)\n    assert isinstance(cc.estimator_.cluster_centers_, np.ndarray)\n\n    clusterer = _CustomClusterer(expose_cluster_centers=False)\n    cc = ClusterCentroids(estimator=clusterer, random_state=RND_SEED)\n    err_msg = (\n        \"`estimator` should be a clustering estimator exposing a fitted parameter \"\n        \"`cluster_centers_`.\"\n    )\n    with pytest.raises(RuntimeError, match=err_msg):\n        cc.fit_resample(X, Y)\n\n    clusterer = LogisticRegression()\n    cc = ClusterCentroids(estimator=clusterer, random_state=RND_SEED)\n    err_msg = (\n        \"`estimator` should be a clustering estimator exposing a parameter \"\n        \"`n_clusters` and a fitted parameter `cluster_centers_`.\"\n    )\n    with pytest.raises(ValueError, match=err_msg):\n        cc.fit_resample(X, Y)\n"
  },
  {
    "path": "imblearn/under_sampling/_prototype_selection/__init__.py",
    "content": "\"\"\"\nThe :mod:`imblearn.under_sampling.prototype_selection` submodule contains\nmethods that select samples in order to balance the dataset.\n\"\"\"\n\nfrom imblearn.under_sampling._prototype_selection._condensed_nearest_neighbour import (\n    CondensedNearestNeighbour,\n)\nfrom imblearn.under_sampling._prototype_selection._edited_nearest_neighbours import (\n    AllKNN,\n    EditedNearestNeighbours,\n    RepeatedEditedNearestNeighbours,\n)\nfrom imblearn.under_sampling._prototype_selection._instance_hardness_threshold import (\n    InstanceHardnessThreshold,\n)\nfrom imblearn.under_sampling._prototype_selection._nearmiss import NearMiss\nfrom imblearn.under_sampling._prototype_selection._neighbourhood_cleaning_rule import (\n    NeighbourhoodCleaningRule,\n)\nfrom imblearn.under_sampling._prototype_selection._one_sided_selection import (\n    OneSidedSelection,\n)\nfrom imblearn.under_sampling._prototype_selection._random_under_sampler import (\n    RandomUnderSampler,\n)\nfrom imblearn.under_sampling._prototype_selection._tomek_links import TomekLinks\n\n__all__ = [\n    \"RandomUnderSampler\",\n    \"InstanceHardnessThreshold\",\n    \"NearMiss\",\n    \"TomekLinks\",\n    \"EditedNearestNeighbours\",\n    \"RepeatedEditedNearestNeighbours\",\n    \"AllKNN\",\n    \"OneSidedSelection\",\n    \"CondensedNearestNeighbour\",\n    \"NeighbourhoodCleaningRule\",\n]\n"
  },
  {
    "path": "imblearn/under_sampling/_prototype_selection/_condensed_nearest_neighbour.py",
    "content": "\"\"\"Class to perform under-sampling based on the condensed nearest neighbour\nmethod.\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Christos Aridas\n# License: MIT\n\nimport numbers\nfrom collections import Counter\n\nimport numpy as np\nfrom scipy.sparse import issparse\nfrom sklearn.base import clone\nfrom sklearn.neighbors import KNeighborsClassifier\nfrom sklearn.utils import _safe_indexing, check_random_state\nfrom sklearn.utils._param_validation import HasMethods, Interval\n\nfrom imblearn.under_sampling.base import BaseCleaningSampler\nfrom imblearn.utils import Substitution\nfrom imblearn.utils._docstring import _n_jobs_docstring, _random_state_docstring\n\n\n@Substitution(\n    sampling_strategy=BaseCleaningSampler._sampling_strategy_docstring,\n    n_jobs=_n_jobs_docstring,\n    random_state=_random_state_docstring,\n)\nclass CondensedNearestNeighbour(BaseCleaningSampler):\n    \"\"\"Undersample based on the condensed nearest neighbour method.\n\n    Read more in the :ref:`User Guide <condensed_nearest_neighbors>`.\n\n    Parameters\n    ----------\n    {sampling_strategy}\n\n    {random_state}\n\n    n_neighbors : int or estimator object, default=None\n        If ``int``, size of the neighbourhood to consider to compute the\n        nearest neighbors. If object, an estimator that inherits from\n        :class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to\n        find the nearest-neighbors.  If `None`, a\n        :class:`~sklearn.neighbors.KNeighborsClassifier` with a 1-NN rules will\n        be used.\n\n    n_seeds_S : int, default=1\n        Number of samples to extract in order to build the set S.\n\n    {n_jobs}\n\n    Attributes\n    ----------\n    sampling_strategy_ : dict\n        Dictionary containing the information to sample the dataset. 
The keys\n        correspond to the class labels from which to sample and the values\n        are the number of samples to sample.\n\n    estimators_ : list of estimator objects of shape (n_resampled_classes - 1,)\n        Contains the K-nearest neighbor estimator used for each pair of classes.\n\n        .. versionadded:: 0.12\n\n    sample_indices_ : ndarray of shape (n_new_samples,)\n        Indices of the samples selected.\n\n        .. versionadded:: 0.4\n\n    n_features_in_ : int\n        Number of features in the input dataset.\n\n        .. versionadded:: 0.9\n\n    feature_names_in_ : ndarray of shape (`n_features_in_`,)\n        Names of features seen during `fit`. Defined only when `X` has feature\n        names that are all strings.\n\n        .. versionadded:: 0.10\n\n    See Also\n    --------\n    EditedNearestNeighbours : Undersample by editing samples.\n\n    RepeatedEditedNearestNeighbours : Undersample by repeating the ENN algorithm.\n\n    AllKNN : Undersample using ENN and a varying number of neighbours.\n\n    Notes\n    -----\n    The method is based on [1]_.\n\n    Supports multi-class resampling: a one (minority) vs. each other class\n    strategy is applied.\n\n    References\n    ----------\n    .. [1] P. Hart, \"The condensed nearest neighbor rule,\"\n       In Information Theory, IEEE Transactions on, vol. 14(3),\n       pp. 
515-516, 1968.\n\n    Examples\n    --------\n    >>> from collections import Counter  # doctest: +SKIP\n    >>> from sklearn.datasets import fetch_openml  # doctest: +SKIP\n    >>> from sklearn.preprocessing import scale  # doctest: +SKIP\n    >>> from imblearn.under_sampling import \\\nCondensedNearestNeighbour  # doctest: +SKIP\n    >>> X, y = fetch_openml('diabetes', version=1, return_X_y=True)  # doctest: +SKIP\n    >>> X = scale(X)  # doctest: +SKIP\n    >>> print('Original dataset shape %s' % Counter(y))  # doctest: +SKIP\n    Original dataset shape Counter({{'tested_negative': 500, \\\n        'tested_positive': 268}})  # doctest: +SKIP\n    >>> cnn = CondensedNearestNeighbour(random_state=42)  # doctest: +SKIP\n    >>> X_res, y_res = cnn.fit_resample(X, y)  #doctest: +SKIP\n    >>> print('Resampled dataset shape %s' % Counter(y_res))  # doctest: +SKIP\n    Resampled dataset shape Counter({{'tested_positive': 268, \\\n        'tested_negative': 181}})  # doctest: +SKIP\n    \"\"\"\n\n    _parameter_constraints: dict = {\n        **BaseCleaningSampler._parameter_constraints,\n        \"n_neighbors\": [\n            Interval(numbers.Integral, 1, None, closed=\"left\"),\n            HasMethods([\"kneighbors\", \"kneighbors_graph\"]),\n            None,\n        ],\n        \"n_seeds_S\": [Interval(numbers.Integral, 1, None, closed=\"left\")],\n        \"n_jobs\": [numbers.Integral, None],\n        \"random_state\": [\"random_state\"],\n    }\n\n    def __init__(\n        self,\n        *,\n        sampling_strategy=\"auto\",\n        random_state=None,\n        n_neighbors=None,\n        n_seeds_S=1,\n        n_jobs=None,\n    ):\n        super().__init__(sampling_strategy=sampling_strategy)\n        self.random_state = random_state\n        self.n_neighbors = n_neighbors\n        self.n_seeds_S = n_seeds_S\n        self.n_jobs = n_jobs\n\n    def _validate_estimator(self):\n        \"\"\"Private function to create the NN estimator\"\"\"\n        if 
self.n_neighbors is None:\n            estimator = KNeighborsClassifier(n_neighbors=1, n_jobs=self.n_jobs)\n        elif isinstance(self.n_neighbors, numbers.Integral):\n            estimator = KNeighborsClassifier(\n                n_neighbors=self.n_neighbors, n_jobs=self.n_jobs\n            )\n        elif isinstance(self.n_neighbors, KNeighborsClassifier):\n            estimator = clone(self.n_neighbors)\n\n        return estimator\n\n    def _fit_resample(self, X, y):\n        estimator = self._validate_estimator()\n\n        random_state = check_random_state(self.random_state)\n        target_stats = Counter(y)\n        class_minority = min(target_stats, key=target_stats.get)\n        idx_under = np.empty((0,), dtype=int)\n\n        self.estimators_ = []\n        for target_class in np.unique(y):\n            if target_class in self.sampling_strategy_.keys():\n                # Randomly get one sample from the majority class\n                # Generate the index to select\n                idx_maj = np.flatnonzero(y == target_class)\n                idx_maj_sample = idx_maj[\n                    random_state.randint(\n                        low=0,\n                        high=target_stats[target_class],\n                        size=self.n_seeds_S,\n                    )\n                ]\n\n                # Create the set C - One majority samples and all minority\n                C_indices = np.append(\n                    np.flatnonzero(y == class_minority), idx_maj_sample\n                )\n                C_x = _safe_indexing(X, C_indices)\n                C_y = _safe_indexing(y, C_indices)\n\n                # Create the set S - all majority samples\n                S_indices = np.flatnonzero(y == target_class)\n                S_x = _safe_indexing(X, S_indices)\n                S_y = _safe_indexing(y, S_indices)\n\n                # fit knn on C\n                self.estimators_.append(clone(estimator).fit(C_x, C_y))\n\n                
good_classif_label = idx_maj_sample.copy()\n                # Check each sample in S if we keep it or drop it\n                for idx_sam, (x_sam, y_sam) in enumerate(zip(S_x, S_y)):\n                    # Do not select samples which are already well classified\n                    if idx_sam in good_classif_label:\n                        continue\n\n                    # Classify on S\n                    if not issparse(x_sam):\n                        x_sam = x_sam.reshape(1, -1)\n                    pred_y = self.estimators_[-1].predict(x_sam)\n\n                    # If the prediction does not agree with the true label,\n                    # append the sample to C_x\n                    if y_sam != pred_y:\n                        # Keep the index for later\n                        idx_maj_sample = np.append(idx_maj_sample, idx_maj[idx_sam])\n\n                        # Update C\n                        C_indices = np.append(C_indices, idx_maj[idx_sam])\n                        C_x = _safe_indexing(X, C_indices)\n                        C_y = _safe_indexing(y, C_indices)\n\n                        # fit a knn on C\n                        self.estimators_[-1].fit(C_x, C_y)\n\n                        # This is experimental, to speed up the search:\n                        # classify all the elements in S and avoid testing the\n                        # well classified elements\n                        pred_S_y = self.estimators_[-1].predict(S_x)\n                        good_classif_label = np.unique(\n                            np.append(idx_maj_sample, np.flatnonzero(pred_S_y == S_y))\n                        )\n\n                idx_under = np.concatenate((idx_under, idx_maj_sample), axis=0)\n            else:\n                idx_under = np.concatenate(\n                    (idx_under, np.flatnonzero(y == target_class)), axis=0\n                )\n\n        self.sample_indices_ = idx_under\n\n        return _safe_indexing(X, idx_under), _safe_indexing(y, idx_under)\n\n   
 def _more_tags(self):\n        return {\"sample_indices\": True}\n\n    def __sklearn_tags__(self):\n        tags = super().__sklearn_tags__()\n        tags.sampler_tags.sample_indices = True\n        return tags\n"
  },
  {
    "path": "imblearn/under_sampling/_prototype_selection/_edited_nearest_neighbours.py",
    "content": "\"\"\"Classes to perform under-sampling based on the edited nearest neighbour\nmethod.\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Dayvid Oliveira\n#          Christos Aridas\n# License: MIT\n\nimport numbers\nfrom collections import Counter\n\nimport numpy as np\nfrom scipy.stats import mode\nfrom sklearn.utils import _safe_indexing\nfrom sklearn.utils._param_validation import HasMethods, Interval, StrOptions\n\nfrom imblearn.under_sampling.base import BaseCleaningSampler\nfrom imblearn.utils import Substitution, check_neighbors_object\nfrom imblearn.utils._docstring import _n_jobs_docstring\n\nSEL_KIND = (\"all\", \"mode\")\n\n\n@Substitution(\n    sampling_strategy=BaseCleaningSampler._sampling_strategy_docstring,\n    n_jobs=_n_jobs_docstring,\n)\nclass EditedNearestNeighbours(BaseCleaningSampler):\n    \"\"\"Undersample based on the edited nearest neighbour method.\n\n    This method cleans the dataset by removing samples close to the\n    decision boundary. It removes observations from the majority class or\n    classes when any or most of its closest neighours are from a different class.\n\n    Read more in the :ref:`User Guide <edited_nearest_neighbors>`.\n\n    Parameters\n    ----------\n    {sampling_strategy}\n\n    n_neighbors : int or object, default=3\n        If ``int``, size of the neighbourhood to consider for the undersampling, i.e.,\n        if `n_neighbors=3`, a sample will be removed when any or most of its 3 closest\n        neighbours are from a different class. If object, an estimator that inherits\n        from :class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to\n        find the nearest-neighbors. 
Note that if you want to examine the 3 closest\n        neighbours of a sample for the undersampling, you need to pass a 4-KNN.\n\n    kind_sel : {{'all', 'mode'}}, default='all'\n        Strategy to use to exclude samples.\n\n        - If ``'all'``, all neighbours should be of the same class as the examined\n          sample for it not to be excluded.\n        - If ``'mode'``, most neighbours should be of the same class as the examined\n          sample for it not to be excluded.\n\n        The strategy `\"all\"` will be less conservative than `'mode'`. Thus,\n        more samples will be removed when `kind_sel=\"all\"`, generally.\n\n    {n_jobs}\n\n    Attributes\n    ----------\n    sampling_strategy_ : dict\n        Dictionary containing the information to sample the dataset. The keys\n        correspond to the class labels from which to sample and the values\n        are the number of samples to sample.\n\n    nn_ : estimator object\n        Validated K-nearest Neighbours instance created from `n_neighbors` parameter.\n\n    sample_indices_ : ndarray of shape (n_new_samples,)\n        Indices of the samples selected.\n\n        .. versionadded:: 0.4\n\n    n_features_in_ : int\n        Number of features in the input dataset.\n\n        .. versionadded:: 0.9\n\n    feature_names_in_ : ndarray of shape (`n_features_in_`,)\n        Names of features seen during `fit`. Defined only when `X` has feature\n        names that are all strings.\n\n        .. versionadded:: 0.10\n\n    See Also\n    --------\n    CondensedNearestNeighbour : Undersample by condensing samples.\n\n    RepeatedEditedNearestNeighbours : Undersample by repeating the ENN algorithm.\n\n    AllKNN : Undersample using ENN with varying neighbours.\n\n    Notes\n    -----\n    The method is based on [1]_.\n\n    Supports multi-class resampling. A one-vs.-rest scheme is used when\n    sampling a class as proposed in [1]_.\n\n    References\n    ----------\n    .. [1] D. 
Wilson, \"Asymptotic Properties of Nearest Neighbor Rules Using\n       Edited Data,\" In IEEE Transactions on Systems, Man, and Cybernetics,\n       vol. 2 (3), pp. 408-421, 1972.\n\n    Examples\n    --------\n\n    >>> from collections import Counter\n    >>> from sklearn.datasets import make_classification\n    >>> from imblearn.under_sampling import EditedNearestNeighbours\n    >>> X, y = make_classification(n_classes=2, class_sep=2,\n    ... weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,\n    ... n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)\n    >>> print('Original dataset shape %s' % Counter(y))\n    Original dataset shape Counter({{1: 900, 0: 100}})\n    >>> enn = EditedNearestNeighbours()\n    >>> X_res, y_res = enn.fit_resample(X, y)\n    >>> print('Resampled dataset shape %s' % Counter(y_res))\n    Resampled dataset shape Counter({{1: 887, 0: 100}})\n    \"\"\"\n\n    _parameter_constraints: dict = {\n        **BaseCleaningSampler._parameter_constraints,\n        \"n_neighbors\": [\n            Interval(numbers.Integral, 1, None, closed=\"left\"),\n            HasMethods([\"kneighbors\", \"kneighbors_graph\"]),\n        ],\n        \"kind_sel\": [StrOptions({\"all\", \"mode\"})],\n        \"n_jobs\": [numbers.Integral, None],\n    }\n\n    def __init__(\n        self,\n        *,\n        sampling_strategy=\"auto\",\n        n_neighbors=3,\n        kind_sel=\"all\",\n        n_jobs=None,\n    ):\n        super().__init__(sampling_strategy=sampling_strategy)\n        self.n_neighbors = n_neighbors\n        self.kind_sel = kind_sel\n        self.n_jobs = n_jobs\n\n    def _validate_estimator(self):\n        \"\"\"Validate the estimator created in the ENN.\"\"\"\n        self.nn_ = check_neighbors_object(\n            \"n_neighbors\", self.n_neighbors, additional_neighbor=1\n        )\n        self.nn_.set_params(**{\"n_jobs\": self.n_jobs})\n\n    def _fit_resample(self, X, y):\n        
self._validate_estimator()\n\n        idx_under = np.empty((0,), dtype=int)\n\n        self.nn_.fit(X)\n\n        for target_class in np.unique(y):\n            if target_class in self.sampling_strategy_.keys():\n                target_class_indices = np.flatnonzero(y == target_class)\n                X_class = _safe_indexing(X, target_class_indices)\n                y_class = _safe_indexing(y, target_class_indices)\n                nnhood_idx = self.nn_.kneighbors(X_class, return_distance=False)[:, 1:]\n                nnhood_label = y[nnhood_idx]\n                if self.kind_sel == \"mode\":\n                    nnhood_label, _ = mode(nnhood_label, axis=1, keepdims=False)\n                    nnhood_bool = np.ravel(nnhood_label) == y_class\n                elif self.kind_sel == \"all\":\n                    nnhood_label = nnhood_label == target_class\n                    nnhood_bool = np.all(nnhood_label, axis=1)\n                index_target_class = np.flatnonzero(nnhood_bool)\n            else:\n                index_target_class = slice(None)\n\n            idx_under = np.concatenate(\n                (\n                    idx_under,\n                    np.flatnonzero(y == target_class)[index_target_class],\n                ),\n                axis=0,\n            )\n\n        self.sample_indices_ = idx_under\n\n        return _safe_indexing(X, idx_under), _safe_indexing(y, idx_under)\n\n    def _more_tags(self):\n        return {\"sample_indices\": True}\n\n    def __sklearn_tags__(self):\n        tags = super().__sklearn_tags__()\n        tags.sampler_tags.sample_indices = True\n        return tags\n\n\n@Substitution(\n    sampling_strategy=BaseCleaningSampler._sampling_strategy_docstring,\n    n_jobs=_n_jobs_docstring,\n)\nclass RepeatedEditedNearestNeighbours(BaseCleaningSampler):\n    \"\"\"Undersample based on the repeated edited nearest neighbour method.\n\n    This method repeats the :class:`EditedNearestNeighbours` algorithm several times.\n    The 
repetitions will stop when i) the maximum number of iterations is reached,\n    or ii) no more observations are being removed, or iii) one of the majority classes\n    becomes a minority class or iv) one of the majority classes disappears\n    during undersampling.\n\n    Read more in the :ref:`User Guide <edited_nearest_neighbors>`.\n\n    Parameters\n    ----------\n    {sampling_strategy}\n\n    n_neighbors : int or object, default=3\n        If ``int``, size of the neighbourhood to consider for the undersampling, i.e.,\n        if `n_neighbors=3`, a sample will be removed when any or most of its 3 closest\n        neighbours are from a different class. If object, an estimator that inherits\n        from :class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to\n        find the nearest-neighbors. Note that if you want to examine the 3 closest\n        neighbours of a sample for the undersampling, you need to pass a 4-KNN.\n\n    max_iter : int, default=100\n        Maximum number of iterations of the edited nearest neighbours.\n\n    kind_sel : {{'all', 'mode'}}, default='all'\n        Strategy to use to exclude samples.\n\n        - If ``'all'``, all neighbours should be of the same class as the examined\n          sample for it not to be excluded.\n        - If ``'mode'``, most neighbours should be of the same class as the examined\n          sample for it not to be excluded.\n\n        The strategy `\"all\"` will be less conservative than `'mode'`. Thus,\n        more samples will be removed when `kind_sel=\"all\"`, generally.\n\n    {n_jobs}\n\n    Attributes\n    ----------\n    sampling_strategy_ : dict\n        Dictionary containing the information to sample the dataset. 
The keys\n        correspond to the class labels from which to sample and the values\n        are the number of samples to sample.\n\n    nn_ : estimator object\n        Validated K-nearest Neighbours estimator linked to the parameter `n_neighbors`.\n\n    enn_ : sampler object\n        The validated :class:`~imblearn.under_sampling.EditedNearestNeighbours`\n        instance.\n\n    sample_indices_ : ndarray of shape (n_new_samples,)\n        Indices of the samples selected.\n\n        .. versionadded:: 0.4\n\n    n_iter_ : int\n        Number of iterations run.\n\n        .. versionadded:: 0.6\n\n    n_features_in_ : int\n        Number of features in the input dataset.\n\n        .. versionadded:: 0.9\n\n    feature_names_in_ : ndarray of shape (`n_features_in_`,)\n        Names of features seen during `fit`. Defined only when `X` has feature\n        names that are all strings.\n\n        .. versionadded:: 0.10\n\n    See Also\n    --------\n    CondensedNearestNeighbour : Undersample by condensing samples.\n\n    EditedNearestNeighbours : Undersample by editing samples.\n\n    AllKNN : Undersample using ENN with varying neighbours.\n\n    Notes\n    -----\n    The method is based on [1]_. A one-vs.-rest scheme is used when\n    sampling a class as proposed in [1]_.\n\n    Supports multi-class resampling.\n\n    References\n    ----------\n    .. [1] I. Tomek, \"An Experiment with the Edited Nearest-Neighbor\n       Rule,\" IEEE Transactions on Systems, Man, and Cybernetics, vol. 6(6),\n       pp. 448-452, June 1976.\n\n    Examples\n    --------\n    >>> from collections import Counter\n    >>> from sklearn.datasets import make_classification\n    >>> from imblearn.under_sampling import RepeatedEditedNearestNeighbours\n    >>> X, y = make_classification(n_classes=2, class_sep=2,\n    ... weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,\n    ... 
n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)\n    >>> print('Original dataset shape %s' % Counter(y))\n    Original dataset shape Counter({{1: 900, 0: 100}})\n    >>> renn = RepeatedEditedNearestNeighbours()\n    >>> X_res, y_res = renn.fit_resample(X, y)\n    >>> print('Resampled dataset shape %s' % Counter(y_res))\n    Resampled dataset shape Counter({{1: 887, 0: 100}})\n    \"\"\"\n\n    _parameter_constraints: dict = {\n        **BaseCleaningSampler._parameter_constraints,\n        \"n_neighbors\": [\n            Interval(numbers.Integral, 1, None, closed=\"left\"),\n            HasMethods([\"kneighbors\", \"kneighbors_graph\"]),\n        ],\n        \"max_iter\": [Interval(numbers.Integral, 1, None, closed=\"left\")],\n        \"kind_sel\": [StrOptions({\"all\", \"mode\"})],\n        \"n_jobs\": [numbers.Integral, None],\n    }\n\n    def __init__(\n        self,\n        *,\n        sampling_strategy=\"auto\",\n        n_neighbors=3,\n        max_iter=100,\n        kind_sel=\"all\",\n        n_jobs=None,\n    ):\n        super().__init__(sampling_strategy=sampling_strategy)\n        self.n_neighbors = n_neighbors\n        self.kind_sel = kind_sel\n        self.n_jobs = n_jobs\n        self.max_iter = max_iter\n\n    def _validate_estimator(self):\n        \"\"\"Private function to create the NN estimator\"\"\"\n        self.nn_ = check_neighbors_object(\n            \"n_neighbors\", self.n_neighbors, additional_neighbor=1\n        )\n\n        self.enn_ = EditedNearestNeighbours(\n            sampling_strategy=self.sampling_strategy,\n            n_neighbors=self.nn_,\n            kind_sel=self.kind_sel,\n            n_jobs=self.n_jobs,\n        )\n\n    def _fit_resample(self, X, y):\n        self._validate_estimator()\n\n        X_, y_ = X, y\n        self.sample_indices_ = np.arange(X.shape[0], dtype=int)\n        target_stats = Counter(y)\n        class_minority = min(target_stats, key=target_stats.get)\n\n        for n_iter in 
range(self.max_iter):\n            prev_len = y_.shape[0]\n            X_enn, y_enn = self.enn_.fit_resample(X_, y_)\n\n            # Check the stopping criteria\n            # 1. If the vector y does not change\n            # 2. If the number of samples in another class becomes smaller than\n            # the number of samples in the minority class\n            # 3. If one of the classes disappears\n\n            # Case 1\n            b_conv = prev_len == y_enn.shape[0]\n\n            # Case 2\n            stats_enn = Counter(y_enn)\n            count_non_min = np.array(\n                [\n                    val\n                    for val, key in zip(stats_enn.values(), stats_enn.keys())\n                    if key != class_minority\n                ]\n            )\n            b_min_bec_maj = np.any(count_non_min < target_stats[class_minority])\n\n            # Case 3\n            b_remove_maj_class = len(stats_enn) < len(target_stats)\n\n            (\n                X_,\n                y_,\n            ) = (\n                X_enn,\n                y_enn,\n            )\n            self.sample_indices_ = self.sample_indices_[self.enn_.sample_indices_]\n\n            if b_conv or b_min_bec_maj or b_remove_maj_class:\n                if b_conv:\n                    (\n                        X_,\n                        y_,\n                    ) = (\n                        X_enn,\n                        y_enn,\n                    )\n                    self.sample_indices_ = self.sample_indices_[\n                        self.enn_.sample_indices_\n                    ]\n                break\n\n        self.n_iter_ = n_iter + 1\n        X_resampled, y_resampled = X_, y_\n\n        return X_resampled, y_resampled\n\n    def _more_tags(self):\n        return {\"sample_indices\": True}\n\n    def __sklearn_tags__(self):\n        tags = super().__sklearn_tags__()\n        tags.sampler_tags.sample_indices = True\n        return 
tags\n\n\n@Substitution(\n    sampling_strategy=BaseCleaningSampler._sampling_strategy_docstring,\n    n_jobs=_n_jobs_docstring,\n)\nclass AllKNN(BaseCleaningSampler):\n    \"\"\"Undersample based on the AllKNN method.\n\n    This method will apply :class:`EditedNearestNeighbours` several times varying the\n    number of nearest neighbours at each round. It begins by examining 1 closest\n    neighbour, and it increases the neighbourhood by 1 at each round.\n\n    The algorithm stops when the maximum number of neighbours is examined or\n    when the majority class becomes the minority class, whichever comes first.\n\n    Read more in the :ref:`User Guide <edited_nearest_neighbors>`.\n\n    Parameters\n    ----------\n    {sampling_strategy}\n\n    n_neighbors : int or estimator object, default=3\n        If ``int``, size of the maximum neighbourhood to examine for the undersampling.\n        If `n_neighbors=3`, in the first iteration the algorithm will examine 1 closest\n        neighbour, in the second round 2, and in the final round 3. If object, an\n        estimator that inherits from :class:`~sklearn.neighbors.base.KNeighborsMixin`\n        that will be used to find the nearest-neighbors. Note that if you want to\n        examine the 3 closest neighbours of a sample, you need to pass a 4-KNN.\n\n    kind_sel : {{'all', 'mode'}}, default='all'\n        Strategy to use to exclude samples.\n\n        - If ``'all'``, all neighbours should be of the same class as the examined\n          sample for it not to be excluded.\n        - If ``'mode'``, most neighbours should be of the same class as the examined\n          sample for it not to be excluded.\n\n        The strategy `\"all\"` will be less conservative than `'mode'`. Thus,\n        more samples will be removed when `kind_sel=\"all\"`, generally.\n\n    allow_minority : bool, default=False\n        If ``True``, it allows the majority classes to become the minority\n        class without early stopping.\n\n        .. 
versionadded:: 0.3\n\n    {n_jobs}\n\n    Attributes\n    ----------\n    sampling_strategy_ : dict\n        Dictionary containing the information to sample the dataset. The keys\n        correspond to the class labels from which to sample and the values\n        are the number of samples to sample.\n\n    nn_ : estimator object\n        Validated K-nearest Neighbours estimator linked to the parameter `n_neighbors`.\n\n    enn_ : sampler object\n        The validated :class:`~imblearn.under_sampling.EditedNearestNeighbours`\n        instance.\n\n    sample_indices_ : ndarray of shape (n_new_samples,)\n        Indices of the samples selected.\n\n        .. versionadded:: 0.4\n\n    n_features_in_ : int\n        Number of features in the input dataset.\n\n        .. versionadded:: 0.9\n\n    feature_names_in_ : ndarray of shape (`n_features_in_`,)\n        Names of features seen during `fit`. Defined only when `X` has feature\n        names that are all strings.\n\n        .. versionadded:: 0.10\n\n    See Also\n    --------\n    CondensedNearestNeighbour: Under-sampling by condensing samples.\n\n    EditedNearestNeighbours: Under-sampling by editing samples.\n\n    RepeatedEditedNearestNeighbours: Under-sampling by repeating ENN.\n\n    Notes\n    -----\n    The method is based on [1]_.\n\n    Supports multi-class resampling. A one-vs.-rest scheme is used when\n    sampling a class as proposed in [1]_.\n\n    References\n    ----------\n    .. [1] I. Tomek, \"An Experiment with the Edited Nearest-Neighbor\n       Rule,\" IEEE Transactions on Systems, Man, and Cybernetics, vol. 6(6),\n       pp. 448-452, June 1976.\n\n    Examples\n    --------\n    >>> from collections import Counter\n    >>> from sklearn.datasets import make_classification\n    >>> from imblearn.under_sampling import AllKNN\n    >>> X, y = make_classification(n_classes=2, class_sep=2,\n    ... weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,\n    ... 
n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)\n    >>> print('Original dataset shape %s' % Counter(y))\n    Original dataset shape Counter({{1: 900, 0: 100}})\n    >>> allknn = AllKNN()\n    >>> X_res, y_res = allknn.fit_resample(X, y)\n    >>> print('Resampled dataset shape %s' % Counter(y_res))\n    Resampled dataset shape Counter({{1: 887, 0: 100}})\n    \"\"\"\n\n    _parameter_constraints: dict = {\n        **BaseCleaningSampler._parameter_constraints,\n        \"n_neighbors\": [\n            Interval(numbers.Integral, 1, None, closed=\"left\"),\n            HasMethods([\"kneighbors\", \"kneighbors_graph\"]),\n        ],\n        \"kind_sel\": [StrOptions({\"all\", \"mode\"})],\n        \"allow_minority\": [\"boolean\"],\n        \"n_jobs\": [numbers.Integral, None],\n    }\n\n    def __init__(\n        self,\n        *,\n        sampling_strategy=\"auto\",\n        n_neighbors=3,\n        kind_sel=\"all\",\n        allow_minority=False,\n        n_jobs=None,\n    ):\n        super().__init__(sampling_strategy=sampling_strategy)\n        self.n_neighbors = n_neighbors\n        self.kind_sel = kind_sel\n        self.allow_minority = allow_minority\n        self.n_jobs = n_jobs\n\n    def _validate_estimator(self):\n        \"\"\"Create objects required by AllKNN\"\"\"\n        self.nn_ = check_neighbors_object(\n            \"n_neighbors\", self.n_neighbors, additional_neighbor=1\n        )\n\n        self.enn_ = EditedNearestNeighbours(\n            sampling_strategy=self.sampling_strategy,\n            n_neighbors=self.nn_,\n            kind_sel=self.kind_sel,\n            n_jobs=self.n_jobs,\n        )\n\n    def _fit_resample(self, X, y):\n        self._validate_estimator()\n\n        X_, y_ = X, y\n        target_stats = Counter(y)\n        class_minority = min(target_stats, key=target_stats.get)\n\n        self.sample_indices_ = np.arange(X.shape[0], dtype=int)\n\n        for curr_size_ngh in range(1, self.nn_.n_neighbors):\n     
       self.enn_.n_neighbors = curr_size_ngh\n\n            X_enn, y_enn = self.enn_.fit_resample(X_, y_)\n\n            # Check the stopping criteria\n            # 1. If the number of samples in another class becomes smaller than\n            # the number of samples in the minority class\n            # 2. If one of the classes disappears\n            # Case 1\n\n            stats_enn = Counter(y_enn)\n            count_non_min = np.array(\n                [\n                    val\n                    for val, key in zip(stats_enn.values(), stats_enn.keys())\n                    if key != class_minority\n                ]\n            )\n            b_min_bec_maj = np.any(count_non_min < target_stats[class_minority])\n            if self.allow_minority:\n                # overwrite b_min_bec_maj\n                b_min_bec_maj = False\n\n            # Case 2\n            b_remove_maj_class = len(stats_enn) < len(target_stats)\n\n            (\n                X_,\n                y_,\n            ) = (\n                X_enn,\n                y_enn,\n            )\n            self.sample_indices_ = self.sample_indices_[self.enn_.sample_indices_]\n\n            if b_min_bec_maj or b_remove_maj_class:\n                break\n\n        X_resampled, y_resampled = X_, y_\n\n        return X_resampled, y_resampled\n\n    def _more_tags(self):\n        return {\"sample_indices\": True}\n\n    def __sklearn_tags__(self):\n        tags = super().__sklearn_tags__()\n        tags.sampler_tags.sample_indices = True\n        return tags\n"
  },
  {
    "path": "imblearn/under_sampling/_prototype_selection/_instance_hardness_threshold.py",
    "content": "\"\"\"Class to perform under-sampling based on the instance hardness\nthreshold.\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Dayvid Oliveira\n#          Christos Aridas\n# License: MIT\n\nimport numbers\nfrom collections import Counter\n\nimport numpy as np\nfrom sklearn.base import clone, is_classifier\nfrom sklearn.ensemble import RandomForestClassifier\nfrom sklearn.ensemble._base import _set_random_states\nfrom sklearn.model_selection import StratifiedKFold, cross_val_predict\nfrom sklearn.utils import _safe_indexing, check_random_state\nfrom sklearn.utils._param_validation import HasMethods\n\nfrom imblearn.under_sampling.base import BaseUnderSampler\nfrom imblearn.utils import Substitution\nfrom imblearn.utils._docstring import _n_jobs_docstring, _random_state_docstring\n\n\n@Substitution(\n    sampling_strategy=BaseUnderSampler._sampling_strategy_docstring,\n    n_jobs=_n_jobs_docstring,\n    random_state=_random_state_docstring,\n)\nclass InstanceHardnessThreshold(BaseUnderSampler):\n    \"\"\"Undersample based on the instance hardness threshold.\n\n    Read more in the :ref:`User Guide <instance_hardness_threshold>`.\n\n    Parameters\n    ----------\n    estimator : estimator object, default=None\n        Classifier to be used to estimate instance hardness of the samples.\n        This classifier should implement `predict_proba`.\n\n    {sampling_strategy}\n\n    {random_state}\n\n    cv : int, default=5\n        Number of folds to be used when estimating samples' instance hardness.\n\n    {n_jobs}\n\n    Attributes\n    ----------\n    sampling_strategy_ : dict\n        Dictionary containing the information to sample the dataset. 
The keys\n        correspond to the class labels from which to sample and the values\n        are the number of samples to sample.\n\n    estimator_ : estimator object\n        The validated classifier used to estimate the instance hardness of the samples.\n\n    sample_indices_ : ndarray of shape (n_new_samples,)\n        Indices of the samples selected.\n\n        .. versionadded:: 0.4\n\n    n_features_in_ : int\n        Number of features in the input dataset.\n\n        .. versionadded:: 0.9\n\n    feature_names_in_ : ndarray of shape (`n_features_in_`,)\n        Names of features seen during `fit`. Defined only when `X` has feature\n        names that are all strings.\n\n        .. versionadded:: 0.10\n\n    See Also\n    --------\n    NearMiss : Undersample based on near-miss search.\n\n    RandomUnderSampler : Random under-sampling.\n\n    Notes\n    -----\n    The method is based on [1]_.\n\n    Supports multi-class resampling: from each class to be under-sampled, it\n    retains the observations with the highest probability of being correctly\n    classified.\n\n    References\n    ----------\n    .. [1] Smith, Michael R., Tony Martinez, and Christophe Giraud-Carrier.\n       \"An instance level analysis of data complexity.\" Machine Learning\n       95.2 (2014): 225-256.\n\n    Examples\n    --------\n    >>> from collections import Counter\n    >>> from sklearn.datasets import make_classification\n    >>> from imblearn.under_sampling import InstanceHardnessThreshold\n    >>> X, y = make_classification(n_classes=2, class_sep=2,\n    ... weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,\n    ... 
n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)\n    >>> print('Original dataset shape %s' % Counter(y))\n    Original dataset shape Counter({{1: 900, 0: 100}})\n    >>> iht = InstanceHardnessThreshold(random_state=42)\n    >>> X_res, y_res = iht.fit_resample(X, y)\n    >>> print('Resampled dataset shape %s' % Counter(y_res))\n    Resampled dataset shape Counter({{1: 5..., 0: 100}})\n    \"\"\"\n\n    _parameter_constraints: dict = {\n        **BaseUnderSampler._parameter_constraints,\n        \"estimator\": [\n            HasMethods([\"fit\", \"predict_proba\"]),\n            None,\n        ],\n        \"cv\": [\"cv_object\"],\n        \"n_jobs\": [numbers.Integral, None],\n        \"random_state\": [\"random_state\"],\n    }\n\n    def __init__(\n        self,\n        *,\n        estimator=None,\n        sampling_strategy=\"auto\",\n        random_state=None,\n        cv=5,\n        n_jobs=None,\n    ):\n        super().__init__(sampling_strategy=sampling_strategy)\n        self.random_state = random_state\n        self.estimator = estimator\n        self.cv = cv\n        self.n_jobs = n_jobs\n\n    def _validate_estimator(self, random_state):\n        \"\"\"Private function to create the classifier\"\"\"\n\n        if (\n            self.estimator is not None\n            and is_classifier(self.estimator)\n            and hasattr(self.estimator, \"predict_proba\")\n        ):\n            self.estimator_ = clone(self.estimator)\n            _set_random_states(self.estimator_, random_state)\n\n        elif self.estimator is None:\n            self.estimator_ = RandomForestClassifier(\n                n_estimators=100,\n                random_state=self.random_state,\n                n_jobs=self.n_jobs,\n            )\n\n    def _fit_resample(self, X, y):\n        random_state = check_random_state(self.random_state)\n        self._validate_estimator(random_state)\n\n        target_stats = Counter(y)\n        skf = StratifiedKFold(\n        
    n_splits=self.cv,\n            shuffle=True,\n            random_state=random_state,\n        )\n        probabilities = cross_val_predict(\n            self.estimator_,\n            X,\n            y,\n            cv=skf,\n            n_jobs=self.n_jobs,\n            method=\"predict_proba\",\n        )\n        probabilities = probabilities[range(len(y)), y]\n\n        idx_under = np.empty((0,), dtype=int)\n\n        for target_class in np.unique(y):\n            if target_class in self.sampling_strategy_.keys():\n                n_samples = self.sampling_strategy_[target_class]\n                threshold = np.percentile(\n                    probabilities[y == target_class],\n                    (1.0 - (n_samples / target_stats[target_class])) * 100.0,\n                )\n                index_target_class = np.flatnonzero(\n                    probabilities[y == target_class] >= threshold\n                )\n            else:\n                index_target_class = slice(None)\n\n            idx_under = np.concatenate(\n                (\n                    idx_under,\n                    np.flatnonzero(y == target_class)[index_target_class],\n                ),\n                axis=0,\n            )\n\n        self.sample_indices_ = idx_under\n\n        return _safe_indexing(X, idx_under), _safe_indexing(y, idx_under)\n\n    def _more_tags(self):\n        return {\"sample_indices\": True}\n\n    def __sklearn_tags__(self):\n        tags = super().__sklearn_tags__()\n        tags.sampler_tags.sample_indices = True\n        return tags\n"
  },
  {
    "path": "imblearn/under_sampling/_prototype_selection/_nearmiss.py",
    "content": "\"\"\"Class to perform under-sampling based on nearmiss methods.\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Christos Aridas\n# License: MIT\n\nimport numbers\nimport warnings\nfrom collections import Counter\n\nimport numpy as np\nfrom sklearn.utils import _safe_indexing\nfrom sklearn.utils._param_validation import HasMethods, Interval\n\nfrom imblearn.under_sampling.base import BaseUnderSampler\nfrom imblearn.utils import Substitution, check_neighbors_object\nfrom imblearn.utils._docstring import _n_jobs_docstring\n\n\n@Substitution(\n    sampling_strategy=BaseUnderSampler._sampling_strategy_docstring,\n    n_jobs=_n_jobs_docstring,\n)\nclass NearMiss(BaseUnderSampler):\n    \"\"\"Class to perform under-sampling based on NearMiss methods.\n\n    Read more in the :ref:`User Guide <controlled_under_sampling>`.\n\n    Parameters\n    ----------\n    {sampling_strategy}\n\n    version : int, default=1\n        Version of the NearMiss to use. Possible values are 1, 2 or 3.\n\n    n_neighbors : int or estimator object, default=3\n        If ``int``, size of the neighbourhood to consider to compute the\n        average distance to the minority point samples.  If object, an\n        estimator that inherits from\n        :class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to\n        find the k_neighbors.\n        By default, it will be a 3-NN.\n\n    n_neighbors_ver3 : int or estimator object, default=3\n        If ``int``, NearMiss-3 algorithm start by a phase of re-sampling. This\n        parameter correspond to the number of neighbours selected create the\n        subset in which the selection will be performed.  
If object, an\n        estimator that inherits from\n        :class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to\n        find the k_neighbors.\n        By default, it will be a 3-NN.\n\n    {n_jobs}\n\n    Attributes\n    ----------\n    sampling_strategy_ : dict\n        Dictionary containing the information to sample the dataset. The keys\n        correspond to the class labels from which to sample and the values\n        are the number of samples to sample.\n\n    nn_ : estimator object\n        Validated K-nearest Neighbours object created from `n_neighbors` parameter.\n\n    nn_ver3_ : estimator object\n        Validated K-nearest Neighbours object created from `n_neighbors_ver3` parameter.\n\n    sample_indices_ : ndarray of shape (n_new_samples,)\n        Indices of the samples selected.\n\n        .. versionadded:: 0.4\n\n    n_features_in_ : int\n        Number of features in the input dataset.\n\n        .. versionadded:: 0.9\n\n    feature_names_in_ : ndarray of shape (`n_features_in_`,)\n        Names of features seen during `fit`. Defined only when `X` has feature\n        names that are all strings.\n\n        .. versionadded:: 0.10\n\n    See Also\n    --------\n    RandomUnderSampler : Randomly undersample the dataset.\n\n    InstanceHardnessThreshold : Use a classifier to undersample a dataset.\n\n    Notes\n    -----\n    The methods are based on [1]_.\n\n    Supports multi-class resampling.\n\n    References\n    ----------\n    .. [1] I. Mani, I. Zhang. \"kNN approach to unbalanced data distributions:\n       a case study involving information extraction,\" In Proceedings of\n       workshop on learning from imbalanced datasets, 2003.\n\n    Examples\n    --------\n    >>> from collections import Counter\n    >>> from sklearn.datasets import make_classification\n    >>> from imblearn.under_sampling import NearMiss\n    >>> X, y = make_classification(n_classes=2, class_sep=2,\n    ... 
weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,\n    ... n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)\n    >>> print('Original dataset shape %s' % Counter(y))\n    Original dataset shape Counter({{1: 900, 0: 100}})\n    >>> nm = NearMiss()\n    >>> X_res, y_res = nm.fit_resample(X, y)\n    >>> print('Resampled dataset shape %s' % Counter(y_res))\n    Resampled dataset shape Counter({{0: 100, 1: 100}})\n    \"\"\"\n\n    _parameter_constraints: dict = {\n        **BaseUnderSampler._parameter_constraints,\n        \"version\": [Interval(numbers.Integral, 1, 3, closed=\"both\")],\n        \"n_neighbors\": [\n            Interval(numbers.Integral, 1, None, closed=\"left\"),\n            HasMethods([\"kneighbors\", \"kneighbors_graph\"]),\n        ],\n        \"n_neighbors_ver3\": [\n            Interval(numbers.Integral, 1, None, closed=\"left\"),\n            HasMethods([\"kneighbors\", \"kneighbors_graph\"]),\n        ],\n        \"n_jobs\": [numbers.Integral, None],\n    }\n\n    def __init__(\n        self,\n        *,\n        sampling_strategy=\"auto\",\n        version=1,\n        n_neighbors=3,\n        n_neighbors_ver3=3,\n        n_jobs=None,\n    ):\n        super().__init__(sampling_strategy=sampling_strategy)\n        self.version = version\n        self.n_neighbors = n_neighbors\n        self.n_neighbors_ver3 = n_neighbors_ver3\n        self.n_jobs = n_jobs\n\n    def _selection_dist_based(\n        self, X, y, dist_vec, num_samples, key, sel_strategy=\"nearest\"\n    ):\n        \"\"\"Select the appropriate samples depending on the strategy selected.\n\n        Parameters\n        ----------\n        X : {array-like, sparse matrix}, shape (n_samples, n_features)\n            Original samples.\n\n        y : array-like, shape (n_samples,)\n            Labels associated with X.\n\n        dist_vec : ndarray, shape (n_samples, )\n            The distance matrix to the nearest neighbour.\n\n        num_samples : int\n      
      The desired number of samples to select.\n\n        key : str or int,\n            The target class.\n\n        sel_strategy : str, optional (default='nearest')\n            Strategy to select the samples. Either 'nearest' or 'farthest'\n\n        Returns\n        -------\n        idx_sel : ndarray, shape (num_samples,)\n            The list of the indices of the selected samples.\n\n        \"\"\"\n\n        # Compute the distance considering the farthest neighbour\n        dist_avg_vec = np.sum(dist_vec[:, -self.nn_.n_neighbors :], axis=1)\n\n        target_class_indices = np.flatnonzero(y == key)\n        if dist_vec.shape[0] != _safe_indexing(X, target_class_indices).shape[0]:\n            raise RuntimeError(\n                \"The samples to be selected do not correspond\"\n                \" to the distance matrix given. Ensure that\"\n                \" both `X[y == key]` and `dist_vec` are\"\n                \" related.\"\n            )\n\n        # Sort the list of distance and get the index\n        if sel_strategy == \"nearest\":\n            sort_way = False\n        else:  # sel_strategy == \"farthest\":\n            sort_way = True\n\n        sorted_idx = sorted(\n            range(len(dist_avg_vec)),\n            key=dist_avg_vec.__getitem__,\n            reverse=sort_way,\n        )\n\n        # Throw a warning to tell the user that we did not have enough samples\n        # to select and that we just select everything\n        if len(sorted_idx) < num_samples:\n            warnings.warn(\n                \"The number of the samples to be selected is larger\"\n                \" than the number of samples available. 
The\"\n                \" balancing ratio cannot be ensured and all samples\"\n                \" will be returned.\"\n            )\n\n        # Select the desired number of samples\n        return sorted_idx[:num_samples]\n\n    def _validate_estimator(self):\n        \"\"\"Private function to create the NN estimator\"\"\"\n\n        self.nn_ = check_neighbors_object(\"n_neighbors\", self.n_neighbors)\n        self.nn_.set_params(**{\"n_jobs\": self.n_jobs})\n\n        if self.version == 3:\n            self.nn_ver3_ = check_neighbors_object(\n                \"n_neighbors_ver3\", self.n_neighbors_ver3\n            )\n            self.nn_ver3_.set_params(**{\"n_jobs\": self.n_jobs})\n\n    def _fit_resample(self, X, y):\n        self._validate_estimator()\n\n        idx_under = np.empty((0,), dtype=int)\n\n        target_stats = Counter(y)\n        class_minority = min(target_stats, key=target_stats.get)\n        minority_class_indices = np.flatnonzero(y == class_minority)\n\n        self.nn_.fit(_safe_indexing(X, minority_class_indices))\n\n        for target_class in np.unique(y):\n            if target_class in self.sampling_strategy_.keys():\n                n_samples = self.sampling_strategy_[target_class]\n                target_class_indices = np.flatnonzero(y == target_class)\n                X_class = _safe_indexing(X, target_class_indices)\n                y_class = _safe_indexing(y, target_class_indices)\n\n                if self.version == 1:\n                    dist_vec, idx_vec = self.nn_.kneighbors(\n                        X_class, n_neighbors=self.nn_.n_neighbors\n                    )\n                    index_target_class = self._selection_dist_based(\n                        X,\n                        y,\n                        dist_vec,\n                        n_samples,\n                        target_class,\n                        sel_strategy=\"nearest\",\n                    )\n                elif self.version == 2:\n               
     dist_vec, idx_vec = self.nn_.kneighbors(\n                        X_class, n_neighbors=target_stats[class_minority]\n                    )\n                    index_target_class = self._selection_dist_based(\n                        X,\n                        y,\n                        dist_vec,\n                        n_samples,\n                        target_class,\n                        sel_strategy=\"nearest\",\n                    )\n                elif self.version == 3:\n                    self.nn_ver3_.fit(X_class)\n                    dist_vec, idx_vec = self.nn_ver3_.kneighbors(\n                        _safe_indexing(X, minority_class_indices)\n                    )\n                    idx_vec_farthest = np.unique(idx_vec.reshape(-1))\n                    X_class_selected = _safe_indexing(X_class, idx_vec_farthest)\n                    y_class_selected = _safe_indexing(y_class, idx_vec_farthest)\n\n                    dist_vec, idx_vec = self.nn_.kneighbors(\n                        X_class_selected, n_neighbors=self.nn_.n_neighbors\n                    )\n                    index_target_class = self._selection_dist_based(\n                        X_class_selected,\n                        y_class_selected,\n                        dist_vec,\n                        n_samples,\n                        target_class,\n                        sel_strategy=\"farthest\",\n                    )\n                    # idx_tmp is relative to the feature selected in the\n                    # previous step and we need to find the indirection\n                    index_target_class = idx_vec_farthest[index_target_class]\n            else:\n                index_target_class = slice(None)\n\n            idx_under = np.concatenate(\n                (\n                    idx_under,\n                    np.flatnonzero(y == target_class)[index_target_class],\n                ),\n                axis=0,\n            )\n\n        self.sample_indices_ = 
idx_under\n\n        return _safe_indexing(X, idx_under), _safe_indexing(y, idx_under)\n\n    # fmt: off\n    def _more_tags(self):\n        return {\n            \"sample_indices\": True,\n            \"_xfail_checks\": {\n                \"check_samplers_fit_resample\":\n                \"Fails for NearMiss-3 with fewer samples than expected\"\n            }\n        }\n    # fmt: on\n\n    def __sklearn_tags__(self):\n        tags = super().__sklearn_tags__()\n        tags.sampler_tags.sample_indices = True\n        return tags\n"
  },
  {
    "path": "imblearn/under_sampling/_prototype_selection/_neighbourhood_cleaning_rule.py",
    "content": "\"\"\"Class performing under-sampling based on the neighbourhood cleaning rule.\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Christos Aridas\n# License: MIT\n\nimport numbers\nfrom collections import Counter\n\nimport numpy as np\nfrom sklearn.base import clone\nfrom sklearn.neighbors import KNeighborsClassifier, NearestNeighbors\nfrom sklearn.utils import _safe_indexing\nfrom sklearn.utils._param_validation import HasMethods, Interval\n\nfrom imblearn.under_sampling._prototype_selection._edited_nearest_neighbours import (\n    EditedNearestNeighbours,\n)\nfrom imblearn.under_sampling.base import BaseCleaningSampler\nfrom imblearn.utils import Substitution\nfrom imblearn.utils._docstring import _n_jobs_docstring\n\nSEL_KIND = (\"all\", \"mode\")\n\n\n@Substitution(\n    sampling_strategy=BaseCleaningSampler._sampling_strategy_docstring,\n    n_jobs=_n_jobs_docstring,\n)\nclass NeighbourhoodCleaningRule(BaseCleaningSampler):\n    \"\"\"Undersample based on the neighbourhood cleaning rule.\n\n    This class uses ENN and a k-NN to remove noisy samples from the datasets.\n\n    Read more in the :ref:`User Guide <condensed_nearest_neighbors>`.\n\n    Parameters\n    ----------\n    {sampling_strategy}\n\n    edited_nearest_neighbours : estimator object, default=None\n        The :class:`~imblearn.under_sampling.EditedNearestNeighbours` (ENN)\n        object to clean the dataset. If `None`, a default ENN is created with\n        `kind_sel=\"mode\"` and `n_neighbors=n_neighbors`.\n\n    n_neighbors : int or estimator object, default=3\n        If ``int``, size of the neighbourhood to consider to compute the\n        K-nearest neighbors. If object, an estimator that inherits from\n        :class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to\n        find the nearest-neighbors. 
By default, it will be a 3-NN.\n\n    threshold_cleaning : float, default=0.5\n        Threshold used to decide whether a class is considered during the\n        cleaning after applying ENN. A class will be considered during cleaning\n        when:\n\n        Ci > C x T,\n\n        where Ci and C are the number of samples in the class and in the\n        minority class, respectively, and T is the threshold.\n\n    {n_jobs}\n\n    Attributes\n    ----------\n    sampling_strategy_ : dict\n        Dictionary containing the information to sample the dataset. The keys\n        correspond to the class labels from which to sample and the values\n        are the number of samples to sample.\n\n    edited_nearest_neighbours_ : estimator object\n        The edited nearest neighbour object used to make the first resampling.\n\n    nn_ : estimator object\n        Validated K-nearest Neighbours object created from `n_neighbors` parameter.\n\n    classes_to_clean_ : list\n        The classes considered with under-sampling by `nn_` in the second cleaning\n        phase.\n\n    sample_indices_ : ndarray of shape (n_new_samples,)\n        Indices of the samples selected.\n\n        .. versionadded:: 0.4\n\n    n_features_in_ : int\n        Number of features in the input dataset.\n\n        .. versionadded:: 0.9\n\n    feature_names_in_ : ndarray of shape (`n_features_in_`,)\n        Names of features seen during `fit`. Defined only when `X` has feature\n        names that are all strings.\n\n        .. versionadded:: 0.10\n\n    See Also\n    --------\n    EditedNearestNeighbours : Undersample by editing noisy samples.\n\n    Notes\n    -----\n    See the original paper: [1]_.\n\n    Supports multi-class resampling. A one-vs.-rest scheme is used when\n    sampling a class as proposed in [1]_.\n\n    References\n    ----------\n    .. [1] J. 
Laurikkala, \"Improving identification of difficult small classes\n       by balancing class distribution,\" Springer Berlin Heidelberg, 2001.\n\n    Examples\n    --------\n    >>> from collections import Counter\n    >>> from sklearn.datasets import make_classification\n    >>> from imblearn.under_sampling import NeighbourhoodCleaningRule\n    >>> X, y = make_classification(n_classes=2, class_sep=2,\n    ... weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,\n    ... n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)\n    >>> print('Original dataset shape %s' % Counter(y))\n    Original dataset shape Counter({{1: 900, 0: 100}})\n    >>> ncr = NeighbourhoodCleaningRule()\n    >>> X_res, y_res = ncr.fit_resample(X, y)\n    >>> print('Resampled dataset shape %s' % Counter(y_res))\n    Resampled dataset shape Counter({{1: 888, 0: 100}})\n    \"\"\"\n\n    _parameter_constraints: dict = {\n        **BaseCleaningSampler._parameter_constraints,\n        \"edited_nearest_neighbours\": [\n            HasMethods([\"fit_resample\"]),\n            None,\n        ],\n        \"n_neighbors\": [\n            Interval(numbers.Integral, 1, None, closed=\"left\"),\n            HasMethods([\"kneighbors\", \"kneighbors_graph\"]),\n        ],\n        \"threshold_cleaning\": [Interval(numbers.Real, 0, None, closed=\"neither\")],\n        \"n_jobs\": [numbers.Integral, None],\n    }\n\n    def __init__(\n        self,\n        *,\n        sampling_strategy=\"auto\",\n        edited_nearest_neighbours=None,\n        n_neighbors=3,\n        threshold_cleaning=0.5,\n        n_jobs=None,\n    ):\n        super().__init__(sampling_strategy=sampling_strategy)\n        self.edited_nearest_neighbours = edited_nearest_neighbours\n        self.n_neighbors = n_neighbors\n        self.threshold_cleaning = threshold_cleaning\n        self.n_jobs = n_jobs\n\n    def _validate_estimator(self):\n        \"\"\"Create the objects required by NCR.\"\"\"\n        if 
isinstance(self.n_neighbors, numbers.Integral):\n            self.nn_ = KNeighborsClassifier(\n                n_neighbors=self.n_neighbors, n_jobs=self.n_jobs\n            )\n        elif isinstance(self.n_neighbors, NearestNeighbors):\n            # backward compatibility when passing a NearestNeighbors object\n            self.nn_ = KNeighborsClassifier(\n                n_neighbors=self.n_neighbors.n_neighbors - 1, n_jobs=self.n_jobs\n            )\n        else:\n            self.nn_ = clone(self.n_neighbors)\n\n        if self.edited_nearest_neighbours is None:\n            self.edited_nearest_neighbours_ = EditedNearestNeighbours(\n                sampling_strategy=self.sampling_strategy,\n                n_neighbors=self.n_neighbors,\n                kind_sel=\"mode\",\n                n_jobs=self.n_jobs,\n            )\n        else:\n            self.edited_nearest_neighbours_ = clone(self.edited_nearest_neighbours)\n\n    def _fit_resample(self, X, y):\n        self._validate_estimator()\n        self.edited_nearest_neighbours_.fit_resample(X, y)\n        index_not_a1 = self.edited_nearest_neighbours_.sample_indices_\n        index_a1 = np.ones(y.shape, dtype=bool)\n        index_a1[index_not_a1] = False\n        index_a1 = np.flatnonzero(index_a1)\n\n        # clean the neighborhood\n        target_stats = Counter(y)\n        class_minority = min(target_stats, key=target_stats.get)\n        # compute which classes to consider for cleaning for the A2 group\n        self.classes_to_clean_ = [\n            c\n            for c, n_samples in target_stats.items()\n            if (\n                c in self.sampling_strategy_.keys()\n                and (n_samples > target_stats[class_minority] * self.threshold_cleaning)\n            )\n        ]\n        self.nn_.fit(X, y)\n\n        class_minority_indices = np.flatnonzero(y == class_minority)\n        X_minority = _safe_indexing(X, class_minority_indices)\n        y_minority = _safe_indexing(y, 
class_minority_indices)\n\n        y_pred_minority = self.nn_.predict(X_minority)\n        # request one extra neighbour: each query point is part of the fitted data\n        # and is normally returned as its own nearest neighbour, which `[:, 1:]` drops\n        neighbors_to_minority_indices = self.nn_.kneighbors(\n            X_minority, n_neighbors=self.nn_.n_neighbors + 1, return_distance=False\n        )[:, 1:]\n\n        mask_misclassified_minority = y_pred_minority != y_minority\n        index_a2 = np.ravel(neighbors_to_minority_indices[mask_misclassified_minority])\n        index_a2 = np.array(\n            [\n                index\n                for index in np.unique(index_a2)\n                if y[index] in self.classes_to_clean_\n            ]\n        )\n\n        union_a1_a2 = np.union1d(index_a1, index_a2).astype(int)\n        selected_samples = np.ones(y.shape, dtype=bool)\n        selected_samples[union_a1_a2] = False\n        self.sample_indices_ = np.flatnonzero(selected_samples)\n\n        return (\n            _safe_indexing(X, self.sample_indices_),\n            _safe_indexing(y, self.sample_indices_),\n        )\n\n    def _more_tags(self):\n        return {\"sample_indices\": True}\n\n    def __sklearn_tags__(self):\n        tags = super().__sklearn_tags__()\n        tags.sampler_tags.sample_indices = True\n        return tags\n"
  },
  {
    "path": "imblearn/under_sampling/_prototype_selection/_one_sided_selection.py",
    "content": "\"\"\"Class to perform under-sampling based on one-sided selection method.\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Christos Aridas\n# License: MIT\n\nimport numbers\nfrom collections import Counter\n\nimport numpy as np\nfrom sklearn.base import clone\nfrom sklearn.neighbors import KNeighborsClassifier\nfrom sklearn.utils import _safe_indexing, check_random_state\nfrom sklearn.utils._param_validation import HasMethods, Interval\n\nfrom imblearn.under_sampling._prototype_selection._tomek_links import TomekLinks\nfrom imblearn.under_sampling.base import BaseCleaningSampler\nfrom imblearn.utils import Substitution\nfrom imblearn.utils._docstring import _n_jobs_docstring, _random_state_docstring\n\n\n@Substitution(\n    sampling_strategy=BaseCleaningSampler._sampling_strategy_docstring,\n    n_jobs=_n_jobs_docstring,\n    random_state=_random_state_docstring,\n)\nclass OneSidedSelection(BaseCleaningSampler):\n    \"\"\"Class to perform under-sampling based on one-sided selection method.\n\n    Read more in the :ref:`User Guide <condensed_nearest_neighbors>`.\n\n    Parameters\n    ----------\n    {sampling_strategy}\n\n    {random_state}\n\n    n_neighbors : int or estimator object, default=None\n        If ``int``, size of the neighbourhood to consider to compute the\n        nearest neighbors. If object, an estimator that inherits from\n        :class:`~sklearn.neighbors.base.KNeighborsMixin` that will be used to\n        find the nearest-neighbors. If `None`, a\n        :class:`~sklearn.neighbors.KNeighborsClassifier` with a 1-NN rules will\n        be used.\n\n    n_seeds_S : int, default=1\n        Number of samples to extract in order to build the set S.\n\n    {n_jobs}\n\n    Attributes\n    ----------\n    sampling_strategy_ : dict\n        Dictionary containing the information to sample the dataset. 
The keys\n        correspond to the class labels from which to sample and the values\n        are the number of samples to sample.\n\n    estimators_ : list of estimator objects of shape (n_resampled_classes - 1,)\n        Contains the K-nearest neighbor estimator used for each pair of classes.\n\n        .. versionadded:: 0.12\n\n    sample_indices_ : ndarray of shape (n_new_samples,)\n        Indices of the samples selected.\n\n        .. versionadded:: 0.4\n\n    n_features_in_ : int\n        Number of features in the input dataset.\n\n        .. versionadded:: 0.9\n\n    feature_names_in_ : ndarray of shape (`n_features_in_`,)\n        Names of features seen during `fit`. Defined only when `X` has feature\n        names that are all strings.\n\n        .. versionadded:: 0.10\n\n    See Also\n    --------\n    EditedNearestNeighbours : Undersample by editing noisy samples.\n\n    Notes\n    -----\n    The method is based on [1]_.\n\n    Supports multi-class resampling. A one-vs.-one scheme is used when sampling\n    a class as proposed in [1]_. For each class to be sampled, all samples of\n    this class and the minority class are used during the sampling procedure.\n\n    References\n    ----------\n    .. [1] M. Kubat, S. Matwin, \"Addressing the curse of imbalanced training\n       sets: one-sided selection,\" In ICML, vol. 97, pp. 179-186, 1997.\n\n    Examples\n    --------\n\n    >>> from collections import Counter\n    >>> from sklearn.datasets import make_classification\n    >>> from imblearn.under_sampling import OneSidedSelection\n    >>> X, y = make_classification(n_classes=2, class_sep=2,\n    ... weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,\n    ... 
n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)\n    >>> print('Original dataset shape %s' % Counter(y))\n    Original dataset shape Counter({{1: 900, 0: 100}})\n    >>> oss = OneSidedSelection(random_state=42)\n    >>> X_res, y_res = oss.fit_resample(X, y)\n    >>> print('Resampled dataset shape %s' % Counter(y_res))\n    Resampled dataset shape Counter({{1: 496, 0: 100}})\n    \"\"\"\n\n    _parameter_constraints: dict = {\n        **BaseCleaningSampler._parameter_constraints,\n        \"n_neighbors\": [\n            Interval(numbers.Integral, 1, None, closed=\"left\"),\n            HasMethods([\"kneighbors\", \"kneighbors_graph\"]),\n            None,\n        ],\n        \"n_seeds_S\": [Interval(numbers.Integral, 1, None, closed=\"left\")],\n        \"n_jobs\": [numbers.Integral, None],\n        \"random_state\": [\"random_state\"],\n    }\n\n    def __init__(\n        self,\n        *,\n        sampling_strategy=\"auto\",\n        random_state=None,\n        n_neighbors=None,\n        n_seeds_S=1,\n        n_jobs=None,\n    ):\n        super().__init__(sampling_strategy=sampling_strategy)\n        self.random_state = random_state\n        self.n_neighbors = n_neighbors\n        self.n_seeds_S = n_seeds_S\n        self.n_jobs = n_jobs\n\n    def _validate_estimator(self):\n        \"\"\"Private function to create the NN estimator\"\"\"\n        if self.n_neighbors is None:\n            estimator = KNeighborsClassifier(n_neighbors=1, n_jobs=self.n_jobs)\n        elif isinstance(self.n_neighbors, int):\n            estimator = KNeighborsClassifier(\n                n_neighbors=self.n_neighbors, n_jobs=self.n_jobs\n            )\n        elif isinstance(self.n_neighbors, KNeighborsClassifier):\n            estimator = clone(self.n_neighbors)\n\n        return estimator\n\n    def _fit_resample(self, X, y):\n        estimator = self._validate_estimator()\n\n        random_state = check_random_state(self.random_state)\n        target_stats = 
Counter(y)\n        class_minority = min(target_stats, key=target_stats.get)\n\n        idx_under = np.empty((0,), dtype=int)\n\n        self.estimators_ = []\n        for target_class in np.unique(y):\n            if target_class in self.sampling_strategy_.keys():\n                # select a sample from the current class\n                idx_maj = np.flatnonzero(y == target_class)\n                sel_idx_maj = random_state.randint(\n                    low=0, high=target_stats[target_class], size=self.n_seeds_S\n                )\n                idx_maj_sample = idx_maj[sel_idx_maj]\n\n                minority_class_indices = np.flatnonzero(y == class_minority)\n                C_indices = np.append(minority_class_indices, idx_maj_sample)\n\n                # create the set composed of all minority samples and one\n                # sample from the current class.\n                C_x = _safe_indexing(X, C_indices)\n                C_y = _safe_indexing(y, C_indices)\n\n                # create the set S with removing the seed from S\n                # since that it will be added anyway\n                idx_maj_extracted = np.delete(idx_maj, sel_idx_maj, axis=0)\n                S_x = _safe_indexing(X, idx_maj_extracted)\n                S_y = _safe_indexing(y, idx_maj_extracted)\n                self.estimators_.append(clone(estimator).fit(C_x, C_y))\n                pred_S_y = self.estimators_[-1].predict(S_x)\n\n                S_misclassified_indices = np.flatnonzero(pred_S_y != S_y)\n                idx_tmp = idx_maj_extracted[S_misclassified_indices]\n                idx_under = np.concatenate((idx_under, idx_maj_sample, idx_tmp), axis=0)\n            else:\n                idx_under = np.concatenate(\n                    (idx_under, np.flatnonzero(y == target_class)), axis=0\n                )\n\n        X_resampled = _safe_indexing(X, idx_under)\n        y_resampled = _safe_indexing(y, idx_under)\n\n        # apply Tomek cleaning\n        tl = 
TomekLinks(sampling_strategy=list(self.sampling_strategy_.keys()))\n        X_cleaned, y_cleaned = tl.fit_resample(X_resampled, y_resampled)\n\n        self.sample_indices_ = _safe_indexing(idx_under, tl.sample_indices_)\n\n        return X_cleaned, y_cleaned\n\n    def _more_tags(self):\n        return {\"sample_indices\": True}\n\n    def __sklearn_tags__(self):\n        tags = super().__sklearn_tags__()\n        tags.sampler_tags.sample_indices = True\n        return tags\n"
  },
  {
    "path": "imblearn/under_sampling/_prototype_selection/_random_under_sampler.py",
    "content": "\"\"\"Class to perform random under-sampling.\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Christos Aridas\n# License: MIT\n\nimport numpy as np\nfrom sklearn.utils import _safe_indexing, check_random_state\nfrom sklearn_compat.utils.validation import validate_data\n\nfrom imblearn.under_sampling.base import BaseUnderSampler\nfrom imblearn.utils import Substitution, check_target_type\nfrom imblearn.utils._docstring import _random_state_docstring\nfrom imblearn.utils._validation import _check_X\n\n\n@Substitution(\n    sampling_strategy=BaseUnderSampler._sampling_strategy_docstring,\n    random_state=_random_state_docstring,\n)\nclass RandomUnderSampler(BaseUnderSampler):\n    \"\"\"Class to perform random under-sampling.\n\n    Under-sample the majority class(es) by randomly picking samples\n    with or without replacement.\n\n    Read more in the :ref:`User Guide <controlled_under_sampling>`.\n\n    Parameters\n    ----------\n    {sampling_strategy}\n\n    {random_state}\n\n    replacement : bool, default=False\n        Whether the sample is with or without replacement.\n\n    Attributes\n    ----------\n    sampling_strategy_ : dict\n        Dictionary containing the information to sample the dataset. The keys\n        corresponds to the class labels from which to sample and the values\n        are the number of samples to sample.\n\n    sample_indices_ : ndarray of shape (n_new_samples,)\n        Indices of the samples selected.\n\n        .. versionadded:: 0.4\n\n    n_features_in_ : int\n        Number of features in the input dataset.\n\n        .. versionadded:: 0.9\n\n    feature_names_in_ : ndarray of shape (`n_features_in_`,)\n        Names of features seen during `fit`. Defined only when `X` has feature\n        names that are all strings.\n\n        .. 
versionadded:: 0.10\n\n    See Also\n    --------\n    NearMiss : Undersample using near-miss samples.\n\n    Notes\n    -----\n    Supports multi-class resampling by sampling each class independently.\n    Supports heterogeneous data as object array containing string and numeric\n    data.\n\n    Examples\n    --------\n    >>> from collections import Counter\n    >>> from sklearn.datasets import make_classification\n    >>> from imblearn.under_sampling import RandomUnderSampler\n    >>> X, y = make_classification(n_classes=2, class_sep=2,\n    ...  weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,\n    ... n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)\n    >>> print('Original dataset shape %s' % Counter(y))\n    Original dataset shape Counter({{1: 900, 0: 100}})\n    >>> rus = RandomUnderSampler(random_state=42)\n    >>> X_res, y_res = rus.fit_resample(X, y)\n    >>> print('Resampled dataset shape %s' % Counter(y_res))\n    Resampled dataset shape Counter({{0: 100, 1: 100}})\n    \"\"\"\n\n    _parameter_constraints: dict = {\n        **BaseUnderSampler._parameter_constraints,\n        \"replacement\": [\"boolean\"],\n        \"random_state\": [\"random_state\"],\n    }\n\n    def __init__(\n        self, *, sampling_strategy=\"auto\", random_state=None, replacement=False\n    ):\n        super().__init__(sampling_strategy=sampling_strategy)\n        self.random_state = random_state\n        self.replacement = replacement\n\n    def _check_X_y(self, X, y):\n        y, binarize_y = check_target_type(y, indicate_one_vs_all=True)\n        X = _check_X(X)\n        X, y = validate_data(self, X=X, y=y, reset=True, skip_check_array=True)\n        return X, y, binarize_y\n\n    def _fit_resample(self, X, y):\n        random_state = check_random_state(self.random_state)\n\n        idx_under = np.empty((0,), dtype=int)\n\n        for target_class in np.unique(y):\n            if target_class in self.sampling_strategy_.keys():\n         
       n_samples = self.sampling_strategy_[target_class]\n                index_target_class = random_state.choice(\n                    range(np.count_nonzero(y == target_class)),\n                    size=n_samples,\n                    replace=self.replacement,\n                )\n            else:\n                index_target_class = slice(None)\n\n            idx_under = np.concatenate(\n                (\n                    idx_under,\n                    np.flatnonzero(y == target_class)[index_target_class],\n                ),\n                axis=0,\n            )\n\n        self.sample_indices_ = idx_under\n\n        return _safe_indexing(X, idx_under), _safe_indexing(y, idx_under)\n\n    def _more_tags(self):\n        return {\n            \"X_types\": [\"2darray\", \"string\", \"sparse\", \"dataframe\"],\n            \"sample_indices\": True,\n            \"allow_nan\": True,\n            \"_xfail_checks\": {\n                \"check_complex_data\": \"Robust to this type of data.\",\n            },\n        }\n\n    def __sklearn_tags__(self):\n        tags = super().__sklearn_tags__()\n        tags.input_tags.allow_nan = True\n        tags.input_tags.string = True\n        tags.sampler_tags.sample_indices = True\n        return tags\n"
  },
  {
    "path": "imblearn/under_sampling/_prototype_selection/_tomek_links.py",
    "content": "\"\"\"Class to perform under-sampling by removing Tomek's links.\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Fernando Nogueira\n#          Christos Aridas\n# License: MIT\n\nimport numbers\n\nimport numpy as np\nfrom sklearn.neighbors import NearestNeighbors\nfrom sklearn.utils import _safe_indexing\n\nfrom imblearn.under_sampling.base import BaseCleaningSampler\nfrom imblearn.utils import Substitution\nfrom imblearn.utils._docstring import _n_jobs_docstring\n\n\n@Substitution(\n    sampling_strategy=BaseCleaningSampler._sampling_strategy_docstring,\n    n_jobs=_n_jobs_docstring,\n)\nclass TomekLinks(BaseCleaningSampler):\n    \"\"\"Under-sampling by removing Tomek's links.\n\n    Read more in the :ref:`User Guide <tomek_links>`.\n\n    Parameters\n    ----------\n    {sampling_strategy}\n\n    {n_jobs}\n\n    Attributes\n    ----------\n    sampling_strategy_ : dict\n        Dictionary containing the information to sample the dataset. The keys\n        corresponds to the class labels from which to sample and the values\n        are the number of samples to sample.\n\n    sample_indices_ : ndarray of shape (n_new_samples,)\n        Indices of the samples selected.\n\n        .. versionadded:: 0.4\n\n    n_features_in_ : int\n        Number of features in the input dataset.\n\n        .. versionadded:: 0.9\n\n    feature_names_in_ : ndarray of shape (`n_features_in_`,)\n        Names of features seen during `fit`. Defined only when `X` has feature\n        names that are all strings.\n\n        .. versionadded:: 0.10\n\n    See Also\n    --------\n    EditedNearestNeighbours : Undersample by samples edition.\n\n    CondensedNearestNeighbour : Undersample by samples condensation.\n\n    RandomUnderSampler : Randomly under-sample the dataset.\n\n    Notes\n    -----\n    This method is based on [1]_.\n\n    Supports multi-class resampling. 
A one-vs.-rest scheme is used as\n    originally proposed in [1]_.\n\n    References\n    ----------\n    .. [1] I. Tomek, \"Two modifications of CNN,\" In Systems, Man, and\n       Cybernetics, IEEE Transactions on, vol. 6, pp 769-772, 1976.\n\n    Examples\n    --------\n    >>> from collections import Counter\n    >>> from sklearn.datasets import make_classification\n    >>> from imblearn.under_sampling import TomekLinks\n    >>> X, y = make_classification(n_classes=2, class_sep=2,\n    ... weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,\n    ... n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)\n    >>> print('Original dataset shape %s' % Counter(y))\n    Original dataset shape Counter({{1: 900, 0: 100}})\n    >>> tl = TomekLinks()\n    >>> X_res, y_res = tl.fit_resample(X, y)\n    >>> print('Resampled dataset shape %s' % Counter(y_res))\n    Resampled dataset shape Counter({{1: 897, 0: 100}})\n    \"\"\"\n\n    _parameter_constraints: dict = {\n        **BaseCleaningSampler._parameter_constraints,\n        \"n_jobs\": [numbers.Integral, None],\n    }\n\n    def __init__(self, *, sampling_strategy=\"auto\", n_jobs=None):\n        super().__init__(sampling_strategy=sampling_strategy)\n        self.n_jobs = n_jobs\n\n    @staticmethod\n    def is_tomek(y, nn_index, class_type):\n        \"\"\"Detect if samples are Tomek's link.\n\n        More precisely, it uses the target vector and the first neighbour of\n        every sample point and looks for Tomek pairs. 
Returning a boolean\n        vector with True for majority Tomek links.\n\n        Parameters\n        ----------\n        y : ndarray of shape (n_samples,)\n            Target vector of the data set, necessary to keep track of whether a\n            sample belongs to minority or not.\n\n        nn_index : ndarray of shape (len(y),)\n            The index of the closes nearest neighbour to a sample point.\n\n        class_type : int or str\n            The label of the minority class.\n\n        Returns\n        -------\n        is_tomek : ndarray of shape (len(y), )\n            Boolean vector on len( # samples ), with True for majority samples\n            that are Tomek links.\n        \"\"\"\n        links = np.zeros(len(y), dtype=bool)\n\n        # find which class to not consider\n        class_excluded = [c for c in np.unique(y) if c not in class_type]\n\n        # there is a Tomek link between two samples if they are both nearest\n        # neighbors of each others.\n        for index_sample, target_sample in enumerate(y):\n            if target_sample in class_excluded:\n                continue\n\n            if y[nn_index[index_sample]] != target_sample:\n                if nn_index[nn_index[index_sample]] == index_sample:\n                    links[index_sample] = True\n\n        return links\n\n    def _fit_resample(self, X, y):\n        # Find the nearest neighbour of every point\n        nn = NearestNeighbors(n_neighbors=2, n_jobs=self.n_jobs)\n        nn.fit(X)\n        nns = nn.kneighbors(X, return_distance=False)[:, 1]\n\n        links = self.is_tomek(y, nns, self.sampling_strategy_)\n        self.sample_indices_ = np.flatnonzero(np.logical_not(links))\n\n        return (\n            _safe_indexing(X, self.sample_indices_),\n            _safe_indexing(y, self.sample_indices_),\n        )\n\n    def _more_tags(self):\n        return {\"sample_indices\": True}\n\n    def __sklearn_tags__(self):\n        tags = super().__sklearn_tags__()\n        
tags.sampler_tags.sample_indices = True\n        return tags\n"
  },
  {
    "path": "imblearn/under_sampling/_prototype_selection/tests/__init__.py",
    "content": ""
  },
  {
    "path": "imblearn/under_sampling/_prototype_selection/tests/test_allknn.py",
    "content": "\"\"\"Test the module repeated edited nearest neighbour.\"\"\"\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Christos Aridas\n# License: MIT\n\nimport numpy as np\nimport pytest\nfrom sklearn.datasets import make_classification\nfrom sklearn.neighbors import NearestNeighbors\nfrom sklearn.utils._testing import assert_allclose, assert_array_equal\n\nfrom imblearn.under_sampling import AllKNN\n\nX = np.array(\n    [\n        [-0.12840393, 0.66446571],\n        [1.32319756, -0.13181616],\n        [0.04296502, -0.37981873],\n        [0.83631853, 0.18569783],\n        [1.02956816, 0.36061601],\n        [1.12202806, 0.33811558],\n        [-0.53171468, -0.53735182],\n        [1.3381556, 0.35956356],\n        [-0.35946678, 0.72510189],\n        [1.32326943, 0.28393874],\n        [2.94290565, -0.13986434],\n        [0.28294738, -1.00125525],\n        [0.34218094, -0.58781961],\n        [-0.88864036, -0.33782387],\n        [-1.10146139, 0.91782682],\n        [-0.7969716, -0.50493969],\n        [0.73489726, 0.43915195],\n        [0.2096964, -0.61814058],\n        [-0.28479268, 0.70459548],\n        [1.84864913, 0.14729596],\n        [1.59068979, -0.96622933],\n        [0.73418199, -0.02222847],\n        [0.50307437, 0.498805],\n        [0.84929742, 0.41042894],\n        [0.62649535, 0.46600596],\n        [0.79270821, -0.41386668],\n        [1.16606871, -0.25641059],\n        [1.57356906, 0.30390519],\n        [1.0304995, -0.16955962],\n        [1.67314371, 0.19231498],\n        [0.98382284, 0.37184502],\n        [0.48921682, -1.38504507],\n        [-0.46226554, -0.50481004],\n        [-0.03918551, -0.68540745],\n        [0.24991051, -1.00864997],\n        [0.80541964, -0.34465185],\n        [0.1732627, -1.61323172],\n        [0.69804044, 0.44810796],\n        [-0.5506368, -0.42072426],\n        [-0.34474418, 0.21969797],\n    ]\n)\nY = np.array(\n    [\n        1,\n        2,\n        2,\n        2,\n        1,\n        1,\n        0,\n   
     2,\n        1,\n        1,\n        1,\n        2,\n        2,\n        0,\n        1,\n        2,\n        1,\n        2,\n        1,\n        1,\n        2,\n        2,\n        1,\n        1,\n        1,\n        2,\n        2,\n        2,\n        2,\n        1,\n        1,\n        2,\n        0,\n        2,\n        2,\n        2,\n        2,\n        1,\n        2,\n        0,\n    ]\n)\nR_TOL = 1e-4\n\n\ndef test_allknn_fit_resample():\n    allknn = AllKNN()\n    X_resampled, y_resampled = allknn.fit_resample(X, Y)\n\n    X_gt = np.array(\n        [\n            [-0.53171468, -0.53735182],\n            [-0.88864036, -0.33782387],\n            [-0.46226554, -0.50481004],\n            [-0.34474418, 0.21969797],\n            [1.02956816, 0.36061601],\n            [1.12202806, 0.33811558],\n            [-1.10146139, 0.91782682],\n            [0.73489726, 0.43915195],\n            [0.50307437, 0.498805],\n            [0.84929742, 0.41042894],\n            [0.62649535, 0.46600596],\n            [0.98382284, 0.37184502],\n            [0.69804044, 0.44810796],\n            [0.04296502, -0.37981873],\n            [0.28294738, -1.00125525],\n            [0.34218094, -0.58781961],\n            [0.2096964, -0.61814058],\n            [1.59068979, -0.96622933],\n            [0.73418199, -0.02222847],\n            [0.79270821, -0.41386668],\n            [1.16606871, -0.25641059],\n            [1.0304995, -0.16955962],\n            [0.48921682, -1.38504507],\n            [-0.03918551, -0.68540745],\n            [0.24991051, -1.00864997],\n            [0.80541964, -0.34465185],\n            [0.1732627, -1.61323172],\n        ]\n    )\n    y_gt = np.array(\n        [\n            0,\n            0,\n            0,\n            0,\n            1,\n            1,\n            1,\n            1,\n            1,\n            1,\n            1,\n            1,\n            1,\n            2,\n            2,\n            2,\n            2,\n            2,\n            2,\n    
        2,\n            2,\n            2,\n            2,\n            2,\n            2,\n            2,\n            2,\n        ]\n    )\n    assert_allclose(X_resampled, X_gt, rtol=R_TOL)\n    assert_allclose(y_resampled, y_gt, rtol=R_TOL)\n\n\ndef test_all_knn_allow_minority():\n    X, y = make_classification(\n        n_samples=10000,\n        n_features=2,\n        n_informative=2,\n        n_redundant=0,\n        n_repeated=0,\n        n_classes=3,\n        n_clusters_per_class=1,\n        weights=[0.2, 0.3, 0.5],\n        class_sep=0.4,\n        random_state=0,\n    )\n\n    allknn = AllKNN(allow_minority=True)\n    X_res_1, y_res_1 = allknn.fit_resample(X, y)\n    allknn = AllKNN()\n    X_res_2, y_res_2 = allknn.fit_resample(X, y)\n    assert len(y_res_1) < len(y_res_2)\n\n\ndef test_allknn_fit_resample_mode():\n    allknn = AllKNN(kind_sel=\"mode\")\n    X_resampled, y_resampled = allknn.fit_resample(X, Y)\n\n    X_gt = np.array(\n        [\n            [-0.53171468, -0.53735182],\n            [-0.88864036, -0.33782387],\n            [-0.46226554, -0.50481004],\n            [-0.34474418, 0.21969797],\n            [-0.12840393, 0.66446571],\n            [1.02956816, 0.36061601],\n            [1.12202806, 0.33811558],\n            [-0.35946678, 0.72510189],\n            [-1.10146139, 0.91782682],\n            [0.73489726, 0.43915195],\n            [-0.28479268, 0.70459548],\n            [0.50307437, 0.498805],\n            [0.84929742, 0.41042894],\n            [0.62649535, 0.46600596],\n            [0.98382284, 0.37184502],\n            [0.69804044, 0.44810796],\n            [1.32319756, -0.13181616],\n            [0.04296502, -0.37981873],\n            [0.28294738, -1.00125525],\n            [0.34218094, -0.58781961],\n            [0.2096964, -0.61814058],\n            [1.59068979, -0.96622933],\n            [0.73418199, -0.02222847],\n            [0.79270821, -0.41386668],\n            [1.16606871, -0.25641059],\n            [1.0304995, -0.16955962],\n 
           [0.48921682, -1.38504507],\n            [-0.03918551, -0.68540745],\n            [0.24991051, -1.00864997],\n            [0.80541964, -0.34465185],\n            [0.1732627, -1.61323172],\n        ]\n    )\n    y_gt = np.array(\n        [\n            0,\n            0,\n            0,\n            0,\n            1,\n            1,\n            1,\n            1,\n            1,\n            1,\n            1,\n            1,\n            1,\n            1,\n            1,\n            1,\n            2,\n            2,\n            2,\n            2,\n            2,\n            2,\n            2,\n            2,\n            2,\n            2,\n            2,\n            2,\n            2,\n            2,\n            2,\n        ]\n    )\n    assert_array_equal(X_resampled, X_gt)\n    assert_array_equal(y_resampled, y_gt)\n\n\ndef test_allknn_fit_resample_with_nn_object():\n    nn = NearestNeighbors(n_neighbors=4)\n    allknn = AllKNN(n_neighbors=nn, kind_sel=\"mode\")\n    X_resampled, y_resampled = allknn.fit_resample(X, Y)\n\n    X_gt = np.array(\n        [\n            [-0.53171468, -0.53735182],\n            [-0.88864036, -0.33782387],\n            [-0.46226554, -0.50481004],\n            [-0.34474418, 0.21969797],\n            [-0.12840393, 0.66446571],\n            [1.02956816, 0.36061601],\n            [1.12202806, 0.33811558],\n            [-0.35946678, 0.72510189],\n            [-1.10146139, 0.91782682],\n            [0.73489726, 0.43915195],\n            [-0.28479268, 0.70459548],\n            [0.50307437, 0.498805],\n            [0.84929742, 0.41042894],\n            [0.62649535, 0.46600596],\n            [0.98382284, 0.37184502],\n            [0.69804044, 0.44810796],\n            [1.32319756, -0.13181616],\n            [0.04296502, -0.37981873],\n            [0.28294738, -1.00125525],\n            [0.34218094, -0.58781961],\n            [0.2096964, -0.61814058],\n            [1.59068979, -0.96622933],\n            [0.73418199, 
-0.02222847],\n            [0.79270821, -0.41386668],\n            [1.16606871, -0.25641059],\n            [1.0304995, -0.16955962],\n            [0.48921682, -1.38504507],\n            [-0.03918551, -0.68540745],\n            [0.24991051, -1.00864997],\n            [0.80541964, -0.34465185],\n            [0.1732627, -1.61323172],\n        ]\n    )\n    y_gt = np.array(\n        [\n            0,\n            0,\n            0,\n            0,\n            1,\n            1,\n            1,\n            1,\n            1,\n            1,\n            1,\n            1,\n            1,\n            1,\n            1,\n            1,\n            2,\n            2,\n            2,\n            2,\n            2,\n            2,\n            2,\n            2,\n            2,\n            2,\n            2,\n            2,\n            2,\n            2,\n            2,\n        ]\n    )\n    assert_array_equal(X_resampled, X_gt)\n    assert_array_equal(y_resampled, y_gt)\n\n\ndef test_alknn_not_good_object():\n    nn = \"rnd\"\n    allknn = AllKNN(n_neighbors=nn, kind_sel=\"mode\")\n    with pytest.raises(ValueError):\n        allknn.fit_resample(X, Y)\n"
  },
  {
    "path": "imblearn/under_sampling/_prototype_selection/tests/test_condensed_nearest_neighbour.py",
    "content": "\"\"\"Test the module condensed nearest neighbour.\"\"\"\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Christos Aridas\n# License: MIT\n\nimport numpy as np\nimport pytest\nfrom sklearn.datasets import make_classification\nfrom sklearn.neighbors import KNeighborsClassifier\nfrom sklearn.utils._testing import assert_array_equal\n\nfrom imblearn.under_sampling import CondensedNearestNeighbour\n\nRND_SEED = 0\nX = np.array(\n    [\n        [2.59928271, 0.93323465],\n        [0.25738379, 0.95564169],\n        [1.42772181, 0.526027],\n        [1.92365863, 0.82718767],\n        [-0.10903849, -0.12085181],\n        [-0.284881, -0.62730973],\n        [0.57062627, 1.19528323],\n        [0.03394306, 0.03986753],\n        [0.78318102, 2.59153329],\n        [0.35831463, 1.33483198],\n        [-0.14313184, -1.0412815],\n        [0.01936241, 0.17799828],\n        [-1.25020462, -0.40402054],\n        [-0.09816301, -0.74662486],\n        [-0.01252787, 0.34102657],\n        [0.52726792, -0.38735648],\n        [0.2821046, -0.07862747],\n        [0.05230552, 0.09043907],\n        [0.15198585, 0.12512646],\n        [0.70524765, 0.39816382],\n    ]\n)\nY = np.array([1, 2, 1, 1, 0, 2, 2, 2, 2, 2, 2, 0, 1, 2, 2, 2, 2, 1, 2, 1])\n\n\ndef test_cnn_init():\n    cnn = CondensedNearestNeighbour(random_state=RND_SEED)\n\n    assert cnn.n_seeds_S == 1\n    assert cnn.n_jobs is None\n\n\ndef test_cnn_fit_resample():\n    cnn = CondensedNearestNeighbour(random_state=RND_SEED)\n    X_resampled, y_resampled = cnn.fit_resample(X, Y)\n\n    X_gt = np.array(\n        [\n            [-0.10903849, -0.12085181],\n            [0.01936241, 0.17799828],\n            [0.05230552, 0.09043907],\n            [-1.25020462, -0.40402054],\n            [0.70524765, 0.39816382],\n            [0.35831463, 1.33483198],\n            [-0.284881, -0.62730973],\n            [0.03394306, 0.03986753],\n            [-0.01252787, 0.34102657],\n            [0.15198585, 0.12512646],\n       
 ]\n    )\n    y_gt = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])\n    assert_array_equal(X_resampled, X_gt)\n    assert_array_equal(y_resampled, y_gt)\n\n\n@pytest.mark.parametrize(\"n_neighbors\", [1, KNeighborsClassifier(n_neighbors=1)])\ndef test_cnn_fit_resample_with_object(n_neighbors):\n    cnn = CondensedNearestNeighbour(random_state=RND_SEED, n_neighbors=n_neighbors)\n    X_resampled, y_resampled = cnn.fit_resample(X, Y)\n\n    X_gt = np.array(\n        [\n            [-0.10903849, -0.12085181],\n            [0.01936241, 0.17799828],\n            [0.05230552, 0.09043907],\n            [-1.25020462, -0.40402054],\n            [0.70524765, 0.39816382],\n            [0.35831463, 1.33483198],\n            [-0.284881, -0.62730973],\n            [0.03394306, 0.03986753],\n            [-0.01252787, 0.34102657],\n            [0.15198585, 0.12512646],\n        ]\n    )\n    y_gt = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])\n    assert_array_equal(X_resampled, X_gt)\n    assert_array_equal(y_resampled, y_gt)\n\n    cnn = CondensedNearestNeighbour(random_state=RND_SEED, n_neighbors=1)\n    X_resampled, y_resampled = cnn.fit_resample(X, Y)\n    assert_array_equal(X_resampled, X_gt)\n    assert_array_equal(y_resampled, y_gt)\n\n\ndef test_condensed_nearest_neighbour_multiclass():\n    \"\"\"Check the validity of the fitted attributes `estimators_`.\"\"\"\n    X, y = make_classification(\n        n_samples=1_000,\n        n_classes=4,\n        weights=[0.1, 0.2, 0.2, 0.5],\n        n_clusters_per_class=1,\n        random_state=0,\n    )\n    cnn = CondensedNearestNeighbour(random_state=RND_SEED)\n    cnn.fit_resample(X, y)\n\n    assert len(cnn.estimators_) == len(cnn.sampling_strategy_)\n    other_classes = []\n    for est in cnn.estimators_:\n        assert est.classes_[0] == 0  # minority class\n        assert est.classes_[1] in {1, 2, 3}  # other classes\n        other_classes.append(est.classes_[1])\n    assert len(set(other_classes)) == len(other_classes)\n"
  },
  {
    "path": "imblearn/under_sampling/_prototype_selection/tests/test_edited_nearest_neighbours.py",
    "content": "\"\"\"Test the module edited nearest neighbour.\"\"\"\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Christos Aridas\n# License: MIT\n\nimport numpy as np\nfrom sklearn.datasets import make_classification\nfrom sklearn.neighbors import NearestNeighbors\nfrom sklearn.utils._testing import assert_array_equal\n\nfrom imblearn.under_sampling import EditedNearestNeighbours\n\nX = np.array(\n    [\n        [2.59928271, 0.93323465],\n        [0.25738379, 0.95564169],\n        [1.42772181, 0.526027],\n        [1.92365863, 0.82718767],\n        [-0.10903849, -0.12085181],\n        [-0.284881, -0.62730973],\n        [0.57062627, 1.19528323],\n        [0.03394306, 0.03986753],\n        [0.78318102, 2.59153329],\n        [0.35831463, 1.33483198],\n        [-0.14313184, -1.0412815],\n        [0.01936241, 0.17799828],\n        [-1.25020462, -0.40402054],\n        [-0.09816301, -0.74662486],\n        [-0.01252787, 0.34102657],\n        [0.52726792, -0.38735648],\n        [0.2821046, -0.07862747],\n        [0.05230552, 0.09043907],\n        [0.15198585, 0.12512646],\n        [0.70524765, 0.39816382],\n    ]\n)\nY = np.array([1, 2, 1, 1, 0, 2, 2, 2, 2, 2, 2, 0, 1, 2, 2, 2, 2, 1, 2, 1])\n\n\ndef test_enn_init():\n    enn = EditedNearestNeighbours()\n\n    assert enn.n_neighbors == 3\n    assert enn.kind_sel == \"all\"\n    assert enn.n_jobs is None\n\n\ndef test_enn_fit_resample():\n    enn = EditedNearestNeighbours()\n    X_resampled, y_resampled = enn.fit_resample(X, Y)\n\n    X_gt = np.array(\n        [\n            [-0.10903849, -0.12085181],\n            [0.01936241, 0.17799828],\n            [2.59928271, 0.93323465],\n            [1.92365863, 0.82718767],\n            [0.25738379, 0.95564169],\n            [0.78318102, 2.59153329],\n            [0.52726792, -0.38735648],\n        ]\n    )\n    y_gt = np.array([0, 0, 1, 1, 2, 2, 2])\n    assert_array_equal(X_resampled, X_gt)\n    assert_array_equal(y_resampled, y_gt)\n\n\ndef 
test_enn_fit_resample_mode():\n    enn = EditedNearestNeighbours(kind_sel=\"mode\")\n    X_resampled, y_resampled = enn.fit_resample(X, Y)\n\n    X_gt = np.array(\n        [\n            [-0.10903849, -0.12085181],\n            [0.01936241, 0.17799828],\n            [2.59928271, 0.93323465],\n            [1.42772181, 0.526027],\n            [1.92365863, 0.82718767],\n            [0.25738379, 0.95564169],\n            [-0.284881, -0.62730973],\n            [0.57062627, 1.19528323],\n            [0.78318102, 2.59153329],\n            [0.35831463, 1.33483198],\n            [-0.14313184, -1.0412815],\n            [-0.09816301, -0.74662486],\n            [0.52726792, -0.38735648],\n            [0.2821046, -0.07862747],\n        ]\n    )\n    y_gt = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2])\n    assert_array_equal(X_resampled, X_gt)\n    assert_array_equal(y_resampled, y_gt)\n\n\ndef test_enn_fit_resample_with_nn_object():\n    nn = NearestNeighbors(n_neighbors=4)\n    enn = EditedNearestNeighbours(n_neighbors=nn, kind_sel=\"mode\")\n    X_resampled, y_resampled = enn.fit_resample(X, Y)\n\n    X_gt = np.array(\n        [\n            [-0.10903849, -0.12085181],\n            [0.01936241, 0.17799828],\n            [2.59928271, 0.93323465],\n            [1.42772181, 0.526027],\n            [1.92365863, 0.82718767],\n            [0.25738379, 0.95564169],\n            [-0.284881, -0.62730973],\n            [0.57062627, 1.19528323],\n            [0.78318102, 2.59153329],\n            [0.35831463, 1.33483198],\n            [-0.14313184, -1.0412815],\n            [-0.09816301, -0.74662486],\n            [0.52726792, -0.38735648],\n            [0.2821046, -0.07862747],\n        ]\n    )\n    y_gt = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2])\n    assert_array_equal(X_resampled, X_gt)\n    assert_array_equal(y_resampled, y_gt)\n\n\ndef test_enn_check_kind_selection():\n    \"\"\"Check that `kind_sel=\"all\"` is more conservative than\n    
`kind_sel=\"mode\"`.\"\"\"\n\n    X, y = make_classification(\n        n_samples=1000,\n        n_classes=2,\n        weights=[0.3, 0.7],\n        random_state=0,\n    )\n\n    enn_all = EditedNearestNeighbours(kind_sel=\"all\")\n    enn_mode = EditedNearestNeighbours(kind_sel=\"mode\")\n\n    enn_all.fit_resample(X, y)\n    enn_mode.fit_resample(X, y)\n\n    assert enn_all.sample_indices_.size < enn_mode.sample_indices_.size\n"
  },
  {
    "path": "imblearn/under_sampling/_prototype_selection/tests/test_instance_hardness_threshold.py",
    "content": "\"\"\"Test the module .\"\"\"\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Christos Aridas\n# License: MIT\n\nimport numpy as np\nfrom sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier\nfrom sklearn.naive_bayes import GaussianNB as NB\nfrom sklearn.pipeline import make_pipeline\nfrom sklearn.utils._testing import assert_array_equal\n\nfrom imblearn.under_sampling import InstanceHardnessThreshold\n\nRND_SEED = 0\nX = np.array(\n    [\n        [-0.3879569, 0.6894251],\n        [-0.09322739, 1.28177189],\n        [-0.77740357, 0.74097941],\n        [0.91542919, -0.65453327],\n        [-0.03852113, 0.40910479],\n        [-0.43877303, 1.07366684],\n        [-0.85795321, 0.82980738],\n        [-0.18430329, 0.52328473],\n        [-0.30126957, -0.66268378],\n        [-0.65571327, 0.42412021],\n        [-0.28305528, 0.30284991],\n        [0.20246714, -0.34727125],\n        [1.06446472, -1.09279772],\n        [0.30543283, -0.02589502],\n        [-0.00717161, 0.00318087],\n    ]\n)\nY = np.array([0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0])\nESTIMATOR = GradientBoostingClassifier(random_state=RND_SEED)\n\n\ndef test_iht_init():\n    sampling_strategy = \"auto\"\n    iht = InstanceHardnessThreshold(\n        estimator=ESTIMATOR,\n        sampling_strategy=sampling_strategy,\n        random_state=RND_SEED,\n    )\n\n    assert iht.sampling_strategy == sampling_strategy\n    assert iht.random_state == RND_SEED\n\n\ndef test_iht_fit_resample():\n    iht = InstanceHardnessThreshold(estimator=ESTIMATOR, random_state=RND_SEED)\n    X_resampled, y_resampled = iht.fit_resample(X, Y)\n    assert X_resampled.shape == (12, 2)\n    assert y_resampled.shape == (12,)\n\n\ndef test_iht_fit_resample_half():\n    sampling_strategy = {0: 3, 1: 3}\n    iht = InstanceHardnessThreshold(\n        estimator=NB(),\n        sampling_strategy=sampling_strategy,\n        random_state=RND_SEED,\n    )\n    X_resampled, y_resampled = 
iht.fit_resample(X, Y)\n    assert X_resampled.shape == (6, 2)\n    assert y_resampled.shape == (6,)\n\n\ndef test_iht_fit_resample_class_obj():\n    est = GradientBoostingClassifier(random_state=RND_SEED)\n    iht = InstanceHardnessThreshold(estimator=est, random_state=RND_SEED)\n    X_resampled, y_resampled = iht.fit_resample(X, Y)\n    assert X_resampled.shape == (12, 2)\n    assert y_resampled.shape == (12,)\n\n\ndef test_iht_reproducibility():\n    from sklearn.datasets import load_digits\n\n    X_digits, y_digits = load_digits(return_X_y=True)\n    idx_sampled = []\n    for seed in range(5):\n        est = RandomForestClassifier(n_estimators=10, random_state=seed)\n        iht = InstanceHardnessThreshold(estimator=est, random_state=RND_SEED)\n        iht.fit_resample(X_digits, y_digits)\n        idx_sampled.append(iht.sample_indices_.copy())\n    for idx_1, idx_2 in zip(idx_sampled, idx_sampled[1:]):\n        assert_array_equal(idx_1, idx_2)\n\n\ndef test_iht_fit_resample_default_estimator():\n    iht = InstanceHardnessThreshold(estimator=None, random_state=RND_SEED)\n    X_resampled, y_resampled = iht.fit_resample(X, Y)\n    assert isinstance(iht.estimator_, RandomForestClassifier)\n    assert X_resampled.shape == (12, 2)\n    assert y_resampled.shape == (12,)\n\n\ndef test_iht_estimator_pipeline():\n    \"\"\"Check that we can pass a pipeline containing a classifier.\n\n    Checking if we have a classifier should not be based on inheriting from\n    `ClassifierMixin`.\n\n    Non-regression test for:\n    https://github.com/scikit-learn-contrib/imbalanced-learn/pull/1049\n    \"\"\"\n    model = make_pipeline(GradientBoostingClassifier(random_state=RND_SEED))\n    iht = InstanceHardnessThreshold(estimator=model, random_state=RND_SEED)\n    X_resampled, y_resampled = iht.fit_resample(X, Y)\n    assert X_resampled.shape == (12, 2)\n    assert y_resampled.shape == (12,)\n"
  },
  {
    "path": "imblearn/under_sampling/_prototype_selection/tests/test_nearmiss.py",
    "content": "\"\"\"Test the module nearmiss.\"\"\"\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Christos Aridas\n# License: MIT\n\nimport numpy as np\nfrom sklearn.neighbors import NearestNeighbors\nfrom sklearn.utils._testing import assert_array_equal\n\nfrom imblearn.under_sampling import NearMiss\n\nX = np.array(\n    [\n        [1.17737838, -0.2002118],\n        [0.4960075, 0.86130762],\n        [-0.05903827, 0.10947647],\n        [0.91464286, 1.61369212],\n        [-0.54619583, 1.73009918],\n        [-0.60413357, 0.24628718],\n        [0.45713638, 1.31069295],\n        [-0.04032409, 3.01186964],\n        [0.03142011, 0.12323596],\n        [0.50701028, -0.17636928],\n        [-0.80809175, -1.09917302],\n        [-0.20497017, -0.26630228],\n        [0.99272351, -0.11631728],\n        [-1.95581933, 0.69609604],\n        [1.15157493, -1.2981518],\n    ]\n)\nY = np.array([1, 2, 1, 0, 2, 1, 2, 2, 1, 2, 0, 0, 2, 1, 2])\n\nVERSION_NEARMISS = (1, 2, 3)\n\n\ndef test_nm_fit_resample_auto():\n    sampling_strategy = \"auto\"\n    X_gt = [\n        np.array(\n            [\n                [0.91464286, 1.61369212],\n                [-0.80809175, -1.09917302],\n                [-0.20497017, -0.26630228],\n                [-0.05903827, 0.10947647],\n                [0.03142011, 0.12323596],\n                [-0.60413357, 0.24628718],\n                [0.50701028, -0.17636928],\n                [0.4960075, 0.86130762],\n                [0.45713638, 1.31069295],\n            ]\n        ),\n        np.array(\n            [\n                [0.91464286, 1.61369212],\n                [-0.80809175, -1.09917302],\n                [-0.20497017, -0.26630228],\n                [-0.05903827, 0.10947647],\n                [0.03142011, 0.12323596],\n                [-0.60413357, 0.24628718],\n                [0.50701028, -0.17636928],\n                [0.4960075, 0.86130762],\n                [0.45713638, 1.31069295],\n            ]\n        ),\n        
np.array(\n            [\n                [0.91464286, 1.61369212],\n                [-0.80809175, -1.09917302],\n                [-0.20497017, -0.26630228],\n                [1.17737838, -0.2002118],\n                [-0.60413357, 0.24628718],\n                [0.03142011, 0.12323596],\n                [1.15157493, -1.2981518],\n                [-0.54619583, 1.73009918],\n                [0.99272351, -0.11631728],\n            ]\n        ),\n    ]\n    y_gt = [\n        np.array([0, 0, 0, 1, 1, 1, 2, 2, 2]),\n        np.array([0, 0, 0, 1, 1, 1, 2, 2, 2]),\n        np.array([0, 0, 0, 1, 1, 1, 2, 2, 2]),\n    ]\n    for version_idx, version in enumerate(VERSION_NEARMISS):\n        nm = NearMiss(sampling_strategy=sampling_strategy, version=version)\n        X_resampled, y_resampled = nm.fit_resample(X, Y)\n        assert_array_equal(X_resampled, X_gt[version_idx])\n        assert_array_equal(y_resampled, y_gt[version_idx])\n\n\ndef test_nm_fit_resample_float_sampling_strategy():\n    sampling_strategy = {0: 3, 1: 4, 2: 4}\n    X_gt = [\n        np.array(\n            [\n                [-0.20497017, -0.26630228],\n                [-0.80809175, -1.09917302],\n                [0.91464286, 1.61369212],\n                [-0.05903827, 0.10947647],\n                [0.03142011, 0.12323596],\n                [-0.60413357, 0.24628718],\n                [1.17737838, -0.2002118],\n                [0.50701028, -0.17636928],\n                [0.4960075, 0.86130762],\n                [0.45713638, 1.31069295],\n                [0.99272351, -0.11631728],\n            ]\n        ),\n        np.array(\n            [\n                [-0.20497017, -0.26630228],\n                [-0.80809175, -1.09917302],\n                [0.91464286, 1.61369212],\n                [-0.05903827, 0.10947647],\n                [0.03142011, 0.12323596],\n                [-0.60413357, 0.24628718],\n                [1.17737838, -0.2002118],\n                [0.50701028, -0.17636928],\n                
[0.4960075, 0.86130762],\n                [0.45713638, 1.31069295],\n                [0.99272351, -0.11631728],\n            ]\n        ),\n        np.array(\n            [\n                [0.91464286, 1.61369212],\n                [-0.80809175, -1.09917302],\n                [-0.20497017, -0.26630228],\n                [1.17737838, -0.2002118],\n                [-0.60413357, 0.24628718],\n                [0.03142011, 0.12323596],\n                [-0.05903827, 0.10947647],\n                [1.15157493, -1.2981518],\n                [-0.54619583, 1.73009918],\n                [0.99272351, -0.11631728],\n                [0.45713638, 1.31069295],\n            ]\n        ),\n    ]\n    y_gt = [\n        np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]),\n        np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]),\n        np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]),\n    ]\n\n    for version_idx, version in enumerate(VERSION_NEARMISS):\n        nm = NearMiss(sampling_strategy=sampling_strategy, version=version)\n        X_resampled, y_resampled = nm.fit_resample(X, Y)\n        assert_array_equal(X_resampled, X_gt[version_idx])\n        assert_array_equal(y_resampled, y_gt[version_idx])\n\n\ndef test_nm_fit_resample_nn_obj():\n    sampling_strategy = \"auto\"\n    nn = NearestNeighbors(n_neighbors=3)\n    X_gt = [\n        np.array(\n            [\n                [0.91464286, 1.61369212],\n                [-0.80809175, -1.09917302],\n                [-0.20497017, -0.26630228],\n                [-0.05903827, 0.10947647],\n                [0.03142011, 0.12323596],\n                [-0.60413357, 0.24628718],\n                [0.50701028, -0.17636928],\n                [0.4960075, 0.86130762],\n                [0.45713638, 1.31069295],\n            ]\n        ),\n        np.array(\n            [\n                [0.91464286, 1.61369212],\n                [-0.80809175, -1.09917302],\n                [-0.20497017, -0.26630228],\n                [-0.05903827, 0.10947647],\n              
  [0.03142011, 0.12323596],\n                [-0.60413357, 0.24628718],\n                [0.50701028, -0.17636928],\n                [0.4960075, 0.86130762],\n                [0.45713638, 1.31069295],\n            ]\n        ),\n        np.array(\n            [\n                [0.91464286, 1.61369212],\n                [-0.80809175, -1.09917302],\n                [-0.20497017, -0.26630228],\n                [1.17737838, -0.2002118],\n                [-0.60413357, 0.24628718],\n                [0.03142011, 0.12323596],\n                [1.15157493, -1.2981518],\n                [-0.54619583, 1.73009918],\n                [0.99272351, -0.11631728],\n            ]\n        ),\n    ]\n    y_gt = [\n        np.array([0, 0, 0, 1, 1, 1, 2, 2, 2]),\n        np.array([0, 0, 0, 1, 1, 1, 2, 2, 2]),\n        np.array([0, 0, 0, 1, 1, 1, 2, 2, 2]),\n    ]\n    for version_idx, version in enumerate(VERSION_NEARMISS):\n        nm = NearMiss(\n            sampling_strategy=sampling_strategy,\n            version=version,\n            n_neighbors=nn,\n        )\n        X_resampled, y_resampled = nm.fit_resample(X, Y)\n        assert_array_equal(X_resampled, X_gt[version_idx])\n        assert_array_equal(y_resampled, y_gt[version_idx])\n"
  },
  {
    "path": "imblearn/under_sampling/_prototype_selection/tests/test_neighbourhood_cleaning_rule.py",
    "content": "\"\"\"Test the module neighbourhood cleaning rule.\"\"\"\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Christos Aridas\n# License: MIT\n\nfrom collections import Counter\n\nimport numpy as np\nimport pytest\nfrom sklearn.datasets import make_classification\nfrom sklearn.utils._testing import assert_array_equal\n\nfrom imblearn.under_sampling import EditedNearestNeighbours, NeighbourhoodCleaningRule\n\n\n@pytest.fixture(scope=\"module\")\ndef data():\n    return make_classification(\n        n_samples=200,\n        n_features=2,\n        n_informative=2,\n        n_redundant=0,\n        n_repeated=0,\n        n_clusters_per_class=1,\n        n_classes=3,\n        weights=[0.1, 0.3, 0.6],\n        random_state=0,\n    )\n\n\ndef test_ncr_threshold_cleaning(data):\n    \"\"\"Test the effect of the `threshold_cleaning` parameter.\"\"\"\n    X, y = data\n    # with a large `threshold_cleaning`, the algorithm is equivalent to ENN\n    enn = EditedNearestNeighbours()\n    ncr = NeighbourhoodCleaningRule(\n        edited_nearest_neighbours=enn, n_neighbors=10, threshold_cleaning=10\n    )\n\n    enn.fit_resample(X, y)\n    ncr.fit_resample(X, y)\n\n    assert_array_equal(np.sort(enn.sample_indices_), np.sort(ncr.sample_indices_))\n    assert ncr.classes_to_clean_ == []\n\n    # set a threshold that we should consider only the class #2\n    counter = Counter(y)\n    threshold = counter[1] / counter[0]\n    ncr.set_params(threshold_cleaning=threshold)\n    ncr.fit_resample(X, y)\n\n    assert set(ncr.classes_to_clean_) == {2}\n\n    # making the threshold slightly smaller to take into account class #1\n    ncr.set_params(threshold_cleaning=threshold - np.finfo(np.float32).eps)\n    ncr.fit_resample(X, y)\n\n    assert set(ncr.classes_to_clean_) == {1, 2}\n\n\ndef test_ncr_n_neighbors(data):\n    \"\"\"Check the effect of the NN on the cleaning of the second phase.\"\"\"\n    X, y = data\n\n    enn = EditedNearestNeighbours()\n    ncr = 
NeighbourhoodCleaningRule(edited_nearest_neighbours=enn, n_neighbors=3)\n\n    ncr.fit_resample(X, y)\n    sample_indices_3_nn = ncr.sample_indices_\n\n    ncr.set_params(n_neighbors=10).fit_resample(X, y)\n    sample_indices_10_nn = ncr.sample_indices_\n\n    # we should have a more aggressive cleaning when n_neighbors is larger\n    assert len(sample_indices_3_nn) > len(sample_indices_10_nn)\n"
  },
  {
    "path": "imblearn/under_sampling/_prototype_selection/tests/test_one_sided_selection.py",
    "content": "\"\"\"Test the module one-sided selection.\"\"\"\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Christos Aridas\n# License: MIT\n\nimport numpy as np\nimport pytest\nfrom sklearn.datasets import make_classification\nfrom sklearn.neighbors import KNeighborsClassifier\nfrom sklearn.utils._testing import assert_array_equal\n\nfrom imblearn.under_sampling import OneSidedSelection\n\nRND_SEED = 0\nX = np.array(\n    [\n        [-0.3879569, 0.6894251],\n        [-0.09322739, 1.28177189],\n        [-0.77740357, 0.74097941],\n        [0.91542919, -0.65453327],\n        [-0.03852113, 0.40910479],\n        [-0.43877303, 1.07366684],\n        [-0.85795321, 0.82980738],\n        [-0.18430329, 0.52328473],\n        [-0.30126957, -0.66268378],\n        [-0.65571327, 0.42412021],\n        [-0.28305528, 0.30284991],\n        [0.20246714, -0.34727125],\n        [1.06446472, -1.09279772],\n        [0.30543283, -0.02589502],\n        [-0.00717161, 0.00318087],\n    ]\n)\nY = np.array([0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0])\n\n\ndef test_oss_init():\n    oss = OneSidedSelection(random_state=RND_SEED)\n\n    assert oss.n_seeds_S == 1\n    assert oss.n_jobs is None\n    assert oss.random_state == RND_SEED\n\n\ndef test_oss_fit_resample():\n    oss = OneSidedSelection(random_state=RND_SEED)\n    X_resampled, y_resampled = oss.fit_resample(X, Y)\n\n    X_gt = np.array(\n        [\n            [-0.3879569, 0.6894251],\n            [0.91542919, -0.65453327],\n            [-0.65571327, 0.42412021],\n            [1.06446472, -1.09279772],\n            [0.30543283, -0.02589502],\n            [-0.00717161, 0.00318087],\n            [-0.09322739, 1.28177189],\n            [-0.77740357, 0.74097941],\n            [-0.43877303, 1.07366684],\n            [-0.85795321, 0.82980738],\n            [-0.30126957, -0.66268378],\n            [0.20246714, -0.34727125],\n        ]\n    )\n    y_gt = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])\n    
assert_array_equal(X_resampled, X_gt)\n    assert_array_equal(y_resampled, y_gt)\n\n\n@pytest.mark.parametrize(\"n_neighbors\", [1, KNeighborsClassifier(n_neighbors=1)])\ndef test_oss_with_object(n_neighbors):\n    oss = OneSidedSelection(random_state=RND_SEED, n_neighbors=n_neighbors)\n    X_resampled, y_resampled = oss.fit_resample(X, Y)\n\n    X_gt = np.array(\n        [\n            [-0.3879569, 0.6894251],\n            [0.91542919, -0.65453327],\n            [-0.65571327, 0.42412021],\n            [1.06446472, -1.09279772],\n            [0.30543283, -0.02589502],\n            [-0.00717161, 0.00318087],\n            [-0.09322739, 1.28177189],\n            [-0.77740357, 0.74097941],\n            [-0.43877303, 1.07366684],\n            [-0.85795321, 0.82980738],\n            [-0.30126957, -0.66268378],\n            [0.20246714, -0.34727125],\n        ]\n    )\n    y_gt = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])\n    assert_array_equal(X_resampled, X_gt)\n    assert_array_equal(y_resampled, y_gt)\n    knn = 1\n    oss = OneSidedSelection(random_state=RND_SEED, n_neighbors=knn)\n    X_resampled, y_resampled = oss.fit_resample(X, Y)\n    assert_array_equal(X_resampled, X_gt)\n    assert_array_equal(y_resampled, y_gt)\n\n\ndef test_one_sided_selection_multiclass():\n    \"\"\"Check the validity of the fitted attributes `estimators_`.\"\"\"\n    X, y = make_classification(\n        n_samples=1_000,\n        n_classes=4,\n        weights=[0.1, 0.2, 0.2, 0.5],\n        n_clusters_per_class=1,\n        random_state=0,\n    )\n    oss = OneSidedSelection(random_state=RND_SEED)\n    oss.fit_resample(X, y)\n\n    assert len(oss.estimators_) == len(oss.sampling_strategy_)\n    other_classes = []\n    for est in oss.estimators_:\n        assert est.classes_[0] == 0  # minority class\n        assert est.classes_[1] in {1, 2, 3}  # other classes\n        other_classes.append(est.classes_[1])\n    assert len(set(other_classes)) == len(other_classes)\n"
  },
  {
    "path": "imblearn/under_sampling/_prototype_selection/tests/test_random_under_sampler.py",
    "content": "\"\"\"Test the module random under sampler.\"\"\"\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Christos Aridas\n# License: MIT\n\nfrom collections import Counter\nfrom datetime import datetime\n\nimport numpy as np\nimport pytest\nfrom sklearn.datasets import make_classification\nfrom sklearn.utils._testing import assert_array_equal\n\nfrom imblearn.under_sampling import RandomUnderSampler\n\nRND_SEED = 0\nX = np.array(\n    [\n        [0.04352327, -0.20515826],\n        [0.92923648, 0.76103773],\n        [0.20792588, 1.49407907],\n        [0.47104475, 0.44386323],\n        [0.22950086, 0.33367433],\n        [0.15490546, 0.3130677],\n        [0.09125309, -0.85409574],\n        [0.12372842, 0.6536186],\n        [0.13347175, 0.12167502],\n        [0.094035, -2.55298982],\n    ]\n)\nY = np.array([1, 0, 1, 0, 1, 1, 1, 1, 0, 1])\n\n\n@pytest.mark.parametrize(\"as_frame\", [True, False], ids=[\"dataframe\", \"array\"])\ndef test_rus_fit_resample(as_frame):\n    if as_frame:\n        pd = pytest.importorskip(\"pandas\")\n        X_ = pd.DataFrame(X)\n    else:\n        X_ = X\n    rus = RandomUnderSampler(random_state=RND_SEED, replacement=True)\n    X_resampled, y_resampled = rus.fit_resample(X_, Y)\n\n    X_gt = np.array(\n        [\n            [0.92923648, 0.76103773],\n            [0.47104475, 0.44386323],\n            [0.13347175, 0.12167502],\n            [0.09125309, -0.85409574],\n            [0.12372842, 0.6536186],\n            [0.04352327, -0.20515826],\n        ]\n    )\n    y_gt = np.array([0, 0, 0, 1, 1, 1])\n\n    if as_frame:\n        assert hasattr(X_resampled, \"loc\")\n        X_resampled = X_resampled.to_numpy()\n\n    assert_array_equal(X_resampled, X_gt)\n    assert_array_equal(y_resampled, y_gt)\n\n\ndef test_rus_fit_resample_half():\n    sampling_strategy = {0: 3, 1: 6}\n    rus = RandomUnderSampler(\n        sampling_strategy=sampling_strategy,\n        random_state=RND_SEED,\n        replacement=True,\n    
)\n    X_resampled, y_resampled = rus.fit_resample(X, Y)\n\n    X_gt = np.array(\n        [\n            [0.92923648, 0.76103773],\n            [0.47104475, 0.44386323],\n            [0.92923648, 0.76103773],\n            [0.15490546, 0.3130677],\n            [0.15490546, 0.3130677],\n            [0.15490546, 0.3130677],\n            [0.20792588, 1.49407907],\n            [0.15490546, 0.3130677],\n            [0.12372842, 0.6536186],\n        ]\n    )\n    y_gt = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1])\n    assert_array_equal(X_resampled, X_gt)\n    assert_array_equal(y_resampled, y_gt)\n\n\ndef test_multiclass_fit_resample():\n    y = Y.copy()\n    y[5] = 2\n    y[6] = 2\n    rus = RandomUnderSampler(random_state=RND_SEED)\n    X_resampled, y_resampled = rus.fit_resample(X, y)\n    count_y_res = Counter(y_resampled)\n    assert count_y_res[0] == 2\n    assert count_y_res[1] == 2\n    assert count_y_res[2] == 2\n\n\ndef test_random_under_sampling_heterogeneous_data():\n    X_hetero = np.array(\n        [[\"xxx\", 1, 1.0], [\"yyy\", 2, 2.0], [\"zzz\", 3, 3.0]], dtype=object\n    )\n    y = np.array([0, 0, 1])\n    rus = RandomUnderSampler(random_state=RND_SEED)\n    X_res, y_res = rus.fit_resample(X_hetero, y)\n\n    assert X_res.shape[0] == 2\n    assert y_res.shape[0] == 2\n    assert X_res.dtype == object\n\n\ndef test_random_under_sampling_nan_inf():\n    # check that we can undersample even with missing or infinite data\n    # regression tests for #605\n    rng = np.random.RandomState(42)\n    n_not_finite = X.shape[0] // 3\n    row_indices = rng.choice(np.arange(X.shape[0]), size=n_not_finite)\n    col_indices = rng.randint(0, X.shape[1], size=n_not_finite)\n    not_finite_values = rng.choice([np.nan, np.inf], size=n_not_finite)\n\n    X_ = X.copy()\n    X_[row_indices, col_indices] = not_finite_values\n\n    rus = RandomUnderSampler(random_state=0)\n    X_res, y_res = rus.fit_resample(X_, Y)\n\n    assert y_res.shape == (6,)\n    assert X_res.shape == (6, 2)\n  
  assert np.any(~np.isfinite(X_res))\n\n\n@pytest.mark.parametrize(\n    \"sampling_strategy\", [\"auto\", \"majority\", \"not minority\", \"not majority\", \"all\"]\n)\ndef test_random_under_sampler_strings(sampling_strategy):\n    \"\"\"Check that we support all supposed strings as `sampling_strategy` in\n    a sampler inheriting from `BaseUnderSampler`.\"\"\"\n\n    X, y = make_classification(\n        n_samples=100,\n        n_clusters_per_class=1,\n        n_classes=3,\n        weights=[0.1, 0.3, 0.6],\n        random_state=0,\n    )\n    RandomUnderSampler(sampling_strategy=sampling_strategy).fit_resample(X, y)\n\n\ndef test_random_under_sampling_datetime():\n    \"\"\"Check that we don't convert input data and only sample from it.\"\"\"\n    pd = pytest.importorskip(\"pandas\")\n    X = pd.DataFrame({\"label\": [0, 0, 0, 1], \"td\": [datetime.now()] * 4})\n    y = X[\"label\"]\n    rus = RandomUnderSampler(random_state=0)\n    X_res, y_res = rus.fit_resample(X, y)\n\n    pd.testing.assert_series_equal(X_res.dtypes, X.dtypes)\n    pd.testing.assert_index_equal(X_res.index, y_res.index)\n    assert_array_equal(y_res.to_numpy(), np.array([0, 1]))\n\n\ndef test_random_under_sampler_full_nat():\n    \"\"\"Check that we can return timedelta columns full of NaT.\n\n    Non-regression test for:\n    https://github.com/scikit-learn-contrib/imbalanced-learn/issues/1055\n    \"\"\"\n    pd = pytest.importorskip(\"pandas\")\n\n    X = pd.DataFrame(\n        {\n            \"col_str\": [\"abc\", \"def\", \"xyz\"],\n            \"col_timedelta\": pd.to_timedelta([np.nan, np.nan, np.nan]),\n        }\n    )\n    y = np.array([0, 0, 1])\n\n    X_res, y_res = RandomUnderSampler().fit_resample(X, y)\n    assert X_res.shape == (2, 2)\n    assert y_res.shape == (2,)\n\n    assert X_res[\"col_timedelta\"].dtype == \"timedelta64[ns]\"\n"
  },
  {
    "path": "imblearn/under_sampling/_prototype_selection/tests/test_repeated_edited_nearest_neighbours.py",
    "content": "\"\"\"Test the module repeated edited nearest neighbour.\"\"\"\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Christos Aridas\n# License: MIT\n\nimport numpy as np\nimport pytest\nfrom sklearn.neighbors import NearestNeighbors\nfrom sklearn.utils._testing import assert_array_equal\n\nfrom imblearn.under_sampling import RepeatedEditedNearestNeighbours\n\nX = np.array(\n    [\n        [-0.12840393, 0.66446571],\n        [1.32319756, -0.13181616],\n        [0.04296502, -0.37981873],\n        [0.83631853, 0.18569783],\n        [1.02956816, 0.36061601],\n        [1.12202806, 0.33811558],\n        [-0.53171468, -0.53735182],\n        [1.3381556, 0.35956356],\n        [-0.35946678, 0.72510189],\n        [1.32326943, 0.28393874],\n        [2.94290565, -0.13986434],\n        [0.28294738, -1.00125525],\n        [0.34218094, -0.58781961],\n        [-0.88864036, -0.33782387],\n        [-1.10146139, 0.91782682],\n        [-0.7969716, -0.50493969],\n        [0.73489726, 0.43915195],\n        [0.2096964, -0.61814058],\n        [-0.28479268, 0.70459548],\n        [1.84864913, 0.14729596],\n        [1.59068979, -0.96622933],\n        [0.73418199, -0.02222847],\n        [0.50307437, 0.498805],\n        [0.84929742, 0.41042894],\n        [0.62649535, 0.46600596],\n        [0.79270821, -0.41386668],\n        [1.16606871, -0.25641059],\n        [1.57356906, 0.30390519],\n        [1.0304995, -0.16955962],\n        [1.67314371, 0.19231498],\n        [0.98382284, 0.37184502],\n        [0.48921682, -1.38504507],\n        [-0.46226554, -0.50481004],\n        [-0.03918551, -0.68540745],\n        [0.24991051, -1.00864997],\n        [0.80541964, -0.34465185],\n        [0.1732627, -1.61323172],\n        [0.69804044, 0.44810796],\n        [-0.5506368, -0.42072426],\n        [-0.34474418, 0.21969797],\n    ]\n)\nY = np.array(\n    [\n        1,\n        2,\n        2,\n        2,\n        1,\n        1,\n        0,\n        2,\n        1,\n        1,\n        
1,\n        2,\n        2,\n        0,\n        1,\n        2,\n        1,\n        2,\n        1,\n        1,\n        2,\n        2,\n        1,\n        1,\n        1,\n        2,\n        2,\n        2,\n        2,\n        1,\n        1,\n        2,\n        0,\n        2,\n        2,\n        2,\n        2,\n        1,\n        2,\n        0,\n    ]\n)\n\n\ndef test_renn_init():\n    renn = RepeatedEditedNearestNeighbours()\n\n    assert renn.n_neighbors == 3\n    assert renn.kind_sel == \"all\"\n    assert renn.n_jobs is None\n\n\ndef test_renn_iter_wrong():\n    max_iter = -1\n    renn = RepeatedEditedNearestNeighbours(max_iter=max_iter)\n    with pytest.raises(ValueError):\n        renn.fit_resample(X, Y)\n\n\ndef test_renn_fit_resample():\n    renn = RepeatedEditedNearestNeighbours()\n    X_resampled, y_resampled = renn.fit_resample(X, Y)\n\n    X_gt = np.array(\n        [\n            [-0.53171468, -0.53735182],\n            [-0.88864036, -0.33782387],\n            [-0.46226554, -0.50481004],\n            [-0.34474418, 0.21969797],\n            [1.02956816, 0.36061601],\n            [1.12202806, 0.33811558],\n            [0.73489726, 0.43915195],\n            [0.50307437, 0.498805],\n            [0.84929742, 0.41042894],\n            [0.62649535, 0.46600596],\n            [0.98382284, 0.37184502],\n            [0.69804044, 0.44810796],\n            [0.04296502, -0.37981873],\n            [0.28294738, -1.00125525],\n            [0.34218094, -0.58781961],\n            [0.2096964, -0.61814058],\n            [1.59068979, -0.96622933],\n            [0.73418199, -0.02222847],\n            [0.79270821, -0.41386668],\n            [1.16606871, -0.25641059],\n            [1.0304995, -0.16955962],\n            [0.48921682, -1.38504507],\n            [-0.03918551, -0.68540745],\n            [0.24991051, -1.00864997],\n            [0.80541964, -0.34465185],\n            [0.1732627, -1.61323172],\n        ]\n    )\n    y_gt = np.array(\n        [0, 0, 0, 0, 1, 1, 1, 
1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]\n    )\n    assert_array_equal(X_resampled, X_gt)\n    assert_array_equal(y_resampled, y_gt)\n    assert 0 < renn.n_iter_ <= renn.max_iter\n\n\ndef test_renn_fit_resample_mode_object():\n    renn = RepeatedEditedNearestNeighbours(kind_sel=\"mode\")\n    X_resampled, y_resampled = renn.fit_resample(X, Y)\n\n    X_gt = np.array(\n        [\n            [-0.53171468, -0.53735182],\n            [-0.88864036, -0.33782387],\n            [-0.46226554, -0.50481004],\n            [-0.34474418, 0.21969797],\n            [-0.12840393, 0.66446571],\n            [1.02956816, 0.36061601],\n            [1.12202806, 0.33811558],\n            [-0.35946678, 0.72510189],\n            [2.94290565, -0.13986434],\n            [-1.10146139, 0.91782682],\n            [0.73489726, 0.43915195],\n            [-0.28479268, 0.70459548],\n            [1.84864913, 0.14729596],\n            [0.50307437, 0.498805],\n            [0.84929742, 0.41042894],\n            [0.62649535, 0.46600596],\n            [1.67314371, 0.19231498],\n            [0.98382284, 0.37184502],\n            [0.69804044, 0.44810796],\n            [1.32319756, -0.13181616],\n            [0.04296502, -0.37981873],\n            [0.28294738, -1.00125525],\n            [0.34218094, -0.58781961],\n            [0.2096964, -0.61814058],\n            [1.59068979, -0.96622933],\n            [0.73418199, -0.02222847],\n            [0.79270821, -0.41386668],\n            [1.16606871, -0.25641059],\n            [1.0304995, -0.16955962],\n            [0.48921682, -1.38504507],\n            [-0.03918551, -0.68540745],\n            [0.24991051, -1.00864997],\n            [0.80541964, -0.34465185],\n            [0.1732627, -1.61323172],\n        ]\n    )\n    y_gt = np.array(\n        [\n            0,\n            0,\n            0,\n            0,\n            1,\n            1,\n            1,\n            1,\n            1,\n            1,\n            1,\n            1,\n          
  1,\n            1,\n            1,\n            1,\n            1,\n            1,\n            1,\n            2,\n            2,\n            2,\n            2,\n            2,\n            2,\n            2,\n            2,\n            2,\n            2,\n            2,\n            2,\n            2,\n            2,\n            2,\n        ]\n    )\n    assert_array_equal(X_resampled, X_gt)\n    assert_array_equal(y_resampled, y_gt)\n    assert 0 < renn.n_iter_ <= renn.max_iter\n\n\ndef test_renn_fit_resample_mode():\n    nn = NearestNeighbors(n_neighbors=4)\n    renn = RepeatedEditedNearestNeighbours(n_neighbors=nn, kind_sel=\"mode\")\n    X_resampled, y_resampled = renn.fit_resample(X, Y)\n\n    X_gt = np.array(\n        [\n            [-0.53171468, -0.53735182],\n            [-0.88864036, -0.33782387],\n            [-0.46226554, -0.50481004],\n            [-0.34474418, 0.21969797],\n            [-0.12840393, 0.66446571],\n            [1.02956816, 0.36061601],\n            [1.12202806, 0.33811558],\n            [-0.35946678, 0.72510189],\n            [2.94290565, -0.13986434],\n            [-1.10146139, 0.91782682],\n            [0.73489726, 0.43915195],\n            [-0.28479268, 0.70459548],\n            [1.84864913, 0.14729596],\n            [0.50307437, 0.498805],\n            [0.84929742, 0.41042894],\n            [0.62649535, 0.46600596],\n            [1.67314371, 0.19231498],\n            [0.98382284, 0.37184502],\n            [0.69804044, 0.44810796],\n            [1.32319756, -0.13181616],\n            [0.04296502, -0.37981873],\n            [0.28294738, -1.00125525],\n            [0.34218094, -0.58781961],\n            [0.2096964, -0.61814058],\n            [1.59068979, -0.96622933],\n            [0.73418199, -0.02222847],\n            [0.79270821, -0.41386668],\n            [1.16606871, -0.25641059],\n            [1.0304995, -0.16955962],\n            [0.48921682, -1.38504507],\n            [-0.03918551, -0.68540745],\n            [0.24991051, 
-1.00864997],\n            [0.80541964, -0.34465185],\n            [0.1732627, -1.61323172],\n        ]\n    )\n    y_gt = np.array(\n        [\n            0,\n            0,\n            0,\n            0,\n            1,\n            1,\n            1,\n            1,\n            1,\n            1,\n            1,\n            1,\n            1,\n            1,\n            1,\n            1,\n            1,\n            1,\n            1,\n            2,\n            2,\n            2,\n            2,\n            2,\n            2,\n            2,\n            2,\n            2,\n            2,\n            2,\n            2,\n            2,\n            2,\n            2,\n        ]\n    )\n    assert_array_equal(X_resampled, X_gt)\n    assert_array_equal(y_resampled, y_gt)\n    assert 0 < renn.n_iter_ <= renn.max_iter\n\n\n@pytest.mark.parametrize(\n    \"max_iter, n_iter\",\n    [(2, 2), (5, 3)],\n)\ndef test_renn_iter_attribute(max_iter, n_iter):\n    renn = RepeatedEditedNearestNeighbours(max_iter=max_iter)\n    renn.fit_resample(X, Y)\n    assert renn.n_iter_ == n_iter\n"
  },
  {
    "path": "imblearn/under_sampling/_prototype_selection/tests/test_tomek_links.py",
    "content": "\"\"\"Test the module Tomek's links.\"\"\"\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Christos Aridas\n# License: MIT\n\nimport numpy as np\nimport pytest\nfrom sklearn.datasets import make_classification\nfrom sklearn.utils._testing import assert_array_equal\n\nfrom imblearn.under_sampling import TomekLinks\n\nX = np.array(\n    [\n        [0.31230513, 0.1216318],\n        [0.68481731, 0.51935141],\n        [1.34192108, -0.13367336],\n        [0.62366841, -0.21312976],\n        [1.61091956, -0.40283504],\n        [-0.37162401, -2.19400981],\n        [0.74680821, 1.63827342],\n        [0.2184254, 0.24299982],\n        [0.61472253, -0.82309052],\n        [0.19893132, -0.47761769],\n        [1.06514042, -0.0770537],\n        [0.97407872, 0.44454207],\n        [1.40301027, -0.83648734],\n        [-1.20515198, -1.02689695],\n        [-0.27410027, -0.54194484],\n        [0.8381014, 0.44085498],\n        [-0.23374509, 0.18370049],\n        [-0.32635887, -0.29299653],\n        [-0.00288378, 0.84259929],\n        [1.79580611, -0.02219234],\n    ]\n)\nY = np.array([1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0])\n\n\ndef test_tl_init():\n    tl = TomekLinks()\n    assert tl.n_jobs is None\n\n\ndef test_tl_fit_resample():\n    tl = TomekLinks()\n    X_resampled, y_resampled = tl.fit_resample(X, Y)\n\n    X_gt = np.array(\n        [\n            [0.31230513, 0.1216318],\n            [0.68481731, 0.51935141],\n            [1.34192108, -0.13367336],\n            [0.62366841, -0.21312976],\n            [1.61091956, -0.40283504],\n            [-0.37162401, -2.19400981],\n            [0.74680821, 1.63827342],\n            [0.2184254, 0.24299982],\n            [0.61472253, -0.82309052],\n            [0.19893132, -0.47761769],\n            [0.97407872, 0.44454207],\n            [1.40301027, -0.83648734],\n            [-1.20515198, -1.02689695],\n            [-0.23374509, 0.18370049],\n            [-0.32635887, -0.29299653],\n     
       [-0.00288378, 0.84259929],\n            [1.79580611, -0.02219234],\n        ]\n    )\n    y_gt = np.array([1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0])\n    assert_array_equal(X_resampled, X_gt)\n    assert_array_equal(y_resampled, y_gt)\n\n\n@pytest.mark.parametrize(\n    \"sampling_strategy\", [\"auto\", \"majority\", \"not minority\", \"not majority\", \"all\"]\n)\ndef test_tomek_links_strings(sampling_strategy):\n    \"\"\"Check that we support all supposed strings as `sampling_strategy` in\n    a sampler inheriting from `BaseCleaningSampler`.\"\"\"\n\n    X, y = make_classification(\n        n_samples=100,\n        n_clusters_per_class=1,\n        n_classes=3,\n        weights=[0.1, 0.3, 0.6],\n        random_state=0,\n    )\n    TomekLinks(sampling_strategy=sampling_strategy).fit_resample(X, y)\n"
  },
  {
    "path": "imblearn/under_sampling/base.py",
    "content": "\"\"\"\nBase class for the under-sampling method.\n\"\"\"\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# License: MIT\n\nimport numbers\nfrom collections.abc import Mapping\n\nfrom sklearn.utils._param_validation import Interval, StrOptions\n\nfrom imblearn.base import BaseSampler\n\n\nclass BaseUnderSampler(BaseSampler):\n    \"\"\"Base class for under-sampling algorithms.\n\n    Warning: This class should not be used directly. Use the derive classes\n    instead.\n    \"\"\"\n\n    _sampling_type = \"under-sampling\"\n\n    _sampling_strategy_docstring = (\n        \"\"\"sampling_strategy : float, str, dict, callable, default='auto'\n        Sampling information to sample the data set.\n\n        - When ``float``, it corresponds to the desired ratio of the number of\n          samples in the minority class over the number of samples in the\n          majority class after resampling. Therefore, the ratio is expressed as\n          :math:`\\\\alpha_{us} = N_{m} / N_{rM}` where :math:`N_{m}` is the\n          number of samples in the minority class and\n          :math:`N_{rM}` is the number of samples in the majority class\n          after resampling.\n\n          .. warning::\n             ``float`` is only available for **binary** classification. An\n             error is raised for multi-class classification.\n\n        - When ``str``, specify the class targeted by the resampling. The\n          number of samples in the different classes will be equalized.\n          Possible choices are:\n\n            ``'majority'``: resample only the majority class;\n\n            ``'not minority'``: resample all classes but the minority class;\n\n            ``'not majority'``: resample all classes but the majority class;\n\n            ``'all'``: resample all classes;\n\n            ``'auto'``: equivalent to ``'not minority'``.\n\n        - When ``dict``, the keys correspond to the targeted classes. 
The\n          values correspond to the desired number of samples for each targeted\n          class.\n\n        - When callable, a function taking ``y`` and returning a ``dict``. The keys\n          correspond to the targeted classes. The values correspond to the\n          desired number of samples for each class.\n        \"\"\".rstrip()\n    )  # noqa: E501\n\n    _parameter_constraints: dict = {\n        \"sampling_strategy\": [\n            Interval(numbers.Real, 0, 1, closed=\"right\"),\n            StrOptions({\"auto\", \"majority\", \"not minority\", \"not majority\", \"all\"}),\n            Mapping,\n            callable,\n        ],\n    }\n\n\nclass BaseCleaningSampler(BaseSampler):\n    \"\"\"Base class for under-sampling algorithms.\n\n    Warning: This class should not be used directly. Use the derived classes\n    instead.\n    \"\"\"\n\n    _sampling_type = \"clean-sampling\"\n\n    _sampling_strategy_docstring = \"\"\"sampling_strategy : str, list or callable\n        Sampling information to sample the data set.\n\n        - When ``str``, specify the class targeted by the resampling. Note that\n          the number of samples will not be equal in each class. Possible choices\n          are:\n\n            ``'majority'``: resample only the majority class;\n\n            ``'not minority'``: resample all classes but the minority class;\n\n            ``'not majority'``: resample all classes but the majority class;\n\n            ``'all'``: resample all classes;\n\n            ``'auto'``: equivalent to ``'not minority'``.\n\n        - When ``list``, the list contains the classes targeted by the\n          resampling.\n\n        - When callable, a function taking ``y`` and returning a ``dict``. The keys\n          correspond to the targeted classes. 
The values correspond to the\n          desired number of samples for each class.\n        \"\"\".rstrip()\n\n    _parameter_constraints: dict = {\n        \"sampling_strategy\": [\n            Interval(numbers.Real, 0, 1, closed=\"right\"),\n            StrOptions({\"auto\", \"majority\", \"not minority\", \"not majority\", \"all\"}),\n            list,\n            callable,\n        ],\n    }\n"
  },
  {
    "path": "imblearn/utils/__init__.py",
    "content": "\"\"\"\nThe :mod:`imblearn.utils` module includes various utilities.\n\"\"\"\n\nfrom imblearn.utils._docstring import Substitution\nfrom imblearn.utils._validation import (\n    check_neighbors_object,\n    check_sampling_strategy,\n    check_target_type,\n)\n\n__all__ = [\n    \"check_neighbors_object\",\n    \"check_sampling_strategy\",\n    \"check_target_type\",\n    \"Substitution\",\n]\n"
  },
  {
    "path": "imblearn/utils/_docstring.py",
    "content": "\"\"\"Utilities for docstring in imbalanced-learn.\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# License: MIT\n\n\nclass Substitution:\n    \"\"\"Decorate a function's or a class' docstring to perform string\n    substitution on it.\n\n    This decorator should be robust even if obj.__doc__ is None\n    (for example, if -OO was passed to the interpreter)\n    \"\"\"\n\n    def __init__(self, *args, **kwargs):\n        if args and kwargs:\n            raise AssertionError(\"Only positional or keyword args are allowed\")\n\n        self.params = args or kwargs\n\n    def __call__(self, obj):\n        if obj.__doc__:\n            obj.__doc__ = obj.__doc__.format(**self.params)\n        return obj\n\n\n_random_state_docstring = \"\"\"random_state : int, RandomState instance, default=None\n        Control the randomization of the algorithm.\n\n        - If int, ``random_state`` is the seed used by the random number\n          generator;\n        - If ``RandomState`` instance, random_state is the random number\n          generator;\n        - If ``None``, the random number generator is the ``RandomState``\n          instance used by ``np.random``.\n    \"\"\".rstrip()\n\n_n_jobs_docstring = \"\"\"n_jobs : int, default=None\n        Number of CPU cores used during the cross-validation loop.\n        ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.\n        ``-1`` means using all processors. See\n        `Glossary <https://scikit-learn.org/stable/glossary.html#term-n-jobs>`_\n        for more details.\n    \"\"\".rstrip()\n"
  },
  {
    "path": "imblearn/utils/_show_versions.py",
    "content": "\"\"\"\nUtility method which prints system info to help with debugging,\nand filing issues on GitHub.\nAdapted from :func:`sklearn.show_versions`,\nwhich was adapted from :func:`pandas.show_versions`\n\"\"\"\n\n# Author: Alexander L. Hayes <hayesall@iu.edu>\n# License: MIT\n\nfrom imblearn import __version__\n\n\ndef _get_deps_info():\n    \"\"\"Overview of the installed version of main dependencies\n    Returns\n    -------\n    deps_info: dict\n        version information on relevant Python libraries\n    \"\"\"\n    deps = [\n        \"imbalanced-learn\",\n        \"pip\",\n        \"setuptools\",\n        \"numpy\",\n        \"scipy\",\n        \"scikit-learn\",\n        \"Cython\",\n        \"pandas\",\n        \"keras\",\n        \"tensorflow\",\n        \"joblib\",\n    ]\n\n    deps_info = {\n        \"imbalanced-learn\": __version__,\n    }\n\n    from importlib.metadata import PackageNotFoundError, version\n\n    for modname in deps:\n        try:\n            deps_info[modname] = version(modname)\n        except PackageNotFoundError:\n            deps_info[modname] = None\n    return deps_info\n\n\ndef show_versions(github=False):\n    \"\"\"Print debugging information.\n\n    .. 
versionadded:: 0.5\n\n    Parameters\n    ----------\n    github : bool, default=False\n        If true, wrap system info with GitHub markup.\n    \"\"\"\n\n    from sklearn.utils._show_versions import _get_sys_info\n\n    _sys_info = _get_sys_info()\n    _deps_info = _get_deps_info()\n    _github_markup = (\n        \"<details>\"\n        \"<summary>System, Dependency Information</summary>\\n\\n\"\n        \"**System Information**\\n\\n\"\n        \"{0}\\n\"\n        \"**Python Dependencies**\\n\\n\"\n        \"{1}\\n\"\n        \"</details>\"\n    )\n\n    if github:\n        _sys_markup = \"\"\n        _deps_markup = \"\"\n\n        for k, stat in _sys_info.items():\n            _sys_markup += f\"* {k:<10}: `{stat}`\\n\"\n        for k, stat in _deps_info.items():\n            _deps_markup += f\"* {k:<10}: `{stat}`\\n\"\n\n        print(_github_markup.format(_sys_markup, _deps_markup))\n\n    else:\n        print(\"\\nSystem:\")\n        for k, stat in _sys_info.items():\n            print(f\"{k:>11}: {stat}\")\n\n        print(\"\\nPython dependencies:\")\n        for k, stat in _deps_info.items():\n            print(f\"{k:>11}: {stat}\")\n"
  },
  {
    "path": "imblearn/utils/_tags.py",
    "content": "from dataclasses import dataclass, field\n\nfrom sklearn_compat.utils._tags import (\n    ClassifierTags,\n    RegressorTags,\n    TargetTags,\n    TransformerTags,\n)\nfrom sklearn_compat.utils._tags import (\n    InputTags as SklearnInputTags,\n)\n\n\n# tags infrastructure\ndef _dataclass_args():\n    return {\"slots\": True}\n\n\n@dataclass(**_dataclass_args())\nclass InputTags(SklearnInputTags):\n    \"\"\"Tags for the input data.\n\n    Parameters\n    ----------\n    one_d_array : bool, default=False\n        Whether the input can be a 1D array.\n\n    two_d_array : bool, default=True\n        Whether the input can be a 2D array. Note that most common\n        tests currently run only if this flag is set to ``True``.\n\n    three_d_array : bool, default=False\n        Whether the input can be a 3D array.\n\n    sparse : bool, default=False\n        Whether the input can be a sparse matrix.\n\n    categorical : bool, default=False\n        Whether the input can be categorical.\n\n    string : bool, default=False\n        Whether the input can be an array-like of strings.\n\n    dict : bool, default=False\n        Whether the input can be a dictionary.\n\n    positive_only : bool, default=False\n        Whether the estimator requires positive X.\n\n    allow_nan : bool, default=False\n        Whether the estimator supports data with missing values encoded as `np.nan`.\n\n    pairwise : bool, default=False\n        This boolean attribute indicates whether the data (`X`),\n        :term:`fit` and similar methods consists of pairwise measures\n        over samples rather than a feature representation for each\n        sample.  It is usually `True` where an estimator has a\n        `metric` or `affinity` or `kernel` parameter with value\n        'precomputed'. 
Its primary purpose is to support a\n        :term:`meta-estimator` or a cross validation procedure that\n        extracts a sub-sample of data intended for a pairwise\n        estimator, where the data needs to be indexed on both axes.\n        Specifically, this tag is used by\n        `sklearn.utils.metaestimators._safe_split` to slice rows and\n        columns.\n    \"\"\"\n\n    one_d_array: bool = False\n    two_d_array: bool = True\n    three_d_array: bool = False\n    sparse: bool = False\n    categorical: bool = False\n    string: bool = False\n    dict: bool = False\n    positive_only: bool = False\n    allow_nan: bool = False\n    pairwise: bool = False\n    dataframe: bool = False\n\n\n@dataclass(**_dataclass_args())\nclass SamplerTags:\n    \"\"\"Tags for the sampler.\n\n    Parameters\n    ----------\n    sample_indices : bool, default=False\n        Whether the sampler returns the indices of the samples that were\n        selected.\n    \"\"\"\n\n    sample_indices: bool = False\n\n\n@dataclass(**_dataclass_args())\nclass Tags:\n    \"\"\"Tags for the estimator.\n\n    See :ref:`estimator_tags` for more information.\n\n    Parameters\n    ----------\n    estimator_type : str or None\n        The type of the estimator. Can be one of:\n        - \"classifier\"\n        - \"regressor\"\n        - \"transformer\"\n        - \"clusterer\"\n        - \"outlier_detector\"\n        - \"density_estimator\"\n\n    target_tags : :class:`TargetTags`\n        The target(y) tags.\n\n    transformer_tags : :class:`TransformerTags` or None\n        The transformer tags.\n\n    classifier_tags : :class:`ClassifierTags` or None\n        The classifier tags.\n\n    regressor_tags : :class:`RegressorTags` or None\n        The regressor tags.\n\n    array_api_support : bool, default=False\n        Whether the estimator supports Array API compatible inputs.\n\n    no_validation : bool, default=False\n        Whether the estimator skips input-validation. 
This is only meant for\n        stateless and dummy transformers!\n\n    non_deterministic : bool, default=False\n        Whether the estimator is not deterministic given a fixed ``random_state``.\n\n    requires_fit : bool, default=True\n        Whether the estimator requires to be fitted before calling one of\n        `transform`, `predict`, `predict_proba`, or `decision_function`.\n\n    _skip_test : bool, default=False\n        Whether to skip common tests entirely. Don't use this unless\n        you have a *very good* reason.\n\n    input_tags : :class:`InputTags`\n        The input data(X) tags.\n    \"\"\"\n\n    estimator_type: str | None\n    target_tags: TargetTags\n    transformer_tags: TransformerTags | None = None\n    classifier_tags: ClassifierTags | None = None\n    regressor_tags: RegressorTags | None = None\n    array_api_support: bool = False\n    no_validation: bool = False\n    non_deterministic: bool = False\n    requires_fit: bool = True\n    _skip_test: bool = False\n    input_tags: InputTags = field(default_factory=InputTags)\n    sampler_tags: SamplerTags | None = None\n\n\ndef get_tags(estimator):\n    \"\"\"Get estimator tags in a consistent format across different sklearn versions.\n\n    This function provides compatibility between sklearn versions before and after 1.6.\n    It returns either a Tags object (sklearn >= 1.6) or a converted Tags object from\n    the dictionary format (sklearn < 1.6) containing metadata about the estimator's\n    requirements and capabilities.\n\n    Parameters\n    ----------\n    estimator : estimator object\n        A scikit-learn estimator instance.\n\n    Returns\n    -------\n    tags : Tags\n        An object containing metadata about the estimator's requirements and\n        capabilities (e.g., input types, fitting requirements, classifier/regressor\n        specific tags).\n    \"\"\"\n    try:\n        from sklearn.utils._tags import get_tags\n\n        return get_tags(estimator)\n    except 
ImportError:\n        from sklearn.utils._tags import _safe_tags\n\n        return _to_new_tags(_safe_tags(estimator), estimator)\n\n\ndef _to_new_tags(old_tags, estimator=None):\n    \"\"\"Utility function convert old tags (dictionary) to new tags (dataclass).\"\"\"\n    input_tags = InputTags(\n        one_d_array=\"1darray\" in old_tags[\"X_types\"],\n        two_d_array=\"2darray\" in old_tags[\"X_types\"],\n        three_d_array=\"3darray\" in old_tags[\"X_types\"],\n        sparse=\"sparse\" in old_tags[\"X_types\"],\n        categorical=\"categorical\" in old_tags[\"X_types\"],\n        string=\"string\" in old_tags[\"X_types\"],\n        dict=\"dict\" in old_tags[\"X_types\"],\n        positive_only=old_tags[\"requires_positive_X\"],\n        allow_nan=old_tags[\"allow_nan\"],\n        pairwise=old_tags[\"pairwise\"],\n        dataframe=\"dataframe\" in old_tags[\"X_types\"],\n    )\n    target_tags = TargetTags(\n        required=old_tags[\"requires_y\"],\n        one_d_labels=\"1dlabels\" in old_tags[\"X_types\"],\n        two_d_labels=\"2dlabels\" in old_tags[\"X_types\"],\n        positive_only=old_tags[\"requires_positive_y\"],\n        multi_output=old_tags[\"multioutput\"] or old_tags[\"multioutput_only\"],\n        single_output=not old_tags[\"multioutput_only\"],\n    )\n    if estimator is not None and (\n        hasattr(estimator, \"transform\") or hasattr(estimator, \"fit_transform\")\n    ):\n        transformer_tags = TransformerTags(\n            preserves_dtype=old_tags[\"preserves_dtype\"],\n        )\n    else:\n        transformer_tags = None\n    estimator_type = getattr(estimator, \"_estimator_type\", None)\n    if estimator_type == \"classifier\":\n        classifier_tags = ClassifierTags(\n            poor_score=old_tags[\"poor_score\"],\n            multi_class=not old_tags[\"binary_only\"],\n            multi_label=old_tags[\"multilabel\"],\n        )\n    else:\n        classifier_tags = None\n    if estimator_type == 
\"regressor\":\n        regressor_tags = RegressorTags(\n            poor_score=old_tags[\"poor_score\"],\n            multi_label=old_tags[\"multilabel\"],\n        )\n    else:\n        regressor_tags = None\n\n    if estimator_type == \"sampler\":\n        sampler_tags = SamplerTags(\n            sample_indices=old_tags.get(\"sample_indices\", False),\n        )\n    else:\n        sampler_tags = None\n\n    return Tags(\n        estimator_type=estimator_type,\n        target_tags=target_tags,\n        transformer_tags=transformer_tags,\n        classifier_tags=classifier_tags,\n        regressor_tags=regressor_tags,\n        sampler_tags=sampler_tags,\n        input_tags=input_tags,\n        # Array-API was introduced in 1.3, we need to default to False if not inside\n        # the old-tags.\n        array_api_support=old_tags.get(\"array_api_support\", False),\n        no_validation=old_tags[\"no_validation\"],\n        non_deterministic=old_tags[\"non_deterministic\"],\n        requires_fit=old_tags[\"requires_fit\"],\n        _skip_test=old_tags[\"_skip_test\"],\n    )\n"
  },
  {
    "path": "imblearn/utils/_test_common/__init__.py",
    "content": ""
  },
  {
    "path": "imblearn/utils/_test_common/instance_generator.py",
    "content": "# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Christos Aridas\n# License: MIT\n\nimport re\nimport warnings\nfrom contextlib import suppress\nfrom functools import partial\nfrom inspect import isfunction\n\nfrom sklearn import clone, config_context\nfrom sklearn.exceptions import SkipTestWarning\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.tree import DecisionTreeClassifier\nfrom sklearn.utils._testing import SkipTest\nfrom sklearn.utils.fixes import parse_version\nfrom sklearn_compat._sklearn_compat import sklearn_version\n\nfrom imblearn.combine import SMOTEENN, SMOTETomek\nfrom imblearn.ensemble import (\n    BalancedBaggingClassifier,\n    BalancedRandomForestClassifier,\n    EasyEnsembleClassifier,\n    RUSBoostClassifier,\n)\nfrom imblearn.over_sampling import (\n    ADASYN,\n    SMOTE,\n    SMOTEN,\n    SMOTENC,\n    SVMSMOTE,\n    BorderlineSMOTE,\n    KMeansSMOTE,\n    RandomOverSampler,\n)\nfrom imblearn.pipeline import Pipeline\nfrom imblearn.under_sampling import (\n    ClusterCentroids,\n    CondensedNearestNeighbour,\n    InstanceHardnessThreshold,\n    NearMiss,\n    OneSidedSelection,\n    RandomUnderSampler,\n)\nfrom imblearn.utils.testing import all_estimators\n\n# The following dictionary is to indicate constructor arguments suitable for the test\n# suite, which uses very small datasets, and is intended to run rather quickly.\nINIT_PARAMS = {\n    # estimator\n    BalancedBaggingClassifier: dict(random_state=42),\n    BalancedRandomForestClassifier: dict(random_state=42),\n    EasyEnsembleClassifier: [\n        # AdaBoostClassifier does not allow nan values\n        dict(random_state=42),\n        # DecisionTreeClassifier allows nan values\n        dict(estimator=DecisionTreeClassifier(random_state=42), random_state=42),\n    ],\n    Pipeline: dict(\n        steps=[\n            (\"sampler\", RandomUnderSampler(random_state=0)),\n            (\"logistic\", LogisticRegression()),\n        
]\n    ),\n    # over-sampling\n    ADASYN: dict(random_state=42),\n    BorderlineSMOTE: dict(random_state=42),\n    KMeansSMOTE: dict(random_state=0),\n    RandomOverSampler: dict(random_state=42),\n    SMOTE: dict(random_state=42),\n    SMOTEN: dict(random_state=42),\n    SMOTENC: dict(categorical_features=[0], random_state=42),\n    SVMSMOTE: dict(random_state=42),\n    # under-sampling\n    ClusterCentroids: dict(random_state=42),\n    CondensedNearestNeighbour: dict(random_state=42),\n    InstanceHardnessThreshold: dict(random_state=42),\n    NearMiss: [dict(version=1), dict(version=2), dict(version=3)],\n    OneSidedSelection: dict(random_state=42),\n    RandomUnderSampler: dict(random_state=42),\n    # combination\n    SMOTEENN: dict(random_state=42),\n    SMOTETomek: dict(random_state=42),\n}\n\n# This dictionary stores parameters for specific checks. It also enables running the\n# same check with multiple instances of the same estimator with different parameters.\n# The special key \"*\" allows to apply the parameters to all checks.\n# TODO(devtools): allow third-party developers to pass test specific params to checks\nPER_ESTIMATOR_CHECK_PARAMS: dict = {\n    Pipeline: {\n        \"check_classifiers_with_encoded_labels\": dict(\n            sampler__sampling_strategy={\"setosa\": 20, \"virginica\": 20}\n        )\n    }\n}\n\nSKIPPED_ESTIMATORS = [SMOTENC]\n\n\ndef _tested_estimators(type_filter=None):\n    for _, Estimator in all_estimators(type_filter=type_filter):\n        with suppress(SkipTest):\n            yield from _construct_instances(Estimator)\n\n\ndef _construct_instances(Estimator):\n    \"\"\"Construct Estimator instances if possible.\n\n    If parameter sets in INIT_PARAMS are provided, use them. 
If there is a list\n    of parameter sets, return one instance for each set.\n    \"\"\"\n    if Estimator in SKIPPED_ESTIMATORS:\n        msg = f\"Can't instantiate estimator {Estimator.__name__}\"\n        # raise additional warning to be shown by pytest\n        warnings.warn(msg, SkipTestWarning)\n        raise SkipTest(msg)\n\n    if Estimator in INIT_PARAMS:\n        param_sets = INIT_PARAMS[Estimator]\n        if not isinstance(param_sets, list):\n            param_sets = [param_sets]\n        for params in param_sets:\n            est = Estimator(**params)\n            yield est\n    else:\n        yield Estimator()\n\n\ndef _get_check_estimator_ids(obj):\n    \"\"\"Create pytest ids for checks.\n\n    When `obj` is an estimator, this returns the pprint version of the\n    estimator (with `print_changed_only=True`). When `obj` is a function, the\n    name of the function is returned with its keyword arguments.\n\n    `_get_check_estimator_ids` is designed to be used as the `id` in\n    `pytest.mark.parametrize` where `check_estimator(..., generate_only=True)`\n    is yielding estimators and checks.\n\n    Parameters\n    ----------\n    obj : estimator or function\n        Items generated by `check_estimator`.\n\n    Returns\n    -------\n    id : str or None\n\n    See Also\n    --------\n    check_estimator\n    \"\"\"\n    if isfunction(obj):\n        return obj.__name__\n    if isinstance(obj, partial):\n        if not obj.keywords:\n            return obj.func.__name__\n        kwstring = \",\".join([f\"{k}={v}\" for k, v in obj.keywords.items()])\n        return f\"{obj.func.__name__}({kwstring})\"\n    if hasattr(obj, \"get_params\"):\n        with config_context(print_changed_only=True):\n            return re.sub(r\"\\s\", \"\", str(obj))\n\n\ndef _yield_instances_for_check(check, estimator_orig):\n    \"\"\"Yield instances for a check.\n\n    For most estimators, this is a no-op.\n\n    For estimators which have an entry in 
PER_ESTIMATOR_CHECK_PARAMS, this will yield\n    an estimator for each parameter set in PER_ESTIMATOR_CHECK_PARAMS[estimator].\n    \"\"\"\n    # TODO(devtools): enable this behavior for third party estimators as well\n    if type(estimator_orig) not in PER_ESTIMATOR_CHECK_PARAMS:\n        yield estimator_orig\n        return\n\n    check_params = PER_ESTIMATOR_CHECK_PARAMS[type(estimator_orig)]\n\n    try:\n        check_name = check.__name__\n    except AttributeError:\n        # partial tests\n        check_name = check.func.__name__\n\n    if check_name not in check_params:\n        yield estimator_orig\n        return\n\n    param_set = check_params[check_name]\n    if isinstance(param_set, dict):\n        param_set = [param_set]\n\n    for params in param_set:\n        estimator = clone(estimator_orig)\n        estimator.set_params(**params)\n        yield estimator\n\n\nPER_ESTIMATOR_XFAIL_CHECKS = {\n    BalancedRandomForestClassifier: {\n        \"check_sample_weight_equivalence\": \"FIXME\",\n        \"check_sample_weight_equivalence_on_sparse_data\": \"FIXME\",\n        \"check_sample_weight_equivalence_on_dense_data\": \"FIXME\",\n    },\n    NearMiss: {\n        \"check_samplers_fit_resample\": \"FIXME\",\n    },\n    Pipeline: {\n        \"check_classifiers_train\": \"FIXME\",\n        \"check_supervised_y_2d\": \"FIXME\",\n        \"check_dont_overwrite_parameters\": (\n            \"Pipeline changes the `steps` parameter, which it shouldn't. \"\n            \"Therefore this test is x-fail until we fix this.\"\n        ),\n        \"check_estimators_overwrite_params\": (\n            \"Pipeline changes the `steps` parameter, which it shouldn't. 
\"\n            \"Therefore this test is x-fail until we fix this.\"\n        ),\n    },\n    RUSBoostClassifier: {\n        \"check_sample_weight_equivalence\": \"FIXME\",\n        \"check_sample_weight_equivalence_on_sparse_data\": \"FIXME\",\n        \"check_sample_weight_equivalence_on_dense_data\": \"FIXME\",\n        \"check_estimator_sparse_data\": \"FIXME\",\n        \"check_estimator_sparse_matrix\": \"FIXME\",\n        \"check_estimator_sparse_array\": \"FIXME\",\n    },\n}\n\nif sklearn_version < parse_version(\"1.4\"):\n    for _, Estimator in all_estimators():\n        if Estimator in PER_ESTIMATOR_XFAIL_CHECKS:\n            PER_ESTIMATOR_XFAIL_CHECKS[Estimator][\"check_estimators_pickle\"] = \"FIXME\"\n        else:\n            PER_ESTIMATOR_XFAIL_CHECKS[Estimator] = {\"check_estimators_pickle\": \"FIXME\"}\n\n\ndef _get_expected_failed_checks(estimator):\n    \"\"\"Get the expected failed checks for all estimators in scikit-learn.\"\"\"\n    failed_checks = PER_ESTIMATOR_XFAIL_CHECKS.get(type(estimator), {})\n    return failed_checks\n"
  },
  {
    "path": "imblearn/utils/_validation.py",
    "content": "\"\"\"Utilities for input validation\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# License: MIT\n\nimport warnings\nfrom collections import OrderedDict\nfrom functools import wraps\nfrom inspect import Parameter, signature\nfrom numbers import Integral, Real\n\nimport numpy as np\nfrom scipy.sparse import issparse\nfrom sklearn.base import clone\nfrom sklearn.neighbors import NearestNeighbors\nfrom sklearn.utils import column_or_1d\nfrom sklearn.utils.multiclass import type_of_target\nfrom sklearn.utils.validation import _num_samples\nfrom sklearn_compat.utils._dataframe import is_pandas_df\nfrom sklearn_compat.utils.validation import check_array\n\nSAMPLING_KIND = (\n    \"over-sampling\",\n    \"under-sampling\",\n    \"clean-sampling\",\n    \"ensemble\",\n    \"bypass\",\n)\nTARGET_KIND = (\"binary\", \"multiclass\", \"multilabel-indicator\")\n\n\nclass ArraysTransformer:\n    \"\"\"A class to convert sampler output arrays to their original types.\"\"\"\n\n    def __init__(self, X, y):\n        self.x_props = self._gets_props(X)\n        self.y_props = self._gets_props(y)\n\n    def transform(self, X, y):\n        X = self._transfrom_one(X, self.x_props)\n        y = self._transfrom_one(y, self.y_props)\n        if self.x_props[\"type\"].lower() == \"dataframe\" and self.y_props[\n            \"type\"\n        ].lower() in {\"series\", \"dataframe\"}:\n            # We lost the y.index during resampling. 
We can safely use X.index to align\n            # them.\n            y.index = X.index\n        return X, y\n\n    def _gets_props(self, array):\n        props = {}\n        props[\"type\"] = array.__class__.__name__\n        props[\"columns\"] = getattr(array, \"columns\", None)\n        props[\"name\"] = getattr(array, \"name\", None)\n        props[\"dtypes\"] = getattr(array, \"dtypes\", None)\n        return props\n\n    def _transfrom_one(self, array, props):\n        type_ = props[\"type\"].lower()\n        if type_ == \"list\":\n            ret = array.tolist()\n        elif type_ == \"dataframe\":\n            import pandas as pd\n\n            if issparse(array):\n                ret = pd.DataFrame.sparse.from_spmatrix(array, columns=props[\"columns\"])\n            else:\n                ret = pd.DataFrame(array, columns=props[\"columns\"])\n\n            try:\n                ret = ret.astype(props[\"dtypes\"])\n            except TypeError:\n                # We special case the following error:\n                # https://github.com/scikit-learn-contrib/imbalanced-learn/issues/1055\n                # There is no easy way to have a generic workaround. 
Here, we detect\n                # that we have a column with only null values that is datetime64\n                # (resulting from the np.vstack of the resampling).\n                for col in ret.columns:\n                    if (\n                        ret[col].isnull().all()\n                        and ret[col].dtype == \"datetime64[ns]\"\n                        and props[\"dtypes\"][col] == \"timedelta64[ns]\"\n                    ):\n                        ret[col] = pd.to_timedelta([\"NaT\"] * len(ret[col]))\n                # try again\n                ret = ret.astype(props[\"dtypes\"])\n        elif type_ == \"series\":\n            import pandas as pd\n\n            ret = pd.Series(array, dtype=props[\"dtypes\"], name=props[\"name\"])\n        else:\n            ret = array\n        return ret\n\n\ndef _is_neighbors_object(estimator):\n    \"\"\"Check that the estimator exposes a KNeighborsMixin-like API.\n\n    A KNeighborsMixin-like API exposes the following methods: (i) `kneighbors`,\n    (ii) `kneighbors_graph`.\n\n    Parameters\n    ----------\n    estimator : object\n        A scikit-learn compatible estimator.\n\n    Returns\n    -------\n    is_neighbors_object : bool\n        True if the estimator exposes a KNeighborsMixin-like API.\n    \"\"\"\n    neighbors_attributes = [\"kneighbors\", \"kneighbors_graph\"]\n    return all(hasattr(estimator, attr) for attr in neighbors_attributes)\n\n\ndef check_neighbors_object(nn_name, nn_object, additional_neighbor=0):\n    \"\"\"Check that the object is consistent with a k-nearest neighbors estimator.\n\n    Several methods in `imblearn` rely on k nearest neighbors. These objects\n    can be passed at initialisation as an integer or as an object that has\n    KNeighborsMixin-like attributes. 
This utility will create or clone said\n    object, ensuring it is KNeighbors-like.\n\n    Parameters\n    ----------\n    nn_name : str\n        The name associated to the object to raise an error if needed.\n\n    nn_object : int or KNeighborsMixin\n        The object to be checked.\n\n    additional_neighbor : int, default=0\n        Number of extra neighbors added to ``nn_object`` when it is an integer.\n\n    Returns\n    -------\n    nn_object : KNeighborsMixin\n        The k-NN object.\n    \"\"\"\n    if isinstance(nn_object, Integral):\n        return NearestNeighbors(n_neighbors=nn_object + additional_neighbor)\n    # _is_neighbors_object(nn_object)\n    return clone(nn_object)\n\n\ndef _count_class_sample(y):\n    unique, counts = np.unique(y, return_counts=True)\n    return dict(zip(unique, counts))\n\n\ndef check_target_type(y, indicate_one_vs_all=False):\n    \"\"\"Check that the target type conforms to the current samplers.\n\n    The current samplers should be compatible with ``'binary'``,\n    ``'multilabel-indicator'`` and ``'multiclass'`` targets only.\n\n    Parameters\n    ----------\n    y : ndarray\n        The array containing the target.\n\n    indicate_one_vs_all : bool, default=False\n        Whether to indicate if the targets are encoded in a one-vs-all fashion.\n\n    Returns\n    -------\n    y : ndarray\n        The returned target.\n\n    is_one_vs_all : bool, optional\n        Indicate if the target was originally encoded in a one-vs-all fashion.\n        Only returned if ``indicate_one_vs_all=True``.\n    \"\"\"\n    type_y = type_of_target(y)\n    if type_y == \"multilabel-indicator\":\n        if np.any(y.sum(axis=1) > 1):\n            raise ValueError(\n                \"Imbalanced-learn currently supports binary, multiclass and \"\n                \"binarized encoded multiclass targets. 
Multilabel and \"\n                \"multioutput targets are not supported.\"\n            )\n        y = y.argmax(axis=1)\n    else:\n        y = column_or_1d(y)\n\n    return (y, type_y == \"multilabel-indicator\") if indicate_one_vs_all else y\n\n\ndef _sampling_strategy_all(y, sampling_type):\n    \"\"\"Returns sampling target by targeting all classes.\"\"\"\n    target_stats = _count_class_sample(y)\n    if sampling_type == \"over-sampling\":\n        n_sample_majority = max(target_stats.values())\n        sampling_strategy = {\n            key: n_sample_majority - value for (key, value) in target_stats.items()\n        }\n    elif sampling_type == \"under-sampling\" or sampling_type == \"clean-sampling\":\n        n_sample_minority = min(target_stats.values())\n        sampling_strategy = {key: n_sample_minority for key in target_stats.keys()}\n    else:\n        raise NotImplementedError\n\n    return sampling_strategy\n\n\ndef _sampling_strategy_majority(y, sampling_type):\n    \"\"\"Returns sampling target by targeting the majority class only.\"\"\"\n    if sampling_type == \"over-sampling\":\n        raise ValueError(\n            \"'sampling_strategy'='majority' cannot be used with over-sampler.\"\n        )\n    elif sampling_type == \"under-sampling\" or sampling_type == \"clean-sampling\":\n        target_stats = _count_class_sample(y)\n        class_majority = max(target_stats, key=target_stats.get)\n        n_sample_minority = min(target_stats.values())\n        sampling_strategy = {\n            key: n_sample_minority\n            for key in target_stats.keys()\n            if key == class_majority\n        }\n    else:\n        raise NotImplementedError\n\n    return sampling_strategy\n\n\ndef _sampling_strategy_not_majority(y, sampling_type):\n    \"\"\"Returns sampling target by targeting all classes but not the\n    majority.\"\"\"\n    target_stats = _count_class_sample(y)\n    if sampling_type == \"over-sampling\":\n        n_sample_majority 
= max(target_stats.values())\n        class_majority = max(target_stats, key=target_stats.get)\n        sampling_strategy = {\n            key: n_sample_majority - value\n            for (key, value) in target_stats.items()\n            if key != class_majority\n        }\n    elif sampling_type == \"under-sampling\" or sampling_type == \"clean-sampling\":\n        n_sample_minority = min(target_stats.values())\n        class_majority = max(target_stats, key=target_stats.get)\n        sampling_strategy = {\n            key: n_sample_minority\n            for key in target_stats.keys()\n            if key != class_majority\n        }\n    else:\n        raise NotImplementedError\n\n    return sampling_strategy\n\n\ndef _sampling_strategy_not_minority(y, sampling_type):\n    \"\"\"Returns sampling target by targeting all classes but not the\n    minority.\"\"\"\n    target_stats = _count_class_sample(y)\n    if sampling_type == \"over-sampling\":\n        n_sample_majority = max(target_stats.values())\n        class_minority = min(target_stats, key=target_stats.get)\n        sampling_strategy = {\n            key: n_sample_majority - value\n            for (key, value) in target_stats.items()\n            if key != class_minority\n        }\n    elif sampling_type == \"under-sampling\" or sampling_type == \"clean-sampling\":\n        n_sample_minority = min(target_stats.values())\n        class_minority = min(target_stats, key=target_stats.get)\n        sampling_strategy = {\n            key: n_sample_minority\n            for key in target_stats.keys()\n            if key != class_minority\n        }\n    else:\n        raise NotImplementedError\n\n    return sampling_strategy\n\n\ndef _sampling_strategy_minority(y, sampling_type):\n    \"\"\"Returns sampling target by targeting the minority class only.\"\"\"\n    target_stats = _count_class_sample(y)\n    if sampling_type == \"over-sampling\":\n        n_sample_majority = max(target_stats.values())\n        
class_minority = min(target_stats, key=target_stats.get)\n        sampling_strategy = {\n            key: n_sample_majority - value\n            for (key, value) in target_stats.items()\n            if key == class_minority\n        }\n    elif sampling_type == \"under-sampling\" or sampling_type == \"clean-sampling\":\n        raise ValueError(\n            \"'sampling_strategy'='minority' cannot be used with\"\n            \" under-sampler and clean-sampler.\"\n        )\n    else:\n        raise NotImplementedError\n\n    return sampling_strategy\n\n\ndef _sampling_strategy_auto(y, sampling_type):\n    \"\"\"Returns sampling target for 'auto': 'not majority' for over-sampling and\n    'not minority' for under-sampling and clean-sampling.\"\"\"\n    if sampling_type == \"over-sampling\":\n        return _sampling_strategy_not_majority(y, sampling_type)\n    elif sampling_type == \"under-sampling\" or sampling_type == \"clean-sampling\":\n        return _sampling_strategy_not_minority(y, sampling_type)\n\n\ndef _sampling_strategy_dict(sampling_strategy, y, sampling_type):\n    \"\"\"Returns sampling target by converting the dictionary depending on the\n    sampling type.\"\"\"\n    target_stats = _count_class_sample(y)\n    # check that all keys in sampling_strategy are also in y\n    set_diff_sampling_strategy_target = set(sampling_strategy.keys()) - set(\n        target_stats.keys()\n    )\n    if len(set_diff_sampling_strategy_target) > 0:\n        raise ValueError(\n            f\"The {set_diff_sampling_strategy_target} target class is/are not \"\n            \"present in the data.\"\n        )\n    # check that there is no negative number\n    if any(n_samples < 0 for n_samples in sampling_strategy.values()):\n        raise ValueError(\n            \"The number of samples in a class cannot be negative. \"\n            f\"'sampling_strategy' contains some negative value: {sampling_strategy}\"\n        )\n    sampling_strategy_ = {}\n    if sampling_type == \"over-sampling\":\n        for class_sample, 
n_samples in sampling_strategy.items():\n            if n_samples < target_stats[class_sample]:\n                raise ValueError(\n                    \"With over-sampling methods, the number\"\n                    \" of samples in a class should be greater\"\n                    \" or equal to the original number of samples.\"\n                    f\" Originally, there is {target_stats[class_sample]} \"\n                    f\"samples and {n_samples} samples are asked.\"\n                )\n            sampling_strategy_[class_sample] = n_samples - target_stats[class_sample]\n    elif sampling_type == \"under-sampling\":\n        for class_sample, n_samples in sampling_strategy.items():\n            if n_samples > target_stats[class_sample]:\n                raise ValueError(\n                    \"With under-sampling methods, the number of\"\n                    \" samples in a class should be less or equal\"\n                    \" to the original number of samples.\"\n                    f\" Originally, there is {target_stats[class_sample]} \"\n                    f\"samples and {n_samples} samples are asked.\"\n                )\n            sampling_strategy_[class_sample] = n_samples\n    elif sampling_type == \"clean-sampling\":\n        raise ValueError(\n            \"'sampling_strategy' as a dict for cleaning methods is \"\n            \"not supported. 
Please give a list of the classes to be \"\n            \"targeted by the sampling.\"\n        )\n    else:\n        raise NotImplementedError\n\n    return sampling_strategy_\n\n\ndef _sampling_strategy_list(sampling_strategy, y, sampling_type):\n    \"\"\"With cleaning methods, sampling_strategy can be a list to target the\n    class of interest.\"\"\"\n    if sampling_type != \"clean-sampling\":\n        raise ValueError(\n            \"'sampling_strategy' cannot be a list for samplers \"\n            \"which are not cleaning methods.\"\n        )\n\n    target_stats = _count_class_sample(y)\n    # check that all keys in sampling_strategy are also in y\n    set_diff_sampling_strategy_target = set(sampling_strategy) - set(\n        target_stats.keys()\n    )\n    if len(set_diff_sampling_strategy_target) > 0:\n        raise ValueError(\n            f\"The {set_diff_sampling_strategy_target} target class is/are not \"\n            \"present in the data.\"\n        )\n\n    return {\n        class_sample: min(target_stats.values()) for class_sample in sampling_strategy\n    }\n\n\ndef _sampling_strategy_float(sampling_strategy, y, sampling_type):\n    \"\"\"Take a proportion of the majority (over-sampling) or minority\n    (under-sampling) class in binary classification.\"\"\"\n    type_y = type_of_target(y)\n    if type_y != \"binary\":\n        raise ValueError(\n            '\"sampling_strategy\" can be a float only when the type '\n            \"of target is binary. 
For multi-class, use a dict.\"\n        )\n    target_stats = _count_class_sample(y)\n    if sampling_type == \"over-sampling\":\n        n_sample_majority = max(target_stats.values())\n        class_majority = max(target_stats, key=target_stats.get)\n        sampling_strategy_ = {\n            key: int(n_sample_majority * sampling_strategy - value)\n            for (key, value) in target_stats.items()\n            if key != class_majority\n        }\n        if any(n_samples <= 0 for n_samples in sampling_strategy_.values()):\n            raise ValueError(\n                \"The specified ratio required to remove samples \"\n                \"from the minority class while trying to \"\n                \"generate new samples. Please increase the \"\n                \"ratio.\"\n            )\n    elif sampling_type == \"under-sampling\":\n        n_sample_minority = min(target_stats.values())\n        class_minority = min(target_stats, key=target_stats.get)\n        sampling_strategy_ = {\n            key: int(n_sample_minority / sampling_strategy)\n            for (key, value) in target_stats.items()\n            if key != class_minority\n        }\n        if any(\n            n_samples > target_stats[target]\n            for target, n_samples in sampling_strategy_.items()\n        ):\n            raise ValueError(\n                \"The specified ratio required to generate new \"\n                \"sample in the majority class while trying to \"\n                \"remove samples. 
Please increase the ratio.\"\n            )\n    else:\n        raise ValueError(\n            \"'clean-sampling' methods do not let the user specify the sampling ratio.\"\n        )\n    return sampling_strategy_\n\n\ndef check_sampling_strategy(sampling_strategy, y, sampling_type, **kwargs):\n    \"\"\"Sampling target validation for samplers.\n\n    Checks that ``sampling_strategy`` is of consistent type and returns a\n    dictionary containing each targeted class with its corresponding\n    number of samples. It is used in :class:`~imblearn.base.BaseSampler`.\n\n    Parameters\n    ----------\n    sampling_strategy : float, str, dict, list or callable\n        Sampling information to sample the data set.\n\n        - When ``float``:\n\n            For **under-sampling methods**, it corresponds to the ratio\n            :math:`\\\\alpha_{us}` defined by :math:`N_{m} = \\\\alpha_{us}\n            \\\\times N_{rM}` where :math:`N_{rM}` and :math:`N_{m}` are the\n            number of samples in the majority class after resampling and the\n            number of samples in the minority class, respectively;\n\n            For **over-sampling methods**, it corresponds to the ratio\n            :math:`\\\\alpha_{os}` defined by :math:`N_{rm} = \\\\alpha_{os}\n            \\\\times N_{M}` where :math:`N_{rm}` and :math:`N_{M}` are the\n            number of samples in the minority class after resampling and the\n            number of samples in the majority class, respectively.\n\n            .. warning::\n               ``float`` is only available for **binary** classification. An\n               error is raised for multi-class classification and with cleaning\n               samplers.\n\n        - When ``str``, specify the class targeted by the resampling. For\n          **under- and over-sampling methods**, the number of samples in the\n          different classes will be equalized. For **cleaning methods**, the\n          number of samples will not be equal. 
Possible choices are:\n\n            ``'minority'``: resample only the minority class;\n\n            ``'majority'``: resample only the majority class;\n\n            ``'not minority'``: resample all classes but the minority class;\n\n            ``'not majority'``: resample all classes but the majority class;\n\n            ``'all'``: resample all classes;\n\n            ``'auto'``: for under-sampling methods, equivalent to ``'not\n            minority'`` and for over-sampling methods, equivalent to ``'not\n            majority'``.\n\n        - When ``dict``, the keys correspond to the targeted classes. The\n          values correspond to the desired number of samples for each targeted\n          class.\n\n          .. warning::\n             ``dict`` is available for both **under- and over-sampling\n             methods**. An error is raised with **cleaning methods**. Use a\n             ``list`` instead.\n\n        - When ``list``, the list contains the targeted classes. It is used only\n          for **cleaning methods**.\n\n          .. warning::\n             ``list`` is available for **cleaning methods**. An error is raised\n             with **under- and over-sampling methods**.\n\n        - When callable, a function taking ``y`` and returning a ``dict``. The\n          keys correspond to the targeted classes. The values correspond to\n          the desired number of samples for each class.\n\n    y : ndarray of shape (n_samples,)\n        The target array.\n\n    sampling_type : {'over-sampling', 'under-sampling', 'clean-sampling'}\n        The type of sampling. Can be either ``'over-sampling'``,\n        ``'under-sampling'``, or ``'clean-sampling'``.\n\n    **kwargs : dict\n        Dictionary of additional keyword arguments to pass to\n        ``sampling_strategy`` when this is a callable.\n\n    Returns\n    -------\n    sampling_strategy_converted : dict\n        The converted and validated sampling target. 
Returns a dictionary with\n        the key being the class target and the value being the desired\n        number of samples.\n    \"\"\"\n    if sampling_type not in SAMPLING_KIND:\n        raise ValueError(\n            f\"'sampling_type' should be one of {SAMPLING_KIND}. \"\n            f\"Got '{sampling_type}' instead.\"\n        )\n\n    if np.unique(y).size <= 1:\n        raise ValueError(\n            \"The target 'y' needs to have more than 1 class. \"\n            f\"Got {np.unique(y).size} class instead\"\n        )\n\n    if sampling_type in (\"ensemble\", \"bypass\"):\n        return sampling_strategy\n\n    if isinstance(sampling_strategy, str):\n        if sampling_strategy not in SAMPLING_TARGET_KIND.keys():\n            raise ValueError(\n                \"When 'sampling_strategy' is a string, it needs\"\n                f\" to be one of {SAMPLING_TARGET_KIND}. Got '{sampling_strategy}' \"\n                \"instead.\"\n            )\n        return OrderedDict(\n            sorted(SAMPLING_TARGET_KIND[sampling_strategy](y, sampling_type).items())\n        )\n    elif isinstance(sampling_strategy, dict):\n        return OrderedDict(\n            sorted(_sampling_strategy_dict(sampling_strategy, y, sampling_type).items())\n        )\n    elif isinstance(sampling_strategy, list):\n        return OrderedDict(\n            sorted(_sampling_strategy_list(sampling_strategy, y, sampling_type).items())\n        )\n    elif isinstance(sampling_strategy, Real):\n        if sampling_strategy <= 0 or sampling_strategy > 1:\n            raise ValueError(\n                \"When 'sampling_strategy' is a float, it should be \"\n                f\"in the range (0, 1]. 
Got {sampling_strategy} instead.\"\n            )\n        return OrderedDict(\n            sorted(\n                _sampling_strategy_float(sampling_strategy, y, sampling_type).items()\n            )\n        )\n    elif callable(sampling_strategy):\n        sampling_strategy_ = sampling_strategy(y, **kwargs)\n        return OrderedDict(\n            sorted(\n                _sampling_strategy_dict(sampling_strategy_, y, sampling_type).items()\n            )\n        )\n\n\nSAMPLING_TARGET_KIND = {\n    \"minority\": _sampling_strategy_minority,\n    \"majority\": _sampling_strategy_majority,\n    \"not minority\": _sampling_strategy_not_minority,\n    \"not majority\": _sampling_strategy_not_majority,\n    \"all\": _sampling_strategy_all,\n    \"auto\": _sampling_strategy_auto,\n}\n\n\ndef _deprecate_positional_args(f):\n    \"\"\"Decorator for methods that issues warnings for positional arguments\n\n    Using the keyword-only argument syntax in pep 3102, arguments after the\n    * will issue a warning when passed as a positional argument.\n\n    Parameters\n    ----------\n    f : function\n        function to check arguments on.\n    \"\"\"\n    sig = signature(f)\n    kwonly_args = []\n    all_args = []\n\n    for name, param in sig.parameters.items():\n        if param.kind == Parameter.POSITIONAL_OR_KEYWORD:\n            all_args.append(name)\n        elif param.kind == Parameter.KEYWORD_ONLY:\n            kwonly_args.append(name)\n\n    @wraps(f)\n    def inner_f(*args, **kwargs):\n        extra_args = len(args) - len(all_args)\n        if extra_args > 0:\n            # ignore first 'self' argument for instance methods\n            args_msg = [\n                f\"{name}={arg}\"\n                for name, arg in zip(kwonly_args[:extra_args], args[-extra_args:])\n            ]\n            warnings.warn(\n                (\n                    f\"Pass {', '.join(args_msg)} as keyword args. 
From version 0.9 \"\n                    \"passing these as positional arguments will \"\n                    \"result in an error\"\n                ),\n                FutureWarning,\n            )\n        kwargs.update(dict(zip(sig.parameters, args)))\n        return f(**kwargs)\n\n    return inner_f\n\n\ndef _check_X(X):\n    \"\"\"Check X and do not check it if a dataframe.\"\"\"\n    n_samples = _num_samples(X)\n    if n_samples < 1:\n        raise ValueError(\n            f\"Found array with {n_samples} sample(s) while a minimum of 1 is required.\"\n        )\n    if is_pandas_df(X):\n        return X\n    return check_array(\n        X, dtype=None, accept_sparse=[\"csr\", \"csc\"], ensure_all_finite=False\n    )\n"
  },
  {
    "path": "imblearn/utils/deprecation.py",
    "content": "\"\"\"Utilities for deprecation\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# License: MIT\n\nimport warnings\n\n\ndef deprecate_parameter(sampler, version_deprecation, param_deprecated, new_param=None):\n    \"\"\"Helper to deprecate a parameter, optionally in favor of another one.\n\n    Parameters\n    ----------\n    sampler : sampler object\n        The object which will be inspected.\n\n    version_deprecation : str\n        The version from which the parameter will be deprecated. The format\n        should be ``'x.y'``.\n\n    param_deprecated : str\n        The parameter being deprecated.\n\n    new_param : str, default=None\n        The parameter used instead of the deprecated parameter. By default, no\n        parameter is expected.\n    \"\"\"\n    x, y = version_deprecation.split(\".\")\n    version_removed = x + \".\" + str(int(y) + 2)\n    if new_param is None:\n        if getattr(sampler, param_deprecated) is not None:\n            warnings.warn(\n                (\n                    f\"'{param_deprecated}' is deprecated from {version_deprecation} and\"\n                    f\" will be removed in {version_removed} for the estimator\"\n                    f\" {sampler.__class__}.\"\n                ),\n                category=FutureWarning,\n            )\n    else:\n        if getattr(sampler, param_deprecated) is not None:\n            warnings.warn(\n                (\n                    f\"'{param_deprecated}' is deprecated from {version_deprecation} and\"\n                    f\" will be removed in {version_removed} for the estimator\"\n                    f\" {sampler.__class__}. Use '{new_param}' instead.\"\n                ),\n                category=FutureWarning,\n            )\n            setattr(sampler, new_param, getattr(sampler, param_deprecated))\n"
  },
  {
    "path": "imblearn/utils/estimator_checks.py",
    "content": "\"\"\"Utils to check the samplers and compatibility with scikit-learn\"\"\"\n\n# Adapted from scikit-learn\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# License: MIT\n\nimport re\nimport sys\nimport traceback\nimport warnings\nfrom collections import Counter\nfrom functools import partial, wraps\n\nimport numpy as np\nfrom scipy import sparse\nfrom sklearn.base import clone, is_classifier, is_regressor\nfrom sklearn.cluster import KMeans\nfrom sklearn.datasets import (  # noqa\n    load_iris,\n    make_blobs,\n    make_classification,\n    make_multilabel_classification,\n)\nfrom sklearn.exceptions import SkipTestWarning\nfrom sklearn.preprocessing import StandardScaler, label_binarize\nfrom sklearn.utils._param_validation import generate_invalid_param_val, make_constraint\nfrom sklearn.utils._testing import (\n    SkipTest,\n    assert_allclose,\n    assert_array_equal,\n    raises,\n    set_random_state,\n)\nfrom sklearn.utils.estimator_checks import (\n    _enforce_estimator_tags_X,\n    _enforce_estimator_tags_y,\n)\nfrom sklearn.utils.multiclass import type_of_target\n\nfrom imblearn.datasets import make_imbalance\nfrom imblearn.over_sampling.base import BaseOverSampler\nfrom imblearn.under_sampling.base import BaseCleaningSampler, BaseUnderSampler\nfrom imblearn.utils._tags import get_tags\nfrom imblearn.utils._test_common.instance_generator import (\n    _get_check_estimator_ids,\n    _yield_instances_for_check,\n)\n\n\ndef sample_dataset_generator():\n    X, y = make_classification(\n        n_samples=1000,\n        n_classes=3,\n        n_informative=4,\n        weights=[0.2, 0.3, 0.5],\n        random_state=0,\n    )\n    return X, y\n\n\ndef _set_checking_parameters(estimator):\n    params = estimator.get_params()\n    name = estimator.__class__.__name__\n    if \"n_estimators\" in params:\n        estimator.set_params(n_estimators=min(5, estimator.n_estimators))\n    if name == \"ClusterCentroids\":\n        algorithm = 
\"lloyd\"\n        estimator.set_params(\n            voting=\"soft\",\n            estimator=KMeans(random_state=0, algorithm=algorithm, n_init=1),\n        )\n    if name == \"KMeansSMOTE\":\n        estimator.set_params(kmeans_estimator=12)\n\n\ndef _yield_sampler_checks(sampler):\n    tags = get_tags(sampler)\n    accept_sparse = tags.input_tags.sparse\n    accept_dataframe = tags.input_tags.dataframe\n    accept_string = tags.input_tags.string\n    allow_nan = tags.input_tags.allow_nan\n\n    yield check_target_type\n    yield check_samplers_one_label\n    yield check_samplers_fit\n    yield check_samplers_fit_resample\n    yield check_samplers_sampling_strategy_fit_resample\n    if accept_sparse:\n        yield check_samplers_sparse\n    if accept_dataframe:\n        yield check_samplers_pandas\n        yield check_samplers_pandas_sparse\n    if accept_string:\n        yield check_samplers_string\n    if allow_nan:\n        yield check_samplers_nan\n    yield check_samplers_list\n    yield check_samplers_multiclass_ova\n    yield check_samplers_preserve_dtype\n    # we don't filter samplers based on their tag here because we want to make\n    # sure that the fitted attribute does not exist if the tag is not\n    # stipulated\n    yield check_samplers_sample_indices\n    yield check_samplers_2d_target\n    yield check_sampler_get_feature_names_out\n    yield check_sampler_get_feature_names_out_pandas\n\n\ndef _yield_classifier_checks(classifier):\n    yield check_classifier_on_multilabel_or_multioutput_targets\n    yield check_classifiers_with_encoded_labels\n\n\ndef _yield_all_checks(estimator, legacy=True):\n    name = estimator.__class__.__name__\n    tags = get_tags(estimator)\n\n    skip_test = tags._skip_test\n    if skip_test:\n        warnings.warn(\n            f\"Explicit SKIP via _skip_test tag for estimator {name}.\",\n            SkipTestWarning,\n        )\n        return\n    # trigger our checks if this is a SamplerMixin\n    if 
hasattr(estimator, \"fit_resample\"):\n        for check in _yield_sampler_checks(estimator):\n            yield check\n    if hasattr(estimator, \"predict\"):\n        for check in _yield_classifier_checks(estimator):\n            yield check\n\n\ndef _check_name(check):\n    if hasattr(check, \"__wrapped__\"):\n        return _check_name(check.__wrapped__)\n    return check.func.__name__ if isinstance(check, partial) else check.__name__\n\n\ndef _maybe_mark(estimator, check, expected_failed_checks=None, mark=None, pytest=None):\n    \"\"\"Mark the test as xfail or skip if needed.\n\n    Parameters\n    ----------\n    estimator : estimator object\n        Estimator instance for which to generate checks.\n    check : partial or callable\n        Check to be marked.\n    expected_failed_checks : dict[str, str], default=None\n        Dictionary of the form {check_name: reason} for checks that are expected to\n        fail.\n    mark : \"xfail\" or \"skip\" or None\n        Whether to mark the check as xfail or skip.\n    pytest : pytest module, default=None\n        Pytest module to use to mark the check. This is only needed if ``mark`` is\n        `\"xfail\"`. Note that one can run `check_estimator` without having `pytest`\n        installed. 
This is used in combination with `parametrize_with_checks` only.\n    \"\"\"\n    should_be_marked, reason = _should_be_skipped_or_marked(\n        estimator, check, expected_failed_checks\n    )\n    if not should_be_marked or mark is None:\n        return estimator, check\n\n    estimator_name = estimator.__class__.__name__\n    if mark == \"xfail\":\n        return pytest.param(estimator, check, marks=pytest.mark.xfail(reason=reason))\n    else:\n\n        @wraps(check)\n        def wrapped(*args, **kwargs):\n            raise SkipTest(\n                f\"Skipping {_check_name(check)} for {estimator_name}: {reason}\"\n            )\n\n        return estimator, wrapped\n\n\ndef _should_be_skipped_or_marked(\n    estimator, check, expected_failed_checks: dict[str, str] | None = None\n) -> tuple[bool, str]:\n    \"\"\"Check whether a check should be skipped or marked as xfail.\n\n    Parameters\n    ----------\n    estimator : estimator object\n        Estimator instance for which to generate checks.\n    check : partial or callable\n        Check to be marked.\n    expected_failed_checks : dict[str, str], default=None\n        Dictionary of the form {check_name: reason} for checks that are expected to\n        fail.\n\n    Returns\n    -------\n    should_be_marked : bool\n        Whether the check should be marked as xfail or skipped.\n    reason : str\n        Reason for skipping the check.\n    \"\"\"\n\n    expected_failed_checks = expected_failed_checks or {}\n\n    check_name = _check_name(check)\n    if check_name in expected_failed_checks:\n        return True, expected_failed_checks[check_name]\n\n    return False, \"Check is not expected to fail\"\n\n\ndef estimator_checks_generator(\n    estimator, *, legacy=True, expected_failed_checks=None, mark=None\n):\n    \"\"\"Iteratively yield all check callables for an estimator.\n\n    .. 
versionadded:: 1.6\n\n    Parameters\n    ----------\n    estimator : estimator object\n        Estimator instance for which to generate checks.\n    legacy : bool, default=True\n        Whether to include legacy checks. Over time we remove checks from this category\n        and move them into their specific category.\n    expected_failed_checks : dict[str, str], default=None\n        Dictionary of the form {check_name: reason} for checks that are expected to\n        fail.\n    mark : {\"xfail\", \"skip\"} or None, default=None\n        Whether to mark the checks that are expected to fail as\n        xfail(`pytest.mark.xfail`) or skip. Marking a test as \"skip\" is done via\n        wrapping the check in a function that raises a\n        :class:`~sklearn.exceptions.SkipTest` exception.\n\n    Returns\n    -------\n    estimator_checks_generator : generator\n        Generator that yields (estimator, check) tuples.\n    \"\"\"\n    if mark == \"xfail\":\n        import pytest\n    else:\n        pytest = None  # type: ignore\n\n    name = type(estimator).__name__\n    for check in _yield_all_checks(estimator, legacy=legacy):\n        check_with_name = partial(check, name)\n        for check_instance in _yield_instances_for_check(check, estimator):\n            yield _maybe_mark(\n                check_instance,\n                check_with_name,\n                expected_failed_checks=expected_failed_checks,\n                mark=mark,\n                pytest=pytest,\n            )\n\n\ndef parametrize_with_checks(estimators, *, legacy=True, expected_failed_checks=None):\n    \"\"\"Pytest specific decorator for parametrizing estimator checks.\n\n    Checks are categorised into the following groups:\n\n    - API checks: a set of checks to ensure API compatibility with scikit-learn.\n      Refer to https://scikit-learn.org/dev/developers/develop.html a requirement of\n      scikit-learn estimators.\n    - legacy: a set of checks which gradually will be grouped into 
other categories.\n\n    The `id` of each check is set to be a pprint version of the estimator\n    and the name of the check with its keyword arguments.\n    This allows to use `pytest -k` to specify which tests to run::\n\n        pytest test_check_estimators.py -k check_estimators_fit_returns_self\n\n    Parameters\n    ----------\n    estimators : list of estimators instances\n        Estimators to generated checks for.\n\n        .. versionchanged:: 0.24\n           Passing a class was deprecated in version 0.23, and support for\n           classes was removed in 0.24. Pass an instance instead.\n\n        .. versionadded:: 0.24\n\n\n    legacy : bool, default=True\n        Whether to include legacy checks. Over time we remove checks from this category\n        and move them into their specific category.\n\n        .. versionadded:: 1.6\n\n    expected_failed_checks : callable, default=None\n        A callable that takes an estimator as input and returns a dictionary of the\n        form::\n\n            {\n                \"check_name\": \"my reason\",\n            }\n\n        Where `\"check_name\"` is the name of the check, and `\"my reason\"` is why\n        the check fails. These tests will be marked as xfail if the check fails.\n\n\n        .. versionadded:: 1.6\n\n    Returns\n    -------\n    decorator : `pytest.mark.parametrize`\n\n    See Also\n    --------\n    check_estimator : Check if estimator adheres to scikit-learn conventions.\n\n    Examples\n    --------\n    >>> from sklearn.utils.estimator_checks import parametrize_with_checks\n    >>> from sklearn.linear_model import LogisticRegression\n    >>> from sklearn.tree import DecisionTreeRegressor\n\n    >>> @parametrize_with_checks([LogisticRegression(),\n    ...                           DecisionTreeRegressor()])\n    ... def test_sklearn_compatible_estimator(estimator, check):\n    ...     
check(estimator)\n\n    \"\"\"\n    import pytest\n\n    if any(isinstance(est, type) for est in estimators):\n        msg = (\n            \"Passing a class was deprecated in version 0.23 \"\n            \"and isn't supported anymore from 0.24.\"\n            \"Please pass an instance instead.\"\n        )\n        raise TypeError(msg)\n\n    def _checks_generator(estimators, legacy, expected_failed_checks):\n        for estimator in estimators:\n            args = {\"estimator\": estimator, \"legacy\": legacy, \"mark\": \"xfail\"}\n            if callable(expected_failed_checks):\n                args[\"expected_failed_checks\"] = expected_failed_checks(estimator)\n            yield from estimator_checks_generator(**args)\n\n    return pytest.mark.parametrize(\n        \"estimator, check\",\n        _checks_generator(estimators, legacy, expected_failed_checks),\n        ids=_get_check_estimator_ids,\n    )\n\n\ndef check_target_type(name, estimator_orig):\n    estimator = clone(estimator_orig)\n    # should raise warning if the target is continuous (we cannot raise error)\n    X = np.random.random((20, 2))\n    y = np.linspace(0, 1, 20)\n    msg = \"Unknown label type:\"\n    with raises(ValueError, err_msg=msg):\n        estimator.fit_resample(X, y)\n    # if the target is multilabel then we should raise an error\n    rng = np.random.RandomState(42)\n    y = rng.randint(2, size=(20, 3))\n    msg = \"Multilabel and multioutput targets are not supported.\"\n    with raises(ValueError, err_msg=msg):\n        estimator.fit_resample(X, y)\n\n\ndef check_samplers_one_label(name, sampler_orig):\n    sampler = clone(sampler_orig)\n    error_string_fit = \"Sampler can't balance when only one class is present.\"\n    X = np.random.random((20, 2))\n    y = np.zeros(20)\n    try:\n        sampler.fit_resample(X, y)\n    except ValueError as e:\n        if \"class\" not in repr(e):\n            print(error_string_fit, sampler.__class__.__name__, e)\n            
traceback.print_exc(file=sys.stdout)\n            raise e\n        else:\n            return\n    except Exception as exc:\n        print(error_string_fit, traceback, exc)\n        traceback.print_exc(file=sys.stdout)\n        raise exc\n    raise AssertionError(error_string_fit)\n\n\ndef check_samplers_fit(name, sampler_orig):\n    sampler = clone(sampler_orig)\n    np.random.seed(42)  # Make this test reproducible\n    X = np.random.random((30, 2))\n    y = np.array([1] * 20 + [0] * 10)\n    sampler.fit_resample(X, y)\n    assert hasattr(\n        sampler, \"sampling_strategy_\"\n    ), \"No fitted attribute sampling_strategy_\"\n\n\ndef check_samplers_fit_resample(name, sampler_orig):\n    sampler = clone(sampler_orig)\n    X, y = sample_dataset_generator()\n    target_stats = Counter(y)\n    X_res, y_res = sampler.fit_resample(X, y)\n    if isinstance(sampler, BaseOverSampler):\n        target_stats_res = Counter(y_res)\n        n_samples = max(target_stats.values())\n        assert all(value >= n_samples for value in Counter(y_res).values())\n    elif isinstance(sampler, BaseUnderSampler):\n        n_samples = min(target_stats.values())\n        if name == \"InstanceHardnessThreshold\":\n            # IHT does not enforce the number of samples but provide a number\n            # of samples the closest to the desired target.\n            assert all(\n                Counter(y_res)[k] <= target_stats[k] for k in target_stats.keys()\n            )\n        else:\n            assert all(value == n_samples for value in Counter(y_res).values())\n    elif isinstance(sampler, BaseCleaningSampler):\n        target_stats_res = Counter(y_res)\n        class_minority = min(target_stats, key=target_stats.get)\n        assert all(\n            target_stats[class_sample] > target_stats_res[class_sample]\n            for class_sample in target_stats.keys()\n            if class_sample != class_minority\n        )\n\n\ndef check_samplers_sampling_strategy_fit_resample(name, 
sampler_orig):\n    sampler = clone(sampler_orig)\n    # in this test we will force all samplers to not change the class 1\n    X, y = sample_dataset_generator()\n    expected_stat = Counter(y)[1]\n    if isinstance(sampler, BaseOverSampler):\n        sampling_strategy = {2: 498, 0: 498}\n        sampler.set_params(sampling_strategy=sampling_strategy)\n        X_res, y_res = sampler.fit_resample(X, y)\n        assert Counter(y_res)[1] == expected_stat\n    elif isinstance(sampler, BaseUnderSampler):\n        sampling_strategy = {2: 201, 0: 201}\n        sampler.set_params(sampling_strategy=sampling_strategy)\n        X_res, y_res = sampler.fit_resample(X, y)\n        assert Counter(y_res)[1] == expected_stat\n    elif isinstance(sampler, BaseCleaningSampler):\n        sampling_strategy = [2, 0]\n        sampler.set_params(sampling_strategy=sampling_strategy)\n        X_res, y_res = sampler.fit_resample(X, y)\n        assert Counter(y_res)[1] == expected_stat\n\n\ndef check_samplers_sparse(name, sampler_orig):\n    sampler = clone(sampler_orig)\n    # check that sparse matrices can be passed through the sampler leading to\n    # the same results than dense\n    X, y = sample_dataset_generator()\n    X_sparse = sparse.csr_matrix(X)\n    X_res_sparse, y_res_sparse = sampler.fit_resample(X_sparse, y)\n    sampler = clone(sampler)\n    X_res, y_res = sampler.fit_resample(X, y)\n    assert sparse.issparse(X_res_sparse)\n    assert_allclose(X_res_sparse.toarray(), X_res, rtol=1e-5)\n    assert_allclose(y_res_sparse, y_res)\n\n\ndef check_samplers_pandas_sparse(name, sampler_orig):\n    try:\n        import pandas as pd\n    except ImportError:\n        raise SkipTest(\n            \"pandas is not installed: not checking column name consistency for pandas\"\n        )\n    sampler = clone(sampler_orig)\n    # Check that the samplers handle pandas dataframe and pandas series\n    X, y = sample_dataset_generator()\n    X_df = pd.DataFrame(\n        X, columns=[str(i) for i 
in range(X.shape[1])], dtype=pd.SparseDtype(float, 0)\n    )\n    y_s = pd.Series(y, name=\"class\")\n\n    X_res_df, y_res_s = sampler.fit_resample(X_df, y_s)\n    X_res, y_res = sampler.fit_resample(X, y)\n\n    # check that we return the same type for dataframes or series types\n    assert isinstance(X_res_df, pd.DataFrame)\n    assert isinstance(y_res_s, pd.Series)\n\n    for column_dtype in X_res_df.dtypes:\n        assert isinstance(column_dtype, pd.SparseDtype)\n\n    assert X_df.columns.tolist() == X_res_df.columns.tolist()\n    assert y_s.name == y_res_s.name\n\n    assert_allclose(X_res_df.to_numpy(), X_res)\n    assert_allclose(y_res_s.to_numpy(), y_res)\n\n\ndef check_samplers_pandas(name, sampler_orig):\n    try:\n        import pandas as pd\n    except ImportError:\n        raise SkipTest(\n            \"pandas is not installed: not checking column name consistency for pandas\"\n        )\n    sampler = clone(sampler_orig)\n    # Check that the samplers handle pandas dataframe and pandas series\n    X, y = sample_dataset_generator()\n    X_df = pd.DataFrame(X, columns=[str(i) for i in range(X.shape[1])])\n    y_df = pd.DataFrame(y)\n    y_s = pd.Series(y, name=\"class\")\n\n    X_res_df, y_res_s = sampler.fit_resample(X_df, y_s)\n    X_res_df, y_res_df = sampler.fit_resample(X_df, y_df)\n    X_res, y_res = sampler.fit_resample(X, y)\n\n    # check that we return the same type for dataframes or series types\n    assert isinstance(X_res_df, pd.DataFrame)\n    assert isinstance(y_res_df, pd.DataFrame)\n    assert isinstance(y_res_s, pd.Series)\n\n    assert X_df.columns.tolist() == X_res_df.columns.tolist()\n    assert y_df.columns.tolist() == y_res_df.columns.tolist()\n    assert y_s.name == y_res_s.name\n\n    assert_allclose(X_res_df.to_numpy(), X_res)\n    assert_allclose(y_res_df.to_numpy().ravel(), y_res)\n    assert_allclose(y_res_s.to_numpy(), y_res)\n\n\ndef check_samplers_list(name, sampler_orig):\n    sampler = clone(sampler_orig)\n    # Check 
that the samplers can handle simple lists\n    X, y = sample_dataset_generator()\n    X_list = X.tolist()\n    y_list = y.tolist()\n\n    X_res, y_res = sampler.fit_resample(X, y)\n    X_res_list, y_res_list = sampler.fit_resample(X_list, y_list)\n\n    assert isinstance(X_res_list, list)\n    assert isinstance(y_res_list, list)\n\n    assert_allclose(X_res, X_res_list)\n    assert_allclose(y_res, y_res_list)\n\n\ndef check_samplers_multiclass_ova(name, sampler_orig):\n    sampler = clone(sampler_orig)\n    # Check that a multiclass target leads to the same results as OVA encoding\n    X, y = sample_dataset_generator()\n    y_ova = label_binarize(y, classes=np.unique(y))\n    X_res, y_res = sampler.fit_resample(X, y)\n    X_res_ova, y_res_ova = sampler.fit_resample(X, y_ova)\n    assert_allclose(X_res, X_res_ova)\n    assert type_of_target(y_res_ova) == type_of_target(y_ova)\n    assert_allclose(y_res, y_res_ova.argmax(axis=1))\n\n\ndef check_samplers_2d_target(name, sampler_orig):\n    sampler = clone(sampler_orig)\n    X, y = sample_dataset_generator()\n\n    y = y.reshape(-1, 1)  # Make the target 2d\n    sampler.fit_resample(X, y)\n\n\ndef check_samplers_preserve_dtype(name, sampler_orig):\n    sampler = clone(sampler_orig)\n    X, y = sample_dataset_generator()\n    # Cast X and y to non-default dtypes\n    X = X.astype(np.float32)\n    y = y.astype(np.int32)\n    X_res, y_res = sampler.fit_resample(X, y)\n    assert X.dtype == X_res.dtype, \"X dtype is not preserved\"\n    assert y.dtype == y_res.dtype, \"y dtype is not preserved\"\n\n\ndef check_samplers_sample_indices(name, sampler_orig):\n    sampler = clone(sampler_orig)\n    X, y = sample_dataset_generator()\n    sampler.fit_resample(X, y)\n    tags = get_tags(sampler)\n    if tags.sampler_tags.sample_indices:\n        assert hasattr(sampler, \"sample_indices_\") is tags.sampler_tags.sample_indices\n    else:\n        assert not hasattr(sampler, \"sample_indices_\")\n\n\ndef check_samplers_string(name, 
sampler_orig):\n    rng = np.random.RandomState(0)\n    sampler = clone(sampler_orig)\n    categories = np.array([\"A\", \"B\", \"C\"], dtype=object)\n    n_samples = 30\n    X = rng.randint(low=0, high=3, size=n_samples).reshape(-1, 1)\n    X = categories[X]\n    y = rng.permutation([0] * 10 + [1] * 20)\n\n    X_res, y_res = sampler.fit_resample(X, y)\n    assert X_res.dtype == object\n    assert X_res.shape[0] == y_res.shape[0]\n    assert_array_equal(np.unique(X_res.ravel()), categories)\n\n\ndef check_samplers_nan(name, sampler_orig):\n    rng = np.random.RandomState(0)\n    sampler = clone(sampler_orig)\n    categories = np.array([0, 1, np.nan], dtype=np.float64)\n    n_samples = 100\n    X = rng.randint(low=0, high=3, size=n_samples).reshape(-1, 1)\n    X = categories[X]\n    y = rng.permutation([0] * 40 + [1] * 60)\n\n    X_res, y_res = sampler.fit_resample(X, y)\n    assert X_res.dtype == np.float64\n    assert X_res.shape[0] == y_res.shape[0]\n    assert np.any(np.isnan(X_res.ravel()))\n\n\ndef check_classifier_on_multilabel_or_multioutput_targets(name, estimator_orig):\n    estimator = clone(estimator_orig)\n    X, y = make_multilabel_classification(n_samples=30)\n    msg = \"Multilabel and multioutput targets are not supported.\"\n    with raises(ValueError, match=msg):\n        estimator.fit(X, y)\n\n\ndef check_classifiers_with_encoded_labels(name, classifier_orig):\n    # Non-regression test for #709\n    # https://github.com/scikit-learn-contrib/imbalanced-learn/issues/709\n    try:\n        import pandas as pd\n    except ImportError:\n        raise SkipTest(\n            \"pandas is not installed: not checking column name consistency for pandas\"\n        )\n    classifier = clone(classifier_orig)\n    iris = load_iris(as_frame=True)\n    df, y = iris.data, iris.target\n    y = pd.Series(iris.target_names[iris.target], dtype=\"category\")\n    df, y = make_imbalance(\n        df,\n        y,\n        sampling_strategy={\n            \"setosa\": 
30,\n            \"versicolor\": 20,\n            \"virginica\": 50,\n        },\n    )\n    classifier.fit(df, y)\n    assert set(classifier.classes_) == set(y.cat.categories.tolist())\n    y_pred = classifier.predict(df)\n    assert set(y_pred) == set(y.cat.categories.tolist())\n\n\ndef check_param_validation(name, estimator_orig):\n    # Check that an informative error is raised when the value of a constructor\n    # parameter does not have an appropriate type or value.\n    rng = np.random.RandomState(0)\n    X = rng.uniform(size=(20, 5))\n    y = rng.randint(0, 2, size=20)\n    y = _enforce_estimator_tags_y(estimator_orig, y)\n\n    estimator_params = estimator_orig.get_params(deep=False).keys()\n\n    # check that there is a constraint for each parameter\n    if estimator_params:\n        validation_params = estimator_orig._parameter_constraints.keys()\n        unexpected_params = set(validation_params) - set(estimator_params)\n        missing_params = set(estimator_params) - set(validation_params)\n        err_msg = (\n            f\"Mismatch between _parameter_constraints and the parameters of {name}.\"\n            f\"\\nConsider the unexpected parameters {unexpected_params} and expected but\"\n            f\" missing parameters {missing_params}\"\n        )\n        assert validation_params == estimator_params, err_msg\n\n    # this object does not have a valid type for sure for all params\n    param_with_bad_type = type(\"BadType\", (), {})()\n\n    fit_methods = [\"fit\", \"partial_fit\", \"fit_transform\", \"fit_predict\", \"fit_resample\"]\n\n    for param_name in estimator_params:\n        constraints = estimator_orig._parameter_constraints[param_name]\n\n        if constraints == \"no_validation\":\n            # This parameter is not validated\n            continue  # pragma: no cover\n\n        match = rf\"The '{param_name}' parameter of {name} must be .* Got .* instead.\"\n        err_msg = (\n            f\"{name} does not raise an informative 
error message when the \"\n            f\"parameter {param_name} does not have a valid type or value.\"\n        )\n\n        estimator = clone(estimator_orig)\n\n        # First, check that the error is raised if param doesn't match any valid type.\n        estimator.set_params(**{param_name: param_with_bad_type})\n\n        for method in fit_methods:\n            if not hasattr(estimator, method):\n                # the method is not accessible with the current set of parameters\n                continue\n\n            with raises(ValueError, match=match, err_msg=err_msg):\n                getattr(estimator, method)(X, y)\n\n        # Then, for constraints that are more than a type constraint, check that the\n        # error is raised if param does match a valid type but does not match any valid\n        # value for this type.\n        constraints = [make_constraint(constraint) for constraint in constraints]\n\n        for constraint in constraints:\n            try:\n                bad_value = generate_invalid_param_val(constraint)\n            except NotImplementedError:\n                continue\n\n            estimator.set_params(**{param_name: bad_value})\n\n            for method in fit_methods:\n                if not hasattr(estimator, method):\n                    # the method is not accessible with the current set of parameters\n                    continue\n\n                with raises(ValueError, match=match, err_msg=err_msg):\n                    getattr(estimator, method)(X, y)\n\n\ndef check_dataframe_column_names_consistency(name, estimator_orig):\n    try:\n        import pandas as pd\n    except ImportError:\n        raise SkipTest(\n            \"pandas is not installed: not checking column name consistency for pandas\"\n        )\n\n    tags = get_tags(estimator_orig)\n    is_supported_X_types = tags.input_tags.two_d_array or tags.input_tags.categorical\n    no_validation = tags.no_validation\n\n    if not is_supported_X_types or 
no_validation:\n        return\n\n    rng = np.random.RandomState(0)\n\n    estimator = clone(estimator_orig)\n    set_random_state(estimator)\n\n    X_orig = rng.normal(size=(150, 8))\n\n    X_orig = _enforce_estimator_tags_X(estimator, X_orig)\n    n_samples, n_features = X_orig.shape\n\n    names = np.array([f\"col_{i}\" for i in range(n_features)])\n    X = pd.DataFrame(X_orig, columns=names)\n\n    if is_regressor(estimator):\n        y = rng.normal(size=n_samples)\n    else:\n        y = rng.randint(low=0, high=2, size=n_samples)\n    y = _enforce_estimator_tags_y(estimator, y)\n\n    # Check that calling `fit` does not raise any warnings about feature names.\n    with warnings.catch_warnings():\n        warnings.filterwarnings(\n            \"error\",\n            message=\"X does not have valid feature names\",\n            category=UserWarning,\n            module=\"imblearn\",\n        )\n        estimator.fit(X, y)\n\n    if not hasattr(estimator, \"feature_names_in_\"):\n        raise ValueError(\n            \"Estimator does not have a feature_names_in_ \"\n            \"attribute after fitting with a dataframe\"\n        )\n    assert isinstance(estimator.feature_names_in_, np.ndarray)\n    assert estimator.feature_names_in_.dtype == object\n    assert_array_equal(estimator.feature_names_in_, names)\n\n    # Only check imblearn estimators for feature_names_in_ in docstring\n    module_name = estimator_orig.__module__\n    if (\n        module_name.startswith(\"imblearn.\")\n        and not (\"test_\" in module_name or module_name.endswith(\"_testing\"))\n        and (\"feature_names_in_\" not in (estimator_orig.__doc__))\n    ):\n        raise ValueError(\n            f\"Estimator {name} does not document its feature_names_in_ attribute\"\n        )\n\n    check_methods = []\n    for method in (\n        \"predict\",\n        \"transform\",\n        \"decision_function\",\n        \"predict_proba\",\n        \"score\",\n        \"score_samples\",\n    
    \"predict_log_proba\",\n    ):\n        if not hasattr(estimator, method):\n            continue\n\n        callable_method = getattr(estimator, method)\n        if method == \"score\":\n            callable_method = partial(callable_method, y=y)\n        check_methods.append((method, callable_method))\n\n    for _, method in check_methods:\n        with warnings.catch_warnings():\n            warnings.filterwarnings(\n                \"error\",\n                message=\"X does not have valid feature names\",\n                category=UserWarning,\n                module=\"sklearn\",\n            )\n            method(X)  # works without UserWarning for valid features\n\n    invalid_names = [\n        (names[::-1], \"Feature names must be in the same order as they were in fit.\"),\n        (\n            [f\"another_prefix_{i}\" for i in range(n_features)],\n            (\n                \"Feature names unseen at fit time:\\n- another_prefix_0\\n-\"\n                \" another_prefix_1\\n\"\n            ),\n        ),\n        (\n            names[:3],\n            f\"Feature names seen at fit time, yet now missing:\\n- {min(names[3:])}\\n\",\n        ),\n    ]\n    params = {\n        key: value\n        for key, value in estimator.get_params().items()\n        if \"early_stopping\" in key\n    }\n    early_stopping_enabled = any(value is True for value in params.values())\n\n    for invalid_name, additional_message in invalid_names:\n        X_bad = pd.DataFrame(X, columns=invalid_name)\n\n        for name, method in check_methods:\n            expected_msg = re.escape(\n                \"The feature names should match those that were passed during fit.\"\n                f\"\\n{additional_message}\"\n            )\n            with raises(\n                ValueError, match=expected_msg, err_msg=f\"{name} did not raise\"\n            ):\n                method(X_bad)\n\n        # partial_fit checks on second call\n        # Do not call partial fit if 
early_stopping is on\n        if not hasattr(estimator, \"partial_fit\") or early_stopping_enabled:\n            continue\n\n        estimator = clone(estimator_orig)\n        if is_classifier(estimator):\n            classes = np.unique(y)\n            estimator.partial_fit(X, y, classes=classes)\n        else:\n            estimator.partial_fit(X, y)\n\n        with raises(ValueError, match=expected_msg):\n            estimator.partial_fit(X_bad, y)\n\n\ndef check_sampler_get_feature_names_out(name, sampler_orig):\n    tags = get_tags(sampler_orig)\n\n    two_d_array = tags.input_tags.two_d_array\n    no_validation = tags.no_validation\n\n    if not two_d_array or no_validation:\n        return\n\n    X, y = make_blobs(\n        n_samples=30,\n        centers=[[0, 0, 0], [1, 1, 1]],\n        random_state=0,\n        n_features=2,\n        cluster_std=0.1,\n    )\n    X = StandardScaler().fit_transform(X)\n\n    sampler = clone(sampler_orig)\n    X = _enforce_estimator_tags_X(sampler, X)\n\n    n_features = X.shape[1]\n    set_random_state(sampler)\n\n    y_ = y\n    X_res, y_res = sampler.fit_resample(X, y=y_)\n    input_features = [f\"feature{i}\" for i in range(n_features)]\n\n    # input_features names is not the same length as n_features_in_\n    with raises(ValueError, match=\"input_features should have length equal\"):\n        sampler.get_feature_names_out(input_features[::2])\n\n    feature_names_out = sampler.get_feature_names_out(input_features)\n    assert feature_names_out is not None\n    assert isinstance(feature_names_out, np.ndarray)\n    assert feature_names_out.dtype == object\n    assert all(isinstance(name, str) for name in feature_names_out)\n\n    n_features_out = X_res.shape[1]\n\n    assert (\n        len(feature_names_out) == n_features_out\n    ), f\"Expected {n_features_out} feature names, got {len(feature_names_out)}\"\n\n\ndef check_sampler_get_feature_names_out_pandas(name, sampler_orig):\n    try:\n        import pandas as pd\n    
except ImportError:\n        raise SkipTest(\n            \"pandas is not installed: not checking column name consistency for pandas\"\n        )\n\n    tags = get_tags(sampler_orig)\n    two_d_array = tags.input_tags.two_d_array\n    no_validation = tags.no_validation\n\n    if not two_d_array or no_validation:\n        return\n\n    X, y = make_blobs(\n        n_samples=30,\n        centers=[[0, 0, 0], [1, 1, 1]],\n        random_state=0,\n        n_features=2,\n        cluster_std=0.1,\n    )\n    X = StandardScaler().fit_transform(X)\n\n    sampler = clone(sampler_orig)\n    X = _enforce_estimator_tags_X(sampler, X)\n\n    n_features = X.shape[1]\n    set_random_state(sampler)\n\n    y_ = y\n    feature_names_in = [f\"col{i}\" for i in range(n_features)]\n    df = pd.DataFrame(X, columns=feature_names_in)\n    X_res, y_res = sampler.fit_resample(df, y=y_)\n\n    # error is raised when `input_features` do not match feature_names_in\n    invalid_feature_names = [f\"bad{i}\" for i in range(n_features)]\n    with raises(ValueError, match=\"input_features is not equal to feature_names_in_\"):\n        sampler.get_feature_names_out(invalid_feature_names)\n\n    feature_names_out_default = sampler.get_feature_names_out()\n    feature_names_in_explicit_names = sampler.get_feature_names_out(feature_names_in)\n    assert_array_equal(feature_names_out_default, feature_names_in_explicit_names)\n\n    n_features_out = X_res.shape[1]\n\n    assert (\n        len(feature_names_out_default) == n_features_out\n    ), f\"Expected {n_features_out} feature names, got {len(feature_names_out_default)}\"\n"
  },
  {
    "path": "imblearn/utils/testing.py",
    "content": "\"\"\"Test utilities.\"\"\"\n\n# Adapted from scikit-learn\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# License: MIT\n\nimport inspect\nimport pkgutil\nfrom importlib import import_module\nfrom operator import itemgetter\nfrom pathlib import Path\n\nimport numpy as np\nfrom scipy import sparse\nfrom sklearn.base import BaseEstimator\nfrom sklearn.neighbors import KDTree\nfrom sklearn.utils._testing import ignore_warnings\n\n\ndef all_estimators(\n    type_filter=None,\n):\n    \"\"\"Get a list of all estimators from imblearn.\n\n    This function crawls the module and gets all classes that inherit\n    from BaseEstimator. Classes that are defined in test-modules are not\n    included.\n    By default meta_estimators are also not included.\n    This function is adapted from sklearn.\n\n    Parameters\n    ----------\n    type_filter : str, list of str, or None, default=None\n        Which kind of estimators should be returned. If None, no\n        filter is applied and all estimators are returned.  
Possible\n        values are 'sampler' to get estimators only of these specific\n        types, or a list of these to get the estimators that fit at\n        least one of the types.\n\n    Returns\n    -------\n    estimators : list of tuples\n        List of (name, class), where ``name`` is the class name as string\n        and ``class`` is the actual type of the class.\n    \"\"\"\n    from imblearn.base import SamplerMixin\n\n    def is_abstract(c):\n        if not (hasattr(c, \"__abstractmethods__\")):\n            return False\n        if not len(c.__abstractmethods__):\n            return False\n        return True\n\n    all_classes = []\n    modules_to_ignore = {\"tests\"}\n    root = str(Path(__file__).parent.parent)\n    # Ignore deprecation warnings triggered at import time and from walking\n    # packages\n    with ignore_warnings(category=FutureWarning):\n        for importer, modname, ispkg in pkgutil.walk_packages(\n            path=[root], prefix=\"imblearn.\"\n        ):\n            mod_parts = modname.split(\".\")\n            if any(part in modules_to_ignore for part in mod_parts) or \"._\" in modname:\n                continue\n            module = import_module(modname)\n            classes = inspect.getmembers(module, inspect.isclass)\n            classes = [\n                (name, est_cls) for name, est_cls in classes if not name.startswith(\"_\")\n            ]\n\n            all_classes.extend(classes)\n\n    all_classes = set(all_classes)\n\n    estimators = [\n        c\n        for c in all_classes\n        if (issubclass(c[1], BaseEstimator) and c[0] != \"BaseEstimator\")\n    ]\n    # get rid of abstract base classes\n    estimators = [c for c in estimators if not is_abstract(c[1])]\n\n    # get rid of sklearn estimators which have been imported in some classes\n    estimators = [c for c in estimators if \"sklearn\" not in c[1].__module__]\n\n    if type_filter is not None:\n        if not isinstance(type_filter, list):\n            
type_filter = [type_filter]\n        else:\n            type_filter = list(type_filter)  # copy\n        filtered_estimators = []\n        filters = {\"sampler\": SamplerMixin}\n        for name, mixin in filters.items():\n            if name in type_filter:\n                type_filter.remove(name)\n                filtered_estimators.extend(\n                    [est for est in estimators if issubclass(est[1], mixin)]\n                )\n        estimators = filtered_estimators\n        if type_filter:\n            raise ValueError(\n                f\"Parameter type_filter must be 'sampler' or None, got {type_filter!r}.\"\n            )\n\n    # drop duplicates, sort for reproducibility\n    # itemgetter is used to ensure the sort does not extend to the 2nd item of\n    # the tuple\n    return sorted(set(estimators), key=itemgetter(0))\n\n\nclass _CustomNearestNeighbors(BaseEstimator):\n    \"\"\"Basic implementation of nearest neighbors not relying on scikit-learn.\n\n    `kneighbors_graph` is ignored and `metric` does not have any impact.\n    \"\"\"\n\n    def __init__(self, n_neighbors=1, metric=\"euclidean\"):\n        self.n_neighbors = n_neighbors\n        self.metric = metric\n\n    def fit(self, X, y=None):\n        X = X.toarray() if sparse.issparse(X) else X\n        self._kd_tree = KDTree(X)\n        return self\n\n    def kneighbors(self, X, n_neighbors=None, return_distance=True):\n        n_neighbors = n_neighbors if n_neighbors is not None else self.n_neighbors\n        X = X.toarray() if sparse.issparse(X) else X\n        distances, indices = self._kd_tree.query(X, k=n_neighbors)\n        if return_distance:\n            return distances, indices\n        return indices\n\n    def kneighbors_graph(self, X=None, n_neighbors=None, mode=\"connectivity\"):\n        \"\"\"This method is not used within imblearn but it is required for\n        duck-typing.\"\"\"\n        pass\n\n\nclass _CustomClusterer(BaseEstimator):\n    \"\"\"Class that mimics a clusterer 
that does not expose `cluster_centers_`.\"\"\"\n\n    def __init__(self, n_clusters=1, expose_cluster_centers=True):\n        self.n_clusters = n_clusters\n        self.expose_cluster_centers = expose_cluster_centers\n\n    def fit(self, X, y=None):\n        if self.expose_cluster_centers:\n            self.cluster_centers_ = np.random.randn(self.n_clusters, X.shape[1])\n        return self\n\n    def predict(self, X):\n        return np.zeros(len(X), dtype=int)\n"
  },
  {
    "path": "imblearn/utils/tests/__init__.py",
    "content": ""
  },
  {
    "path": "imblearn/utils/tests/test_deprecation.py",
    "content": "\"\"\"Test for the deprecation helper\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# License: MIT\n\nimport pytest\n\nfrom imblearn.utils.deprecation import deprecate_parameter\n\n\nclass Sampler:\n    def __init__(self):\n        self.a = \"something\"\n        self.b = \"something\"\n\n\ndef test_deprecate_parameter():\n    with pytest.warns(FutureWarning, match=\"is deprecated from\"):\n        deprecate_parameter(Sampler(), \"0.2\", \"a\")\n    with pytest.warns(FutureWarning, match=\"Use 'b' instead.\"):\n        deprecate_parameter(Sampler(), \"0.2\", \"a\", \"b\")\n"
  },
  {
    "path": "imblearn/utils/tests/test_docstring.py",
    "content": "\"\"\"Test utilities for docstring.\"\"\"\n\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# License: MIT\n\nimport sys\nimport textwrap\n\nimport pytest\n\nfrom imblearn.utils import Substitution\nfrom imblearn.utils._docstring import _n_jobs_docstring, _random_state_docstring\n\n\ndef _dedent_docstring(docstring):\n    \"\"\"Compatibility with Python 3.13+.\n\n    xref: https://github.com/python/cpython/issues/81283\n    \"\"\"\n    return \"\\n\".join([textwrap.dedent(line) for line in docstring.split(\"\\n\")])\n\n\nfunc_docstring = \"\"\"A function.\n\n    Parameters\n    ----------\n    xxx\n\n    yyy\n    \"\"\"\n\n\ndef func(param_1, param_2):\n    \"\"\"A function.\n\n    Parameters\n    ----------\n    {param_1}\n\n    {param_2}\n    \"\"\"\n    return param_1, param_2\n\n\ncls_docstring = \"\"\"A class.\n\n    Parameters\n    ----------\n    xxx\n\n    yyy\n    \"\"\"\n\n\nclass cls:\n    \"\"\"A class.\n\n    Parameters\n    ----------\n    {param_1}\n\n    {param_2}\n    \"\"\"\n\n    def __init__(self, param_1, param_2):\n        self.param_1 = param_1\n        self.param_2 = param_2\n\n\nif sys.version_info >= (3, 13):\n    func_docstring = _dedent_docstring(func_docstring)\n    cls_docstring = _dedent_docstring(cls_docstring)\n\n\n@pytest.mark.parametrize(\n    \"obj, obj_docstring\", [(func, func_docstring), (cls, cls_docstring)]\n)\ndef test_docstring_inject(obj, obj_docstring):\n    obj_injected_docstring = Substitution(param_1=\"xxx\", param_2=\"yyy\")(obj)\n    assert obj_injected_docstring.__doc__ == obj_docstring\n\n\ndef test_docstring_template():\n    assert \"random_state\" in _random_state_docstring\n    assert \"n_jobs\" in _n_jobs_docstring\n\n\ndef test_docstring_with_python_OO():\n    \"\"\"Check that we don't raise a warning if the code is executed with -OO.\n\n    Non-regression test for:\n    https://github.com/scikit-learn-contrib/imbalanced-learn/issues/945\n    \"\"\"\n    instance = 
cls(param_1=\"xxx\", param_2=\"yyy\")\n    instance.__doc__ = None  # simulate -OO\n\n    instance = Substitution(param_1=\"xxx\", param_2=\"yyy\")(instance)\n\n    assert instance.__doc__ is None\n"
  },
  {
    "path": "imblearn/utils/tests/test_estimator_checks.py",
    "content": "import numpy as np\nimport pytest\nfrom sklearn.base import BaseEstimator\nfrom sklearn.utils.multiclass import check_classification_targets\nfrom sklearn_compat.utils.validation import validate_data\n\nfrom imblearn.base import BaseSampler\nfrom imblearn.over_sampling.base import BaseOverSampler\nfrom imblearn.utils import check_target_type as target_check\nfrom imblearn.utils.estimator_checks import (\n    check_samplers_fit,\n    check_samplers_nan,\n    check_samplers_one_label,\n    check_samplers_preserve_dtype,\n    check_samplers_sparse,\n    check_samplers_string,\n    check_target_type,\n)\n\n\nclass BaseBadSampler(BaseEstimator):\n    \"\"\"Sampler without inputs checking.\"\"\"\n\n    _sampling_type = \"bypass\"\n\n    def fit(self, X, y):\n        return self\n\n    def fit_resample(self, X, y):\n        check_classification_targets(y)\n        self.fit(X, y)\n        return X, y\n\n\nclass SamplerSingleClass(BaseSampler):\n    \"\"\"Sampler that would sample even with a single class.\"\"\"\n\n    _sampling_type = \"bypass\"\n\n    def fit_resample(self, X, y):\n        return self._fit_resample(X, y)\n\n    def _fit_resample(self, X, y):\n        return X, y\n\n\nclass NotFittedSampler(BaseBadSampler):\n    \"\"\"Sampler without target checking.\"\"\"\n\n    def fit(self, X, y):\n        X, y = validate_data(self, X=X, y=y)\n        return self\n\n\nclass NoAcceptingSparseSampler(BaseBadSampler):\n    \"\"\"Sampler which does not accept sparse matrix.\"\"\"\n\n    def fit(self, X, y):\n        X, y = validate_data(self, X=X, y=y)\n        self.sampling_strategy_ = \"sampling_strategy_\"\n        return self\n\n\nclass NotPreservingDtypeSampler(BaseSampler):\n    _sampling_type = \"bypass\"\n\n    _parameter_constraints: dict = {\"sampling_strategy\": \"no_validation\"}\n\n    def _fit_resample(self, X, y):\n        return X.astype(np.float64), y.astype(np.int64)\n\n\nclass IndicesSampler(BaseOverSampler):\n    def _check_X_y(self, X, 
y):\n        y, binarize_y = target_check(y, indicate_one_vs_all=True)\n        X, y = validate_data(\n            self,\n            X=X,\n            y=y,\n            reset=True,\n            dtype=None,\n            ensure_all_finite=False,\n        )\n        return X, y, binarize_y\n\n    def _fit_resample(self, X, y):\n        n_max_count_class = np.bincount(y).max()\n        indices = np.random.choice(np.arange(X.shape[0]), size=n_max_count_class * 2)\n        return X[indices], y[indices]\n\n\ndef test_check_samplers_string():\n    sampler = IndicesSampler()\n    check_samplers_string(sampler.__class__.__name__, sampler)\n\n\ndef test_check_samplers_nan():\n    sampler = IndicesSampler()\n    check_samplers_nan(sampler.__class__.__name__, sampler)\n\n\nmapping_estimator_error = {\n    \"BaseBadSampler\": (AssertionError, None),\n    \"SamplerSingleClass\": (AssertionError, \"Sampler can't balance when only\"),\n    \"NotFittedSampler\": (AssertionError, \"No fitted attribute\"),\n    \"NoAcceptingSparseSampler\": (TypeError, \"dense data is required\"),\n    \"NotPreservingDtypeSampler\": (AssertionError, \"X dtype is not preserved\"),\n}\n\n\ndef _test_single_check(Estimator, check):\n    estimator = Estimator()\n    name = estimator.__class__.__name__\n    err_type, err_msg = mapping_estimator_error[name]\n    with pytest.raises(err_type, match=err_msg):\n        check(name, estimator)\n\n\ndef test_all_checks():\n    _test_single_check(BaseBadSampler, check_target_type)\n    _test_single_check(SamplerSingleClass, check_samplers_one_label)\n    _test_single_check(NotFittedSampler, check_samplers_fit)\n    _test_single_check(NoAcceptingSparseSampler, check_samplers_sparse)\n    _test_single_check(NotPreservingDtypeSampler, check_samplers_preserve_dtype)\n"
  },
  {
    "path": "imblearn/utils/tests/test_min_dependencies.py",
    "content": "\"\"\"Tests for the minimum dependencies in the README.rst file.\"\"\"\n\nimport os\nimport platform\nimport re\nfrom pathlib import Path\n\nimport pytest\nfrom packaging.requirements import Requirement\nfrom packaging.version import parse\n\nimport imblearn\n\n\n@pytest.mark.skipif(\n    platform.system() == \"Windows\" or parse(platform.python_version()) < parse(\"3.11\"),\n    reason=\"This test is enough on unix system and requires Python >= 3.11\",\n)\ndef test_min_dependencies_readme():\n    # local import to not import the file with Python < 3.11\n    import tomllib\n\n    # Test that the minimum dependencies in the README.rst file are\n    # consistent with the minimum dependencies defined at the file:\n    # pyproject.toml\n\n    pyproject_path = Path(imblearn.__path__[0]).parents[0] / \"pyproject.toml\"\n    with open(pyproject_path, \"rb\") as f:\n        pyproject_data = tomllib.load(f)\n\n    def process_requirements(requirements):\n        result = {}\n        for req in requirements:\n            req = Requirement(req)\n            for specifier in req.specifier:\n                if specifier.operator == \">=\":\n                    result[req.name] = parse(specifier.version)\n        return result\n\n    min_dependencies = process_requirements(\n        [f\"python{pyproject_data['project']['requires-python']}\"]\n    )\n    min_dependencies.update(\n        process_requirements(pyproject_data[\"project\"][\"dependencies\"])\n    )\n\n    markers = [\"docs\", \"optional\", \"tensorflow\", \"keras\", \"tests\"]\n    for marker_name in markers:\n        min_dependencies.update(\n            process_requirements(\n                pyproject_data[\"project\"][\"optional-dependencies\"][marker_name]\n            )\n        )\n\n    pattern = re.compile(\n        r\"(\\.\\. 
\\|)\"\n        + r\"(([A-Za-z]+\\-?)+)\"\n        + r\"(MinVersion\\| replace::)\"\n        + r\"( [0-9]+\\.[0-9]+(\\.[0-9]+)?)\"\n    )\n\n    readme_path = Path(imblearn.__path__[0]).parents[0]\n    readme_file = readme_path / \"README.rst\"\n\n    if not os.path.exists(readme_file):\n        # Skip the test if the README.rst file is not available.\n        # For instance, when installing imbalanced-learn from wheels\n        pytest.skip(\"The README.rst file is not available.\")\n\n    with readme_file.open(\"r\") as f:\n        for line in f:\n            matched = pattern.match(line)\n\n            if not matched:\n                continue\n\n            package, version = matched.group(2), matched.group(5)\n            package = package.lower()\n            if package == \"scikitlearn\":\n                package = \"scikit-learn\"\n\n            if package in min_dependencies:\n                version = parse(version)\n                min_version = min_dependencies[package]\n\n                assert version == min_version, f\"{package} has a mismatched version\"\n"
  },
  {
    "path": "imblearn/utils/tests/test_show_versions.py",
    "content": "\"\"\"Test for the show_versions helper. Based on the sklearn tests.\"\"\"\n# Author: Alexander L. Hayes <hayesall@iu.edu>\n# License: MIT\n\nfrom imblearn.utils._show_versions import _get_deps_info, show_versions\n\n\ndef test_get_deps_info():\n    _deps_info = _get_deps_info()\n    assert \"pip\" in _deps_info\n    assert \"setuptools\" in _deps_info\n    assert \"imbalanced-learn\" in _deps_info\n    assert \"scikit-learn\" in _deps_info\n    assert \"numpy\" in _deps_info\n    assert \"scipy\" in _deps_info\n    assert \"Cython\" in _deps_info\n    assert \"pandas\" in _deps_info\n    assert \"joblib\" in _deps_info\n\n\ndef test_show_versions_default(capsys):\n    show_versions()\n    out, err = capsys.readouterr()\n    assert \"python\" in out\n    assert \"executable\" in out\n    assert \"machine\" in out\n    assert \"pip\" in out\n    assert \"setuptools\" in out\n    assert \"imbalanced-learn\" in out\n    assert \"scikit-learn\" in out\n    assert \"numpy\" in out\n    assert \"scipy\" in out\n    assert \"Cython\" in out\n    assert \"pandas\" in out\n    assert \"keras\" in out\n    assert \"tensorflow\" in out\n    assert \"joblib\" in out\n\n\ndef test_show_versions_github(capsys):\n    show_versions(github=True)\n    out, err = capsys.readouterr()\n    assert \"<details><summary>System, Dependency Information</summary>\" in out\n    assert \"**System Information**\" in out\n    assert \"* python\" in out\n    assert \"* executable\" in out\n    assert \"* machine\" in out\n    assert \"**Python Dependencies**\" in out\n    assert \"* pip\" in out\n    assert \"* setuptools\" in out\n    assert \"* imbalanced-learn\" in out\n    assert \"* scikit-learn\" in out\n    assert \"* numpy\" in out\n    assert \"* scipy\" in out\n    assert \"* Cython\" in out\n    assert \"* pandas\" in out\n    assert \"* keras\" in out\n    assert \"* tensorflow\" in out\n    assert \"* joblib\" in out\n    assert \"</details>\" in out\n"
  },
  {
    "path": "imblearn/utils/tests/test_testing.py",
    "content": "\"\"\"Test for the testing module\"\"\"\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Christos Aridas\n# License: MIT\n\nimport numpy as np\nimport pytest\nfrom sklearn.neighbors._base import KNeighborsMixin\n\nfrom imblearn.base import SamplerMixin\nfrom imblearn.utils.testing import _CustomNearestNeighbors, all_estimators\n\n\ndef test_all_estimators():\n    # check if the filtering is working with a list or a single string\n    type_filter = \"sampler\"\n    all_estimators(type_filter=type_filter)\n    type_filter = [\"sampler\"]\n    estimators = all_estimators(type_filter=type_filter)\n    for estimator in estimators:\n        # check that all estimators are sampler\n        assert issubclass(estimator[1], SamplerMixin)\n\n    # check that an error is raised when the type is unknown\n    type_filter = \"rnd\"\n    with pytest.raises(ValueError, match=\"Parameter type_filter must be 'sampler'\"):\n        all_estimators(type_filter=type_filter)\n\n\ndef test_custom_nearest_neighbors():\n    \"\"\"Check that our custom nearest neighbors can be used for our internal\n    duck-typing.\"\"\"\n\n    neareat_neighbors = _CustomNearestNeighbors(n_neighbors=3)\n\n    assert not isinstance(neareat_neighbors, KNeighborsMixin)\n    assert hasattr(neareat_neighbors, \"kneighbors\")\n    assert hasattr(neareat_neighbors, \"kneighbors_graph\")\n\n    rng = np.random.RandomState(42)\n    X = rng.randn(150, 3)\n    y = rng.randint(0, 2, 150)\n    neareat_neighbors.fit(X, y)\n\n    distances, indices = neareat_neighbors.kneighbors(X)\n    assert distances.shape == (150, 3)\n    assert indices.shape == (150, 3)\n    np.testing.assert_allclose(distances[:, 0], 0.0)\n    np.testing.assert_allclose(indices[:, 0], np.arange(150))\n"
  },
  {
    "path": "imblearn/utils/tests/test_validation.py",
    "content": "\"\"\"Test for the validation helper\"\"\"\n# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n#          Christos Aridas\n# License: MIT\n\nfrom collections import Counter, OrderedDict\n\nimport numpy as np\nimport pytest\nfrom sklearn.cluster import KMeans\nfrom sklearn.neighbors import NearestNeighbors\nfrom sklearn.neighbors._base import KNeighborsMixin\nfrom sklearn.utils._testing import assert_array_equal\n\nfrom imblearn.utils import (\n    check_neighbors_object,\n    check_sampling_strategy,\n    check_target_type,\n)\nfrom imblearn.utils._validation import (\n    ArraysTransformer,\n    _deprecate_positional_args,\n    _is_neighbors_object,\n)\nfrom imblearn.utils.testing import _CustomNearestNeighbors\n\nmulticlass_target = np.array([1] * 50 + [2] * 100 + [3] * 25)\nbinary_target = np.array([1] * 25 + [0] * 100)\n\n\ndef test_check_neighbors_object():\n    name = \"n_neighbors\"\n    n_neighbors = 1\n    estimator = check_neighbors_object(name, n_neighbors)\n    assert issubclass(type(estimator), KNeighborsMixin)\n    assert estimator.n_neighbors == 1\n    estimator = check_neighbors_object(name, n_neighbors, 1)\n    assert issubclass(type(estimator), KNeighborsMixin)\n    assert estimator.n_neighbors == 2\n    estimator = NearestNeighbors(n_neighbors=n_neighbors)\n    estimator_cloned = check_neighbors_object(name, estimator)\n    assert estimator.n_neighbors == estimator_cloned.n_neighbors\n    estimator = _CustomNearestNeighbors()\n    estimator_cloned = check_neighbors_object(name, estimator)\n    assert isinstance(estimator_cloned, _CustomNearestNeighbors)\n\n\n@pytest.mark.parametrize(\n    \"target, output_target\",\n    [\n        (np.array([0, 1, 1]), np.array([0, 1, 1])),\n        (np.array([0, 1, 2]), np.array([0, 1, 2])),\n        (np.array([[0, 1], [1, 0]]), np.array([1, 0])),\n    ],\n)\ndef test_check_target_type(target, output_target):\n    converted_target = check_target_type(target.astype(int))\n    
assert_array_equal(converted_target, output_target.astype(int))\n\n\n@pytest.mark.parametrize(\n    \"target, output_target, is_ova\",\n    [\n        (np.array([0, 1, 1]), np.array([0, 1, 1]), False),\n        (np.array([0, 1, 2]), np.array([0, 1, 2]), False),\n        (np.array([[0, 1], [1, 0]]), np.array([1, 0]), True),\n    ],\n)\ndef test_check_target_type_ova(target, output_target, is_ova):\n    converted_target, binarize_target = check_target_type(\n        target.astype(int), indicate_one_vs_all=True\n    )\n    assert_array_equal(converted_target, output_target.astype(int))\n    assert binarize_target == is_ova\n\n\ndef test_check_sampling_strategy_warning():\n    msg = \"dict for cleaning methods is not supported\"\n    with pytest.raises(ValueError, match=msg):\n        check_sampling_strategy({1: 0, 2: 0, 3: 0}, multiclass_target, \"clean-sampling\")\n\n\n@pytest.mark.parametrize(\n    \"ratio, y, type, err_msg\",\n    [\n        (\n            0.5,\n            binary_target,\n            \"clean-sampling\",\n            \"'clean-sampling' methods do let the user specify the sampling ratio\",  # noqa\n        ),\n        (\n            0.1,\n            np.array([0] * 10 + [1] * 20),\n            \"over-sampling\",\n            \"remove samples from the minority class while trying to generate new\",  # noqa\n        ),\n        (\n            0.1,\n            np.array([0] * 10 + [1] * 20),\n            \"under-sampling\",\n            \"generate new sample in the majority class while trying to remove\",\n        ),\n    ],\n)\ndef test_check_sampling_strategy_float_error(ratio, y, type, err_msg):\n    with pytest.raises(ValueError, match=err_msg):\n        check_sampling_strategy(ratio, y, type)\n\n\ndef test_check_sampling_strategy_error():\n    with pytest.raises(ValueError, match=\"'sampling_type' should be one of\"):\n        check_sampling_strategy(\"auto\", np.array([1, 2, 3]), \"rnd\")\n\n    error_regex = \"The target 'y' needs to have more 
than 1 class.\"\n    with pytest.raises(ValueError, match=error_regex):\n        check_sampling_strategy(\"auto\", np.ones((10,)), \"over-sampling\")\n\n    error_regex = \"When 'sampling_strategy' is a string, it needs to be one of\"\n    with pytest.raises(ValueError, match=error_regex):\n        check_sampling_strategy(\"rnd\", np.array([1, 2, 3]), \"over-sampling\")\n\n\n@pytest.mark.parametrize(\n    \"sampling_strategy, sampling_type, err_msg\",\n    [\n        (\"majority\", \"over-sampling\", \"over-sampler\"),\n        (\"minority\", \"under-sampling\", \"under-sampler\"),\n    ],\n)\ndef test_check_sampling_strategy_error_wrong_string(\n    sampling_strategy, sampling_type, err_msg\n):\n    with pytest.raises(\n        ValueError,\n        match=f\"'{sampling_strategy}' cannot be used with {err_msg}\",\n    ):\n        check_sampling_strategy(sampling_strategy, np.array([1, 2, 3]), sampling_type)\n\n\n@pytest.mark.parametrize(\n    \"sampling_strategy, sampling_method\",\n    [\n        ({10: 10}, \"under-sampling\"),\n        ({10: 10}, \"over-sampling\"),\n        ([10], \"clean-sampling\"),\n    ],\n)\ndef test_sampling_strategy_class_target_unknown(sampling_strategy, sampling_method):\n    y = np.array([1] * 50 + [2] * 100 + [3] * 25)\n    with pytest.raises(ValueError, match=\"are not present in the data.\"):\n        check_sampling_strategy(sampling_strategy, y, sampling_method)\n\n\ndef test_sampling_strategy_dict_error():\n    y = np.array([1] * 50 + [2] * 100 + [3] * 25)\n    sampling_strategy = {1: -100, 2: 50, 3: 25}\n    with pytest.raises(ValueError, match=\"in a class cannot be negative.\"):\n        check_sampling_strategy(sampling_strategy, y, \"under-sampling\")\n    sampling_strategy = {1: 45, 2: 100, 3: 70}\n    error_regex = (\n        \"With over-sampling methods, the number of samples in a\"\n        \" class should be greater or equal to the original number\"\n        \" of samples. 
Originally, there is 50 samples and 45\"\n        \" samples are asked.\"\n    )\n    with pytest.raises(ValueError, match=error_regex):\n        check_sampling_strategy(sampling_strategy, y, \"over-sampling\")\n\n    error_regex = (\n        \"With under-sampling methods, the number of samples in a\"\n        \" class should be less or equal to the original number of\"\n        \" samples. Originally, there is 25 samples and 70 samples\"\n        \" are asked.\"\n    )\n    with pytest.raises(ValueError, match=error_regex):\n        check_sampling_strategy(sampling_strategy, y, \"under-sampling\")\n\n\n@pytest.mark.parametrize(\"sampling_strategy\", [-10, 10])\ndef test_sampling_strategy_float_error_not_in_range(sampling_strategy):\n    y = np.array([1] * 50 + [2] * 100)\n    with pytest.raises(ValueError, match=\"it should be in the range\"):\n        check_sampling_strategy(sampling_strategy, y, \"under-sampling\")\n\n\ndef test_sampling_strategy_float_error_not_binary():\n    y = np.array([1] * 50 + [2] * 100 + [3] * 25)\n    with pytest.raises(ValueError, match=\"the type of target is binary\"):\n        sampling_strategy = 0.5\n        check_sampling_strategy(sampling_strategy, y, \"under-sampling\")\n\n\n@pytest.mark.parametrize(\"sampling_method\", [\"over-sampling\", \"under-sampling\"])\ndef test_sampling_strategy_list_error_not_clean_sampling(sampling_method):\n    y = np.array([1] * 50 + [2] * 100 + [3] * 25)\n    with pytest.raises(ValueError, match=\"cannot be a list for samplers\"):\n        sampling_strategy = [1, 2, 3]\n        check_sampling_strategy(sampling_strategy, y, sampling_method)\n\n\ndef _sampling_strategy_func(y):\n    # this function could create an equal number of samples\n    target_stats = Counter(y)\n    n_samples = max(target_stats.values())\n    return {key: int(n_samples) for key in target_stats.keys()}\n\n\n@pytest.mark.parametrize(\n    \"sampling_strategy, sampling_type, expected_sampling_strategy, target\",\n    [\n        
(\"auto\", \"under-sampling\", {1: 25, 2: 25}, multiclass_target),\n        (\"auto\", \"clean-sampling\", {1: 25, 2: 25}, multiclass_target),\n        (\"auto\", \"over-sampling\", {1: 50, 3: 75}, multiclass_target),\n        (\"all\", \"over-sampling\", {1: 50, 2: 0, 3: 75}, multiclass_target),\n        (\"all\", \"under-sampling\", {1: 25, 2: 25, 3: 25}, multiclass_target),\n        (\"all\", \"clean-sampling\", {1: 25, 2: 25, 3: 25}, multiclass_target),\n        (\"majority\", \"under-sampling\", {2: 25}, multiclass_target),\n        (\"majority\", \"clean-sampling\", {2: 25}, multiclass_target),\n        (\"minority\", \"over-sampling\", {3: 75}, multiclass_target),\n        (\"not minority\", \"over-sampling\", {1: 50, 2: 0}, multiclass_target),\n        (\"not minority\", \"under-sampling\", {1: 25, 2: 25}, multiclass_target),\n        (\"not minority\", \"clean-sampling\", {1: 25, 2: 25}, multiclass_target),\n        (\"not majority\", \"over-sampling\", {1: 50, 3: 75}, multiclass_target),\n        (\"not majority\", \"under-sampling\", {1: 25, 3: 25}, multiclass_target),\n        (\"not majority\", \"clean-sampling\", {1: 25, 3: 25}, multiclass_target),\n        (\n            {1: 70, 2: 100, 3: 70},\n            \"over-sampling\",\n            {1: 20, 2: 0, 3: 45},\n            multiclass_target,\n        ),\n        (\n            {1: 30, 2: 45, 3: 25},\n            \"under-sampling\",\n            {1: 30, 2: 45, 3: 25},\n            multiclass_target,\n        ),\n        ([1], \"clean-sampling\", {1: 25}, multiclass_target),\n        (\n            _sampling_strategy_func,\n            \"over-sampling\",\n            {1: 50, 2: 0, 3: 75},\n            multiclass_target,\n        ),\n        (0.5, \"over-sampling\", {1: 25}, binary_target),\n        (0.5, \"under-sampling\", {0: 50}, binary_target),\n    ],\n)\ndef test_check_sampling_strategy(\n    sampling_strategy, sampling_type, expected_sampling_strategy, target\n):\n    sampling_strategy_ = 
check_sampling_strategy(\n        sampling_strategy, target, sampling_type\n    )\n    assert sampling_strategy_ == expected_sampling_strategy\n\n\ndef test_sampling_strategy_callable_args():\n    y = np.array([1] * 50 + [2] * 100 + [3] * 25)\n    multiplier = {1: 1.5, 2: 1, 3: 3}\n\n    def sampling_strategy_func(y, multiplier):\n        \"\"\"samples such that each class will be affected by the multiplier.\"\"\"\n        target_stats = Counter(y)\n        return {\n            key: int(values * multiplier[key]) for key, values in target_stats.items()\n        }\n\n    sampling_strategy_ = check_sampling_strategy(\n        sampling_strategy_func, y, \"over-sampling\", multiplier=multiplier\n    )\n    assert sampling_strategy_ == {1: 25, 2: 0, 3: 50}\n\n\n@pytest.mark.parametrize(\n    \"sampling_strategy, sampling_type, expected_result\",\n    [\n        (\n            {3: 25, 1: 25, 2: 25},\n            \"under-sampling\",\n            OrderedDict({1: 25, 2: 25, 3: 25}),\n        ),\n        (\n            {3: 100, 1: 100, 2: 100},\n            \"over-sampling\",\n            OrderedDict({1: 50, 2: 0, 3: 75}),\n        ),\n    ],\n)\ndef test_sampling_strategy_check_order(\n    sampling_strategy, sampling_type, expected_result\n):\n    # We pass on purpose a non sorted dictionary and check that the resulting\n    # dictionary is sorted. 
Refer to issue #428.\n    y = np.array([1] * 50 + [2] * 100 + [3] * 25)\n    sampling_strategy_ = check_sampling_strategy(sampling_strategy, y, sampling_type)\n    assert sampling_strategy_ == expected_result\n\n\ndef test_arrays_transformer_plain_list():\n    X = np.array([[0, 0], [1, 1]])\n    y = np.array([[0, 0], [1, 1]])\n\n    arrays_transformer = ArraysTransformer(X.tolist(), y.tolist())\n    X_res, y_res = arrays_transformer.transform(X, y)\n    assert isinstance(X_res, list)\n    assert isinstance(y_res, list)\n\n\ndef test_arrays_transformer_numpy():\n    X = np.array([[0, 0], [1, 1]])\n    y = np.array([[0, 0], [1, 1]])\n\n    arrays_transformer = ArraysTransformer(X, y)\n    X_res, y_res = arrays_transformer.transform(X, y)\n    assert isinstance(X_res, np.ndarray)\n    assert isinstance(y_res, np.ndarray)\n\n\ndef test_arrays_transformer_pandas():\n    pd = pytest.importorskip(\"pandas\")\n\n    X = np.array([[0, 0], [1, 1]])\n    y = np.array([0, 1])\n\n    X_df = pd.DataFrame(X, columns=[\"a\", \"b\"])\n    X_df = X_df.astype(int)\n    y_df = pd.DataFrame(y, columns=[\"target\"])\n    y_df = y_df.astype(int)\n    y_s = pd.Series(y, name=\"target\", dtype=int)\n\n    # DataFrame and DataFrame case\n    arrays_transformer = ArraysTransformer(X_df, y_df)\n    X_res, y_res = arrays_transformer.transform(X, y)\n    assert isinstance(X_res, pd.DataFrame)\n    assert_array_equal(X_res.columns, X_df.columns)\n    assert_array_equal(X_res.dtypes, X_df.dtypes)\n    assert isinstance(y_res, pd.DataFrame)\n    assert_array_equal(y_res.columns, y_df.columns)\n    assert_array_equal(y_res.dtypes, y_df.dtypes)\n\n    # DataFrames and Series case\n    arrays_transformer = ArraysTransformer(X_df, y_s)\n    _, y_res = arrays_transformer.transform(X, y)\n    assert isinstance(y_res, pd.Series)\n    assert_array_equal(y_res.name, y_s.name)\n    assert_array_equal(y_res.dtype, y_s.dtype)\n\n\ndef test_deprecate_positional_args_warns_for_function():\n    
@_deprecate_positional_args\n    def f1(a, b, *, c=1, d=1):\n        pass\n\n    with pytest.warns(FutureWarning, match=r\"Pass c=3 as keyword args\"):\n        f1(1, 2, 3)\n\n    with pytest.warns(FutureWarning, match=r\"Pass c=3, d=4 as keyword args\"):\n        f1(1, 2, 3, 4)\n\n    @_deprecate_positional_args\n    def f2(a=1, *, b=1, c=1, d=1):\n        pass\n\n    with pytest.warns(FutureWarning, match=r\"Pass b=2 as keyword args\"):\n        f2(1, 2)\n\n    # The * is placed before a keyword-only argument without a default value\n    @_deprecate_positional_args\n    def f3(a, *, b, c=1, d=1):\n        pass\n\n    with pytest.warns(FutureWarning, match=r\"Pass b=2 as keyword args\"):\n        f3(1, 2)\n\n\n@pytest.mark.parametrize(\n    \"estimator, is_neighbor_estimator\", [(NearestNeighbors(), True), (KMeans(), False)]\n)\ndef test_is_neighbors_object(estimator, is_neighbor_estimator):\n    assert _is_neighbors_object(estimator) == is_neighbor_estimator\n"
  },
  {
    "path": "maint_tools/test_docstring.py",
    "content": "import importlib\nimport inspect\nimport pkgutil\nimport re\nfrom inspect import signature\n\nimport pytest\n\nimport imblearn\nfrom imblearn.utils.testing import all_estimators\n\nnumpydoc_validation = pytest.importorskip(\"numpydoc.validate\")\n\n# List of whitelisted modules and methods; regexp are supported.\n# These docstrings will fail because they are inheriting from scikit-learn\nDOCSTRING_WHITELIST = [\n    \"ADASYN$\",\n    \"ADASYN.\",\n    \"AllKNN$\",\n    \"AllKNN.\",\n    \"BalancedBaggingClassifier$\",\n    \"BalancedBaggingClassifier.\",\n    \"BalancedRandomForestClassifier$\",\n    \"BalancedRandomForestClassifier.\",\n    \"ClusterCentroids$\",\n    \"ClusterCentroids.\",\n    \"CondensedNearestNeighbour$\",\n    \"CondensedNearestNeighbour.\",\n    \"EasyEnsembleClassifier$\",\n    \"EasyEnsembleClassifier.\",\n    \"EditedNearestNeighbours$\",\n    \"EditedNearestNeighbours.\",\n    \"FunctionSampler$\",\n    \"FunctionSampler.\",\n    \"InstanceHardnessThreshold$\",\n    \"InstanceHardnessThreshold.\",\n    \"SMOTE$\",\n    \"SMOTE.\",\n    \"NearMiss$\",\n    \"NearMiss.\",\n    \"NeighbourhoodCleaningRule$\",\n    \"NeighbourhoodCleaningRule.\",\n    \"OneSidedSelection$\",\n    \"OneSidedSelection.\",\n    \"Pipeline$\",\n    \"Pipeline.\",\n    \"RUSBoostClassifier$\",\n    \"RUSBoostClassifier.\",\n    \"RandomOverSampler$\",\n    \"RandomOverSampler.\",\n    \"RandomUnderSampler$\",\n    \"RandomUnderSampler.\",\n    \"TomekLinks$\",\n    \"TomekLinks\",\n    \"ValueDifferenceMetric$\",\n    \"ValueDifferenceMetric.\",\n]\n\nFUNCTION_DOCSTRING_IGNORE_LIST = [\n    \"imblearn.tensorflow._generator.balanced_batch_generator\",\n]\nFUNCTION_DOCSTRING_IGNORE_LIST = set(FUNCTION_DOCSTRING_IGNORE_LIST)\n\n\ndef get_all_methods():\n    estimators = all_estimators()\n    for name, Estimator in estimators:\n        if name.startswith(\"_\"):\n            # skip private classes\n            continue\n        methods = []\n        
for name in dir(Estimator):\n            if name.startswith(\"_\"):\n                continue\n            method_obj = getattr(Estimator, name)\n            if hasattr(method_obj, \"__call__\") or isinstance(method_obj, property):\n                methods.append(name)\n        methods.append(None)\n\n        for method in sorted(methods, key=lambda x: str(x)):\n            yield Estimator, method\n\n\ndef _is_checked_function(item):\n    if not inspect.isfunction(item):\n        return False\n\n    if item.__name__.startswith(\"_\"):\n        return False\n\n    mod = item.__module__\n    if not mod.startswith(\"imblearn.\") or mod.endswith(\"estimator_checks\"):\n        return False\n\n    return True\n\n\ndef get_all_functions_names():\n    \"\"\"Get all public functions defined in the imblearn module\"\"\"\n    modules_to_ignore = {\n        \"tests\",\n        \"estimator_checks\",\n    }\n\n    all_functions_names = set()\n    for module_finder, module_name, ispkg in pkgutil.walk_packages(\n        path=imblearn.__path__, prefix=\"imblearn.\"\n    ):\n        module_parts = module_name.split(\".\")\n        if (\n            any(part in modules_to_ignore for part in module_parts)\n            or \"._\" in module_name\n        ):\n            continue\n\n        module = importlib.import_module(module_name)\n        functions = inspect.getmembers(module, _is_checked_function)\n        for name, func in functions:\n            full_name = f\"{func.__module__}.{func.__name__}\"\n            all_functions_names.add(full_name)\n\n    return sorted(all_functions_names)\n\n\ndef filter_errors(errors, method, Estimator=None):\n    \"\"\"\n    Ignore some errors based on the method type.\n\n    These rules are specific for scikit-learn.\"\"\"\n    for code, message in errors:\n        # We ignore the following error codes,\n        #  - RT02: The first line of the Returns section\n        #    should contain only the type, ..\n        #   (as we may need to refer to the name 
of the returned\n        #    object)\n        #  - GL01: Docstring text (summary) should start in the line\n        #    immediately after the opening quotes (not in the same line,\n        #    or leaving a blank line in between)\n        #  - GL02: If there's a blank line, it should be before the\n        #    first line of the Returns section, not after (it allows to have\n        #    short docstrings for properties).\n\n        if code in [\"RT02\", \"GL01\", \"GL02\"]:\n            continue\n\n        # Ignore PR02: Unknown parameters for properties. We sometimes use\n        # properties for ducktyping, i.e. SGDClassifier.predict_proba\n        if code == \"PR02\" and Estimator is not None and method is not None:\n            method_obj = getattr(Estimator, method)\n            if isinstance(method_obj, property):\n                continue\n\n        # Following codes are only taken into account for the\n        # top level class docstrings:\n        #  - ES01: No extended summary found\n        #  - SA01: See Also section not found\n        #  - EX01: No examples section found\n\n        if method is not None and code in [\"EX01\", \"SA01\", \"ES01\"]:\n            continue\n        yield code, message\n\n\ndef repr_errors(res, estimator=None, method: str | None = None) -> str:\n    \"\"\"Pretty print original docstring and the obtained errors\n\n    Parameters\n    ----------\n    res : dict\n        result of numpydoc.validate.validate\n    estimator : {estimator, None}\n        estimator object or None\n    method : str\n        if estimator is not None, either the method name or None.\n\n    Returns\n    -------\n    str\n       String representation of the error.\n    \"\"\"\n    if method is None:\n        if hasattr(estimator, \"__init__\"):\n            method = \"__init__\"\n        elif estimator is None:\n            raise ValueError(\"At least one of estimator, method should be provided\")\n        else:\n            raise 
NotImplementedError\n\n    if estimator is not None:\n        obj = getattr(estimator, method)\n        try:\n            obj_signature = signature(obj)\n        except TypeError:\n            # In particular we can't parse the signature of properties\n            obj_signature = (\n                \"\\nParsing of the method signature failed, \"\n                \"possibly because this is a property.\"\n            )\n\n        obj_name = estimator.__name__ + \".\" + method\n    else:\n        obj_signature = \"\"\n        obj_name = method\n\n    msg = \"\\n\\n\" + \"\\n\\n\".join(\n        [\n            str(res[\"file\"]),\n            obj_name + str(obj_signature),\n            res[\"docstring\"],\n            \"# Errors\",\n            \"\\n\".join(f\" - {code}: {message}\" for code, message in res[\"errors\"]),\n        ]\n    )\n    return msg\n\n\n@pytest.mark.parametrize(\"function_name\", get_all_functions_names())\ndef test_function_docstring(function_name, request):\n    \"\"\"Check function docstrings using numpydoc.\"\"\"\n    if function_name in FUNCTION_DOCSTRING_IGNORE_LIST:\n        request.applymarker(\n            pytest.mark.xfail(run=False, reason=\"TODO pass numpydoc validation\")\n        )\n\n    res = numpydoc_validation.validate(function_name)\n\n    res[\"errors\"] = list(filter_errors(res[\"errors\"], method=\"function\"))\n\n    if res[\"errors\"]:\n        msg = repr_errors(res, method=f\"Tested function: {function_name}\")\n\n        raise ValueError(msg)\n\n\n@pytest.mark.parametrize(\"Estimator, method\", get_all_methods())\ndef test_docstring(Estimator, method, request):\n    base_import_path = Estimator.__module__\n    import_path = [base_import_path, Estimator.__name__]\n    if method is not None:\n        import_path.append(method)\n\n    import_path = \".\".join(import_path)\n\n    if not any(re.search(regex, import_path) for regex in DOCSTRING_WHITELIST):\n        request.applymarker(\n            pytest.mark.xfail(run=False, 
reason=\"TODO pass numpydoc validation\")\n        )\n\n    res = numpydoc_validation.validate(import_path)\n\n    res[\"errors\"] = list(filter_errors(res[\"errors\"], method))\n\n    if res[\"errors\"]:\n        msg = repr_errors(res, Estimator, method)\n\n        raise ValueError(msg)\n\n\nif __name__ == \"__main__\":\n    import argparse\n    import sys\n\n    parser = argparse.ArgumentParser(description=\"Validate docstring with numpydoc.\")\n    parser.add_argument(\"import_path\", help=\"Import path to validate\")\n\n    args = parser.parse_args()\n\n    res = numpydoc_validation.validate(args.import_path)\n\n    import_path_sections = args.import_path.split(\".\")\n    # When applied to classes, detect class method. For functions\n    # method = None.\n    # TODO: this detection can be improved. Currently we assume that we have\n    # class methods if the second-to-last path element is in camel case.\n    if len(import_path_sections) >= 2 and re.match(\n        r\"(?:[A-Z][a-z]*)+\", import_path_sections[-2]\n    ):\n        method = import_path_sections[-1]\n    else:\n        method = None\n\n    res[\"errors\"] = list(filter_errors(res[\"errors\"], method))\n\n    if res[\"errors\"]:\n        msg = repr_errors(res, method=args.import_path)\n\n        print(msg)\n        sys.exit(1)\n    else:\n        print(f\"All docstring checks passed for {args.import_path}!\")\n"
  },
  {
    "path": "pyproject.toml",
    "content": "[build-system]\nrequires = [\"setuptools>=71\", \"setuptools_scm[toml]>=8\"]\nbuild-backend = \"setuptools.build_meta\"\n\n[project]\nname = \"imbalanced-learn\"\ndynamic = [\"version\", \"readme\"]\ndescription = \"Toolbox for imbalanced dataset in machine learning\"\nauthors = [\n    { name=\"G. Lemaitre\", email=\"g.lemaitre58@gmail.com\"},\n    { name=\"C. Aridas\", email=\"ichkoar@gmail.com\"},\n]\nclassifiers = [\n    \"Development Status :: 5 - Production/Stable\",\n    \"Environment :: Console\",\n    \"Intended Audience :: Science/Research\",\n    \"License :: OSI Approved :: MIT License\",\n    \"Operating System :: OS Independent\",\n    \"Programming Language :: Python :: 3.10\",\n    \"Programming Language :: Python :: 3.11\",\n    \"Programming Language :: Python :: 3.12\",\n    \"Programming Language :: Python :: 3.13\",\n    \"Programming Language :: Python :: 3.14\",\n    \"Topic :: Scientific/Engineering\",\n    \"Topic :: Software Development :: Libraries\",\n]\nrequires-python = \">=3.10\"\ndependencies = [\n   \"numpy>=1.25.2,<3\",\n   \"scipy>=1.11.4,<2\",\n   \"scikit-learn>=1.4.2,<2\",\n   \"sklearn-compat>=0.1.5,<0.2\",\n   \"joblib>=1.2.0,<2\",\n   \"threadpoolctl>=2.0.0,<4\",\n]\n\n[tool.setuptools.dynamic]\nversion = { file = \"imblearn/VERSION.txt\" }\nreadme = { file = \"README.rst\" }\n\n[project.optional-dependencies]\ndev = [\n    \"ipykernel\",\n    \"ipython\",\n    \"jupyterlab\",\n]\ndocs = [\n    \"pandas>=2.0.3,<3\",\n    \"tensorflow>=2.16.1,<3\",\n    \"matplotlib>=3.7.3,<4\",\n    \"seaborn>=0.12.2,<1\",\n    \"memory_profiler>=0.61.0,<1\",\n    \"numpydoc>=1.5.0,<2\",\n    \"sphinx>=8.0.2,<9\",\n    \"sphinx-gallery>=0.13.0,<1\",\n    \"sphinxcontrib-bibtex>=2.6.3,<3\",\n    \"sphinx-copybutton>=0.5.2,<1\",\n    \"pydata-sphinx-theme>=0.15.4,<1\",\n    \"sphinx-design>=0.6.1,<1\",\n]\nlinters = [\n    \"black==23.3.0\",\n    \"ruff==0.14.2\",\n    \"pre-commit\",\n]\noptional = [\n    
\"pandas>=2.0.3,<3\",\n]\ntensorflow = [\n    \"tensorflow>=2.16.1,<3\",\n]\nkeras = [\n    \"keras>=3.3.3,<4\",\n]\ntests = [\n    \"packaging>=23.2,<25\",\n    \"pytest>=7.2.2,<9\",\n    \"pytest-cov>=4.1.0,<6\",\n    \"pytest-xdist>=3.5.0,<4\",\n]\n\n[project.urls]\nHomepage = \"https://imbalanced-learn.org/\"\nSource = \"https://github.com/scikit-learn-contrib/imbalanced-learn\"\nIssues = \"https://github.com/scikit-learn-contrib/imbalanced-learn/issues\"\n\n[tool.setuptools]\npackages = [\"imblearn\"]\n\n[tool.pixi.workspace]\nchannels = [\"conda-forge\"]\nplatforms = [\"linux-64\", \"osx-arm64\", \"osx-64\", \"win-64\"]\n\n[tool.pixi.dependencies]\nnumpy = \">=1.25.2,<3\"\nscipy = \">=1.11.4,<2\"\nscikit-learn = \">=1.4.2,<2\"\nsklearn-compat = \">=0.1.5,<0.2\"\njoblib = \">=1.2.0,<2\"\nthreadpoolctl = \">=2.0.0,<4\"\n\n[tool.pixi.feature.dev.dependencies]\nipykernel = \"*\"\nipython = \"*\"\njupyterlab = \"*\"\npip = \"*\"\ntwine = \"*\"\n\n[tool.pixi.feature.dev.pypi-dependencies]\n\"build\" = \"*\"\n\n[tool.pixi.feature.docs.dependencies]\nmatplotlib = \">=3.7.3,<4\"\nseaborn = \">=0.12.2,<1\"\nmemory_profiler = \">=0.61.0,<1\"\nnumpydoc = \">=1.5.0,<2\"\nsphinx = \">=8.0.2,<9\"\nsphinx-gallery = \">=0.13.0,<1\"\nsphinxcontrib-bibtex = \">=2.4.1,<3\"\nsphinx-copybutton = \">=0.5.2,<1\"\npydata-sphinx-theme = \">=0.15.4,<1\"\nsphinx-design = \">=0.6.1,<1\"\n\n[tool.pixi.feature.linters.dependencies]\nblack = \"==23.3.0\"\nruff = \"==0.14.2\"\npre-commit = \"*\"\n\n[tool.pixi.feature.optional.dependencies]\npandas = \">=2.0.3,<3\"\n\n[tool.pixi.feature.keras]\nplatforms = [\"linux-64\", \"osx-arm64\", \"osx-64\"]\n\n[tool.pixi.feature.keras.dependencies]\nkeras = \">=3.3.3,<4\"\n\n[tool.pixi.feature.tensorflow]\nplatforms = [\"linux-64\", \"osx-arm64\", \"osx-64\"]\n\n[tool.pixi.feature.tensorflow.dependencies]\ntensorflow = \">=2.16.1,<3\"\nkeras = \">=3.3.3,<3.9\"\n\n[tool.pixi.feature.min-dependencies.dependencies]\nnumpy = \"==1.25.2\"\nscipy = 
\"==1.11.4\"\nscikit-learn = \"==1.4.2\"\njoblib = \"==1.2.0\"\nthreadpoolctl = \"==2.0.0\"\n\n[tool.pixi.feature.min-optional-dependencies.dependencies]\npandas = \"==2.0.3\"\n\n[tool.pixi.feature.min-keras]\nplatforms = [\"linux-64\", \"osx-arm64\", \"osx-64\"]\n\n[tool.pixi.feature.min-keras.dependencies]\nkeras = \"==3.3.3\"\n\n[tool.pixi.feature.min-tensorflow]\nplatforms = [\"linux-64\", \"osx-arm64\", \"osx-64\"]\n\n[tool.pixi.feature.min-tensorflow.dependencies]\ntensorflow = \"==2.16.1\"\nkeras = \"==3.3.3\"\n\n[tool.pixi.feature.sklearn-1-4.dependencies]\nscikit-learn = \"~=1.4.0\"\n\n[tool.pixi.feature.sklearn-1-5.dependencies]\nscikit-learn = \"~=1.5.0\"\n\n[tool.pixi.feature.sklearn-1-6.dependencies]\nscikit-learn = \"~=1.6.0\"\n\n[tool.pixi.feature.scipy-1-15.dependencies]\n# for scikit-learn < 1.7, scipy > 1.15 is raising a deprecation warning\nscipy = \"~=1.15.0\"\n\n[tool.pixi.feature.py310.dependencies]\npython = \"~=3.10.0\"\n\n[tool.pixi.feature.py311.dependencies]\npython = \"~=3.11.0\"\n\n[tool.pixi.feature.py312.dependencies]\npython = \"~=3.12.0\"\n\n[tool.pixi.feature.py313.dependencies]\npython = \"~=3.13.0\"\n\n[tool.pixi.feature.py314.dependencies]\npython = \"~=3.14.0\"\n\n[tool.pixi.feature.tests.dependencies]\npackaging = \">=23.2,<25\"\npytest = \">=7.2.2,<9\"\npytest-cov = \">=4.1.0,<6\"\npytest-xdist = \">=3.5.0,<4\"\n\n[tool.pixi.pypi-dependencies]\nimbalanced-learn = { path = \".\", editable = true }\n\n[tool.pixi.feature.docs.tasks]\nbuild-docs = { cmd = \"make html\", cwd = \"doc\" }\nclean-docs = { cmd = \"rm -rf _build/ && rm -rf auto_examples/ && rm -rf reference/generated/\", cwd = \"doc\" }\n\n[tool.pixi.feature.linters.tasks]\nlinters = { cmd = \"pre-commit install && pre-commit run -v --all-files --show-diff-on-failure\" }\n\n[tool.pixi.feature.tests.tasks]\ntests = { cmd = \"pytest -vsl --cov=imblearn --cov-report=xml imblearn\" }\n\n[tool.pixi.environments]\nlinters = [\"linters\"]\ndocs = [\"optional\", \"docs\", 
\"tensorflow\"]\noptional = [\"optional\"]\ntests = [\"tests\", \"tensorflow\"]\ndev = [\"dev\", \"optional\", \"docs\", \"linters\", \"tests\", \"tensorflow\"]\n\nci-py310-min-dependencies = [\"py310\", \"min-dependencies\", \"tests\"]\nci-py310-min-optional-dependencies = [\"py310\", \"min-dependencies\", \"min-optional-dependencies\", \"tests\"]\nci-py310-min-keras = [\"py310\", \"min-keras\", \"tests\"]\nci-py310-min-tensorflow = [\"py310\", \"min-tensorflow\", \"tests\"]\n\nci-py311-sklearn-1-4 = [\"py311\", \"sklearn-1-4\", \"scipy-1-15\", \"tests\"]\nci-py311-sklearn-1-5 = [\"py311\", \"sklearn-1-5\", \"scipy-1-15\", \"tests\"]\nci-py312-sklearn-1-6 = [\"py312\", \"sklearn-1-6\", \"scipy-1-15\", \"tests\"]\nci-py311-latest-tensorflow = [\"py311\", \"tensorflow\", \"tests\"]\nci-py311-latest-keras = [\"py311\", \"keras\", \"tests\"]\n\nci-py314-latest-dependencies = [\"py314\", \"tests\"]\nci-py314-latest-optional-dependencies = [\"py314\", \"optional\", \"tests\"]\n\n[tool.black]\nline-length = 88\ntarget_version = ['py310', 'py311']\npreview = true\n# Exclude irrelevant directories for formatting\nexclude = '''\n/(\n    \\.eggs\n  | \\.git\n  | \\.mypy_cache\n  | \\.vscode\n  | \\.pytest_cache\n  | \\.idea\n  | build\n  | dist\n)/\n'''\n\n[tool.ruff]\n# max line length for black\nline-length = 88\ntarget-version = \"py310\"\nexclude=[\n    \".git\",\n    \"__pycache__\",\n    \"dist\",\n    \"doc/_build\",\n    \"doc/auto_examples\",\n    \"build\",\n    \"pixi.lock\",\n]\n\n[tool.ruff.lint]\n# all rules can be found here: https://beta.ruff.rs/docs/rules/\nselect = [\"E\", \"F\", \"W\", \"C4\", \"I\", \"UP\"]\nignore = [\n  # use `is` and `is not` for type comparisons\n  \"E721\",\n  # do not assign a lambda expression, use a def\n  \"E731\",\n  # do not use variables named 'l', 'O', or 'I'\n  \"E741\",\n  # unnecessary list comprehension (rewrite as a set comprehension)\n  \"C403\",\n  # unnecessary tuple literal (rewrite as a set literal)\n  \"C405\",\n  
# unnecessary `dict()` call (rewrite as a literal)\n  \"C408\",\n  # unnecessary list literal passed to `tuple()` (rewrite as a tuple literal)\n  \"C409\",\n]\n\n[tool.ruff.lint.per-file-ignores]\n# It's fine not to put the import at the top of the file in the examples\n# folder.\n\"examples/*\"=[\"E402\"]\n\"doc/conf.py\"=[\"E402\"]\n\n[tool.pytest.ini_options]\nfilterwarnings = [\n    # Turn deprecation warnings into errors\n    \"error::FutureWarning\",\n    \"error::DeprecationWarning\",\n\n    # raised by `joblib` in old versions\n    \"ignore:.*distutils Version classes are deprecated.*:DeprecationWarning\",\n]\naddopts = \"--doctest-modules --color=yes -rs\"\ndoctest_optionflags = \"NORMALIZE_WHITESPACE ELLIPSIS\"\n"
  },
  {
    "path": "references.bib",
    "content": "\n@InProceedings{\t  batista2003,\n  title\t\t= {Balancing training data for automated annotation of\n\t\t  keywords: A case study},\n  author\t= {Batista, Gustavo E. A. P. A. and Bazzan, Ana L. C. and\n\t\t  Monard, Maria Carolina},\n  booktitle\t= {Proceedings of the 2nd Brazilian Workshop on\n\t\t  Bioinformatics},\n  pages\t\t= {10--18},\n  year\t\t= {2003},\n  month\t\t= {Dec.},\n  address\t= {Rio de Janeiro, Brazil}\n}\n\n@Article{\t  batista2004,\n  title\t\t= {A study of the behavior of several methods for balancing\n\t\t  machine learning training data},\n  author\t= {Batista, Gustavo E. A. P. A. and Prati, Ronaldo C. and\n\t\t  Monard, Maria Carolina},\n  journal\t= {ACM Sigkdd Explorations Newsletter},\n  volume\t= {6},\n  number\t= {1},\n  pages\t\t= {20--29},\n  year\t\t= {2004},\n  publisher\t= {ACM}\n}\n\n@Article{\t  chawla2002,\n  title\t\t= {SMOTE: Synthetic minority over-sampling technique},\n  author\t= {Chawla, Nitesh V. and Bowyer, Kevin W. and Hall, Lawrence\n\t\t  O. and Kegelmeyer, W. 
Philip},\n  journal\t= {Journal of Artificial Intelligence Research},\n  volume\t= {16},\n  pages\t\t= {321--357},\n  year\t\t= {2002}\n}\n\n@InProceedings{\t  han2005,\n  title\t\t= {Borderline-SMOTE: A new over-sampling method in imbalanced\n\t\t  data sets learning},\n  author\t= {Han, Hui and Wang, Wen-Yuan and Mao, Bing-Huan},\n  journal\t= {Advances in intelligent computing},\n  pages\t\t= {878--887},\n  year\t\t= {2005},\n  booktitle\t= {Proceedings of the 1st International Conference on\n\t\t  Intelligent Computing},\n  month\t\t= {Aug.},\n  address\t= {Hefei, China}\n}\n\n@Article{\t  hart1968,\n  title\t\t= {The condensed nearest neighbor rule},\n  author\t= {Hart, Peter E.},\n  journal\t= {IEEE Transactions on Information Theory},\n  volume\t= {14},\n  number\t= {3},\n  pages\t\t= {515--516},\n  year\t\t= {1968},\n  publisher\t= {IEEE}\n}\n\n@InProceedings{\t  he2008,\n  title\t\t= {ADASYN: Adaptive synthetic sampling approach for\n\t\t  imbalanced learning},\n  author\t= {He, Haibo and Bai, Yang and Garcia, Edwardo A. 
and Li,\n\t\t  Shutao},\n  booktitle\t= {Proceedings of the 5th IEEE International Joint Conference\n\t\t  on Neural Networks},\n  pages\t\t= {1322--1328},\n  year\t\t= {2008},\n  organization\t= {IEEE},\n  month\t\t= {Jun.},\n  address\t= {Hong Kong, China}\n}\n\n@InProceedings{\t  kubat1997,\n  title\t\t= {Addressing the curse of imbalanced training sets:\n\t\t  One-sided selection},\n  author\t= {Kubat, Miroslav and Matwin, Stan},\n  booktitle\t= {Proceedings of the 14th International Conference on\n\t\t  Machine Learning},\n  volume\t= {97},\n  pages\t\t= {179--186},\n  year\t\t= {1997},\n  address\t= {Nashville, Tennessee, USA},\n  month\t\t= {Jul.}\n}\n\n@InProceedings{\t  laurikkala2001,\n  title\t\t= {Improving identification of difficult small classes by\n\t\t  balancing class distribution},\n  author\t= {Laurikkala, Jorma},\n  booktitle\t= {Proceedings of the 8th Conference on Artificial\n\t\t  Intelligence in Medicine in Europe},\n  pages\t\t= {63--66},\n  address\t= {Cascais, Portugal},\n  month\t\t= {Jul.},\n  year\t\t= {2001},\n  publisher\t= {Springer}\n}\n\n@Article{\t  liu2009,\n  title\t\t= {Exploratory undersampling for class-imbalance learning},\n  author\t= {Liu, Xu-Ying and Wu, Jianxin and Zhou, Zhi-Hua},\n  journal\t= {IEEE Transactions on Systems, Man, and Cybernetics},\n  volume\t= {39},\n  number\t= {2},\n  pages\t\t= {539--550},\n  year\t\t= {2009},\n  publisher\t= {IEEE}\n}\n\n@InProceedings{\t  mani2003,\n  title\t\t= {kNN approach to unbalanced data distributions: A case\n\t\t  study involving information extraction},\n  author\t= {Mani, Inderjeet and Zhang, Jianping},\n  booktitle\t= {Proceedings of the Workshop on Learning from Imbalanced\n\t\t  Data Sets},\n  volume\t= {126},\n  year\t\t= {2003},\n  month\t\t= {Aug.},\n  pages\t\t= {1--7},\n  address\t= {Washington, DC, USA}\n}\n\n@InProceedings{\t  nguyen2009,\n  title\t\t= {Borderline over-sampling for imbalanced data\n\t\t  classification},\n  author\t= {Nguyen, Hien M. 
and Cooper, Eric W. and Kamei, Katsuari},\n  booktitle\t= {Proceedings of the 5th International Workshop on\n\t\t  Computational Intelligence and Applications},\n  pages\t\t= {24--29},\n  year\t\t= {2009}\n}\n\n@Article{\t  smith2014,\n  title\t\t= {An instance level analysis of data complexity},\n  author\t= {Smith, Michael R. and Martinez, Tony and Giraud-Carrier,\n\t\t  Christophe},\n  journal\t= {Machine learning},\n  volume\t= {95},\n  number\t= {2},\n  pages\t\t= {225--256},\n  year\t\t= {2014},\n  publisher\t= {Springer}\n}\n\n@Article{\t  tomek1976a,\n  title\t\t= {Two modifications of CNN},\n  author\t= {Tomek, Ivan},\n  journal\t= {IEEE Transactions on Systems, Man, and Cybernetics},\n  volume\t= {6},\n  issue\t\t= {6},\n  pages\t\t= {769--772},\n  year\t\t= {1976}\n}\n\n@Article{\t  tomek1976b,\n  title\t\t= {An experiment with the edited nearest-neighbor rule},\n  author\t= {Tomek, Ivan},\n  journal\t= {IEEE Transactions on Systems, Man, and Cybernetics},\n  number\t= {6},\n  issue\t\t= {6},\n  pages\t\t= {448--452},\n  year\t\t= {1976}\n}\n\n@Article{\t  wilson1972,\n  title\t\t= {Asymptotic properties of nearest neighbor rules using\n\t\t  edited data},\n  author\t= {Wilson, Dennis L.},\n  journal\t= {IEEE Transactions on Systems, Man, and Cybernetics},\n  volume\t= {2},\n  number\t= {3},\n  pages\t\t= {408--421},\n  year\t\t= {1972},\n  publisher\t= {IEEE}\n}\n\n@article{chen2004using,\n  title={Using random forest to learn imbalanced data},\n  author={Chen, Chao and Liaw, Andy and Breiman, Leo},\n  journal={University of California, Berkeley},\n  volume={110},\n  pages={1--12},\n  year={2004}\n}\n\n@article{torelli2014rose,\n  author = {Menardi, Giovanna and Torelli, Nicola},\n  title={Training and assessing classification rules with imbalanced data},\n  journal={Data Mining and Knowledge Discovery},\n  volume={28},\n  pages={92--122},\n  year={2014},\n  publisher={Springer},\n  issue = {1},\n  issn = {1573-756X},\n  url = 
{https://doi.org/10.1007/s10618-012-0295-5},\n  doi = {10.1007/s10618-012-0295-5}\n}\n\n@article{stanfill1986toward,\n  title={Toward memory-based reasoning},\n  author={Stanfill, Craig and Waltz, David},\n  journal={Communications of the ACM},\n  volume={29},\n  number={12},\n  pages={1213--1228},\n  year={1986},\n  publisher={ACM New York, NY, USA}\n}\n\n@article{wilson1997improved,\n  title={Improved heterogeneous distance functions},\n  author={Wilson, D Randall and Martinez, Tony R},\n  journal={Journal of Artificial Intelligence Research},\n  volume={6},\n  pages={1--34},\n  year={1997}\n}\n\n@inproceedings{wang2009diversity,\n  title={Diversity analysis on imbalanced data sets by using ensemble models},\n  author={Wang, Shuo and Yao, Xin},\n  booktitle={2009 IEEE symposium on computational intelligence and data mining},\n  pages={324--331},\n  year={2009},\n  organization={IEEE}\n}\n\n@article{hido2009roughly,\n  title={Roughly balanced bagging for imbalanced data},\n  author={Hido, Shohei and Kashima, Hisashi and Takahashi, Yutaka},\n  journal={Statistical Analysis and Data Mining: The ASA Data Science Journal},\n  volume={2},\n  number={5-6},\n  pages={412--426},\n  year={2009},\n  publisher={Wiley Online Library}\n}\n\n@article{maclin1997empirical,\n  title={An empirical evaluation of bagging and boosting},\n  author={Maclin, Richard and Opitz, David},\n  journal={AAAI/IAAI},\n  volume={1997},\n  pages={546--551},\n  year={1997}\n}\n"
  }
]