Full Code of ray-project/xgboost_ray for AI

Repository: ray-project/xgboost_ray
Branch: master
Commit: e9049256575e
Files: 85
Total size: 505.3 KB

Directory structure:
gitextract_x7_6mxkx/

├── .flake8
├── .github/
│   └── workflows/
│       ├── gpu.yaml
│       └── test.yaml
├── .gitignore
├── LICENSE
├── README.md
├── format.sh
├── requirements/
│   ├── lint-requirements.txt
│   └── test-requirements.txt
├── run_ci_examples.sh
├── run_ci_tests.sh
├── setup.py
└── xgboost_ray/
    ├── __init__.py
    ├── callback.py
    ├── compat/
    │   ├── __init__.py
    │   └── tracker.py
    ├── data_sources/
    │   ├── __init__.py
    │   ├── _distributed.py
    │   ├── csv.py
    │   ├── dask.py
    │   ├── data_source.py
    │   ├── modin.py
    │   ├── numpy.py
    │   ├── object_store.py
    │   ├── pandas.py
    │   ├── parquet.py
    │   ├── partitioned.py
    │   ├── petastorm.py
    │   └── ray_dataset.py
    ├── elastic.py
    ├── examples/
    │   ├── __init__.py
    │   ├── create_test_data.py
    │   ├── higgs.py
    │   ├── higgs_parquet.py
    │   ├── readme.py
    │   ├── readme_sklearn_api.py
    │   ├── simple.py
    │   ├── simple_dask.py
    │   ├── simple_modin.py
    │   ├── simple_objectstore.py
    │   ├── simple_partitioned.py
    │   ├── simple_predict.py
    │   ├── simple_ray_dataset.py
    │   ├── simple_tune.py
    │   ├── train_on_test_data.py
    │   └── train_with_ml_dataset.py
    ├── main.py
    ├── matrix.py
    ├── session.py
    ├── sklearn.py
    ├── tests/
    │   ├── __init__.py
    │   ├── conftest.py
    │   ├── env_info.sh
    │   ├── fault_tolerance.py
    │   ├── release/
    │   │   ├── benchmark_cpu_gpu.py
    │   │   ├── benchmark_ft.py
    │   │   ├── cluster_cpu.yaml
    │   │   ├── cluster_ft.yaml
    │   │   ├── cluster_gpu.yaml
    │   │   ├── create_learnable_data.py
    │   │   ├── create_test_data.py
    │   │   ├── custom_objective_metric.py
    │   │   ├── run_e2e_gpu.sh
    │   │   ├── setup_xgboost.sh
    │   │   ├── start_cpu_cluster.sh
    │   │   ├── start_ft_cluster.sh
    │   │   ├── start_gpu_cluster.sh
    │   │   ├── submit_cpu_gpu_benchmark.sh
    │   │   ├── submit_ft_benchmark.sh
    │   │   ├── tune_cluster.yaml
    │   │   └── tune_placement.py
    │   ├── test_client.py
    │   ├── test_colocation.py
    │   ├── test_data_source.py
    │   ├── test_end_to_end.py
    │   ├── test_fault_tolerance.py
    │   ├── test_matrix.py
    │   ├── test_sklearn.py
    │   ├── test_sklearn_matrix.py
    │   ├── test_tune.py
    │   ├── test_xgboost_api.py
    │   └── utils.py
    ├── tune.py
    ├── util.py
    └── xgb.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .flake8
================================================
[flake8]
max-line-length = 88
inline-quotes = "
ignore =
  C408
  C417
  E121
  E123
  E126
  E203
  E226
  E24
  E704
  W503
  W504
  W605
  I
  N
  B001
  B002
  B003
  B004
  B005
  B007
  B008
  B009
  B010
  B011
  B012
  B013
  B014
  B015
  B016
  B017
avoid-escape = no
# Error E731 is ignored because of the migration from YAPF to Black.
# See https://github.com/ray-project/ray/issues/21315 for more information.
per-file-ignores =
    rllib/evaluation/worker_set.py:E731
    rllib/evaluation/sampler.py:E731


================================================
FILE: .github/workflows/gpu.yaml
================================================
name: GPU on manual trigger

on:
  workflow_dispatch

jobs:
  test_gpu:
    runs-on: ubuntu-latest
    timeout-minutes: 20
    steps:
    - uses: actions/checkout@v3
    - name: Set up Python 3.8
      uses: actions/setup-python@v3
      with:
        python-version: 3.8
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        python -m pip install -U anyscale pyyaml
    - name: Print environment info
      run: |
        ./xgboost_ray/tests/env_info.sh
    - name: Set anyscale project
      env:
        ANYSCALE_PROJECT: ${{ secrets.ANYSCALE_PROJECT }}
      run: |
        echo "project_id: ${ANYSCALE_PROJECT}" > ./xgboost_ray/tests/release/.anyscale.yaml
    - name: Run end to end GPU test
      env:
        ANYSCALE_CLI_TOKEN: ${{ secrets.ANYSCALE_CLI_TOKEN }}
      run: |
        pushd ./xgboost_ray/tests/release
        ./run_e2e_gpu.sh
        popd || true


================================================
FILE: .github/workflows/test.yaml
================================================
name: pytest on push

on:
  push:
  pull_request:
  schedule:
    - cron: "0 5 * * *"

jobs:
  test_lint:
    runs-on: ubuntu-latest
    timeout-minutes: 3
    steps:
    - uses: actions/checkout@v3
    - name: Set up Python 3.8
      uses: actions/setup-python@v3
      with:
        python-version: 3.8
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        python -m pip install codecov
        if [ -f requirements/lint-requirements.txt ]; then python -m pip install -r requirements/lint-requirements.txt; fi
    - name: Print environment info
      run: |
        ./xgboost_ray/tests/env_info.sh
    - name: Run format script
      run: |
        ls -alp
        ./format.sh --all

  test_linux_ray_master:
    runs-on: ubuntu-latest
    timeout-minutes: 160
    strategy:
      matrix:
        python-version: ["3.8", "3.9", "3.10"]
        include:
          - python-version: "3.8"
            ray-wheel: https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp38-cp38-manylinux2014_x86_64.whl
          - python-version: "3.9"
            ray-wheel: https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp39-cp39-manylinux2014_x86_64.whl
          - python-version: "3.10"
            ray-wheel: https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp310-cp310-manylinux2014_x86_64.whl
    steps:
    - uses: actions/checkout@v3
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v3
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        python -m pip install codecov
        python -m pip install -U ${{ matrix.ray-wheel }}
        if [ -f requirements/test-requirements.txt ]; then python -m pip install -r requirements/test-requirements.txt; fi
    - name: Install package
      run: |
        python -m pip install -e .
    - name: Print environment info
      run: |
        ./xgboost_ray/tests/env_info.sh
    - name: Run tests
      uses: nick-invision/retry@v2
      with:
        timeout_minutes: 45
        max_attempts: 3
        command: bash ./run_ci_tests.sh
    - name: Run examples
      uses: nick-invision/retry@v2
      with:
        timeout_minutes: 10
        max_attempts: 3
        command: bash ./run_ci_examples.sh

  test_linux_ray_release:
    runs-on: ubuntu-latest
    timeout-minutes: 160
    strategy:
      matrix:
        python-version: ["3.8", "3.9", "3.10"]
    steps:
    - uses: actions/checkout@v3
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v3
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        python -m pip install codecov
        python -m pip install -U ray
        if [ -f requirements/test-requirements.txt ]; then python -m pip install -r requirements/test-requirements.txt; fi
    - name: Install package
      run: |
        python -m pip install -e .
    - name: Print environment info
      run: |
        ./xgboost_ray/tests/env_info.sh
    - name: Run tests
      uses: nick-invision/retry@v2
      with:
        timeout_minutes: 45
        max_attempts: 3
        command: bash ./run_ci_tests.sh
    - name: Run examples
      uses: nick-invision/retry@v2
      with:
        timeout_minutes: 10
        max_attempts: 3
        command: bash ./run_ci_examples.sh

  test_linux_compat:
    # Test compatibility when some optional libraries are missing
    # Test runs on latest ray release
    runs-on: ubuntu-latest
    timeout-minutes: 160
    strategy:
      matrix:
        python-version: ["3.8", "3.9", "3.10"]
    steps:
    - uses: actions/checkout@v3
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v3
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        python -m pip install codecov
        python -m pip install -U ray
        if [ -f requirements/test-requirements.txt ]; then python -m pip install -r requirements/test-requirements.txt; fi
    - name: Uninstall unavailable dependencies
      # Disables modin and Ray Tune (via tabulate)
      run: |
        python -m pip uninstall -y modin
        python -m pip uninstall -y tabulate
    - name: Install package
      run: |
        python -m pip install -e .
    - name: Print environment info
      run: |
        ./xgboost_ray/tests/env_info.sh
    - name: Run tests
      uses: nick-invision/retry@v2
      with:
        timeout_minutes: 45
        max_attempts: 3
        command: bash ./run_ci_tests.sh --no-tune
    - name: Run examples
      uses: nick-invision/retry@v2
      with:
        timeout_minutes: 10
        max_attempts: 3
        command: bash ./run_ci_examples.sh --no-tune

  test_linux_cutting_edge:
    # Tests on cutting edge, i.e. latest Ray master, latest XGBoost master
    runs-on: ubuntu-latest
    timeout-minutes: 160
    strategy:
      matrix:
        # no new versions for xgboost are published for 3.6
        python-version: ["3.8", "3.9", "3.10"]
        include:
          - python-version: "3.8"
            ray-wheel: https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp38-cp38-manylinux2014_x86_64.whl
          - python-version: "3.9"
            ray-wheel: https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp39-cp39-manylinux2014_x86_64.whl
          - python-version: "3.10"
            ray-wheel: https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp310-cp310-manylinux2014_x86_64.whl
    steps:
    - uses: actions/checkout@v3
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v3
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        python -m pip install codecov
        python -m pip install -U ${{ matrix.ray-wheel }}
        if [ -f requirements/test-requirements.txt ]; then python -m pip install -r requirements/test-requirements.txt; fi
    - name: Install Ubuntu system dependencies
      run: |
        sudo apt-get install -y --no-install-recommends ninja-build
    - name: Install package
      run: |
        python -m pip install -e .
    - name: Clone XGBoost repo
      uses: actions/checkout@v3
      with:
        repository: dmlc/xgboost
        path: xgboost
        submodules: true
    - name: Install XGBoost from source
      shell: bash -l {0}
      run: |
        pushd ${GITHUB_WORKSPACE}/xgboost/python-package
        python --version
        python setup.py sdist
        pip install -v ./dist/xgboost-*.tar.gz
        popd
    - name: Print environment info
      run: |
        ./xgboost_ray/tests/env_info.sh
    - name: Run tests
      uses: nick-invision/retry@v2
      with:
        timeout_minutes: 45
        max_attempts: 3
        command: bash ./run_ci_tests.sh
    - name: Run examples
      uses: nick-invision/retry@v2
      with:
        timeout_minutes: 10
        max_attempts: 3
        command: bash ./run_ci_examples.sh

  test_linux_xgboost_legacy:
    # Tests on XGBoost 0.90 and latest Ray release
    runs-on: ubuntu-latest
    timeout-minutes: 160
    strategy:
      matrix:
        python-version: [3.8]
    steps:
    - uses: actions/checkout@v3
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v3
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        python -m pip install codecov
        python -m pip install -U ray
        if [ -f requirements/test-requirements.txt ]; then python -m pip install -r requirements/test-requirements.txt; fi
    - name: Install package
      run: |
        python -m pip install -e .
    - name: Install legacy XGBoost
      run: |
        python -m pip install xgboost==0.90
    - name: Print environment info
      run: |
        ./xgboost_ray/tests/env_info.sh
    - name: Run tests
      uses: nick-invision/retry@v2
      with:
        timeout_minutes: 45
        max_attempts: 3
        command: bash ./run_ci_tests.sh
    - name: Run examples
      uses: nick-invision/retry@v2
      with:
        timeout_minutes: 10
        max_attempts: 3
        command: bash ./run_ci_examples.sh


================================================
FILE: .gitignore
================================================
# Python byte code files
*.pyc
python/.eggs

# Backup files
*.bak

# Emacs temporary files
*~
*#

# Debug symbols
*.pdb

# Visual Studio files
/packages
*.suo
*.user
*.VC.db
*.VC.opendb

# Protobuf-generated files
*_pb2.py
*.pb.h
*.pb.cc

# Ray cluster configuration
scripts/nodes.txt

# OS X folder attributes
.DS_Store

# Debug files
*.dSYM/
*.su

# Python setup files
*.egg-info

# Compressed files
*.gz

# Datasets from examples
**/MNIST_data/
**/cifar-10-batches-bin/

# Generated documentation files
/doc/_build
/doc/source/_static/thumbs
/doc/source/tune/generated_guides/

# User-specific stuff:
.idea/

# Pytest Cache
**/.pytest_cache
**/.cache
.benchmarks
python-driver-*

# Vscode
.vscode/

*.iml

# python virtual env
venv

# pyenv version file
.python-version

# Vim
.*.swp
*.swp
tags

# Emacs
.#*

# tools
tools/prometheus*

# ray project files
project-id
.mypy_cache/

# XGBoost models from examples
*.xgb

# Downloaded test data
*.csv
*.csv.gz
*.parquet

# Byte-compiled files
__pycache__/

================================================
FILE: LICENSE
================================================
                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   APPENDIX: How to apply the Apache License to your work.

      To apply the Apache License to your work, attach the following
      boilerplate notice, with the fields enclosed by brackets "{}"
      replaced with your own identifying information. (Don't include
      the brackets!)  The text should be enclosed in the appropriate
      comment syntax for the file format. We also recommend that a
      file or class name and description of purpose be included on the
      same "printed page" as the copyright notice for easier
      identification within third-party archives.

   Copyright {yyyy} {name of copyright owner}

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.

--------------------------------------------------------------------------------

Code in python/ray/rllib/{evolution_strategies, dqn} adapted from
https://github.com/openai (MIT License)

Copyright (c) 2016 OpenAI (http://openai.com)

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

--------------------------------------------------------------------------------

Code in python/ray/rllib/impala/vtrace.py from
https://github.com/deepmind/scalable_agent

Copyright 2018 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

--------------------------------------------------------------------------------
Code in python/ray/rllib/ars is adapted from https://github.com/modestyachts/ARS

Copyright (c) 2018, ARS contributors (Horia Mania, Aurelia Guy, Benjamin Recht)
All rights reserved.

Redistribution and use of ARS in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation and/or
other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

------------------
Code in python/ray/prometheus_exporter.py is adapted from https://github.com/census-instrumentation/opencensus-python/blob/master/contrib/opencensus-ext-prometheus/opencensus/ext/prometheus/stats_exporter/__init__.py

# Copyright 2018, OpenCensus Authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


================================================
FILE: README.md
================================================
<!--$UNCOMMENT(xgboost-ray)=-->

# Distributed XGBoost on Ray
<!--$REMOVE-->
![Build Status](https://github.com/ray-project/xgboost_ray/workflows/pytest%20on%20push/badge.svg)
[![docs.ray.io](https://img.shields.io/badge/docs-ray.io-blue)](https://docs.ray.io/en/master/xgboost-ray.html)
<!--$END_REMOVE-->
XGBoost-Ray is a distributed backend for
[XGBoost](https://xgboost.readthedocs.io/en/latest/), built
on top of the distributed computing framework
[Ray](https://ray.io).

XGBoost-Ray

- enables [multi-node](#usage) and [multi-GPU](#multi-gpu-training) training
- integrates seamlessly with the distributed [hyperparameter optimization](#hyperparameter-tuning) library [Ray Tune](http://tune.io)
- comes with advanced [fault tolerance handling](#fault-tolerance) mechanisms, and
- supports [distributed dataframes and distributed data loading](#distributed-data-loading)

All releases are tested on large clusters and workloads.

## Installation

You can install the latest XGBoost-Ray release from PyPI using pip:

```bash
pip install "xgboost_ray"
```

If you'd like to install the latest master, use this command instead:

```bash
pip install "git+https://github.com/ray-project/xgboost_ray.git#egg=xgboost_ray"
```
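
To confirm that the installation succeeded, you can query the installed version from Python. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of XGBoost-Ray, and the distribution name is assumed to match the one used in the `pip install` command above.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Assumption: the distribution is registered under the name "xgboost_ray".
version = installed_version("xgboost_ray")
print(version if version else "xgboost_ray is not installed")
```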

## Usage

XGBoost-Ray provides a drop-in replacement for XGBoost's `train`
function. To pass data, use `xgboost_ray.RayDMatrix` instead of
`xgb.DMatrix`. A scikit-learn interface is also available; see the
[scikit-learn API](#scikit-learn-api) section.


Just as with the original `xgb.train()` function, the
[training parameters](https://xgboost.readthedocs.io/en/stable/parameter.html)
are passed as the `params` dictionary.

Ray-specific distributed training parameters are configured with a
`xgboost_ray.RayParams` object. For instance, you can set
the `num_actors` property to specify how many distributed actors
you would like to use.

Here is a simplified example (which requires `sklearn`):

**Training:**

```python
from xgboost_ray import RayDMatrix, RayParams, train
from sklearn.datasets import load_breast_cancer

train_x, train_y = load_breast_cancer(return_X_y=True)
train_set = RayDMatrix(train_x, train_y)

evals_result = {}
bst = train(
    {
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
    },
    train_set,
    evals_result=evals_result,
    evals=[(train_set, "train")],
    verbose_eval=False,
    ray_params=RayParams(
        num_actors=2,  # Number of remote actors
        cpus_per_actor=1))

bst.save_model("model.xgb")
print("Final training error: {:.4f}".format(
    evals_result["train"]["error"][-1]))
```

**Prediction:**

```python
from xgboost_ray import RayDMatrix, RayParams, predict
from sklearn.datasets import load_breast_cancer
import xgboost as xgb

data, labels = load_breast_cancer(return_X_y=True)

dpred = RayDMatrix(data, labels)

bst = xgb.Booster(model_file="model.xgb")
pred_ray = predict(bst, dpred, ray_params=RayParams(num_actors=2))

print(pred_ray)
```

### scikit-learn API

XGBoost-Ray also features a scikit-learn API that mirrors the plain
XGBoost scikit-learn API, so its estimators can be used as drop-in
replacements. The following estimators are available:

- `RayXGBClassifier`
- `RayXGBRegressor`
- `RayXGBRFClassifier`
- `RayXGBRFRegressor`
- `RayXGBRanker`

Example usage of `RayXGBClassifier`:

```python
from xgboost_ray import RayXGBClassifier, RayParams
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

seed = 42

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.25, random_state=42
)

clf = RayXGBClassifier(
    n_jobs=4,  # In XGBoost-Ray, n_jobs sets the number of actors
    random_state=seed
)

# scikit-learn API will automatically convert the data
# to RayDMatrix format as needed.
# You can also pass X as a RayDMatrix, in which case
# y will be ignored.

clf.fit(X_train, y_train)

pred_ray = clf.predict(X_test)
print(pred_ray)

pred_proba_ray = clf.predict_proba(X_test)
print(pred_proba_ray)

# It is also possible to pass a RayParams object
# to fit/predict/predict_proba methods - will override
# n_jobs set during initialization

clf.fit(X_train, y_train, ray_params=RayParams(num_actors=2))

pred_ray = clf.predict(X_test, ray_params=RayParams(num_actors=2))
print(pred_ray)
```

Things to keep in mind:

- The `n_jobs` parameter controls the number of actors spawned.
  You can pass a `RayParams` object to the
  `fit`/`predict`/`predict_proba` methods as the `ray_params` argument
  for greater control over resource allocation. Doing
  so will override the value of `n_jobs` with the value of
  the `ray_params.num_actors` attribute. For more information, refer
  to the [Resources](#resources) section below.
- By default `n_jobs` is set to `1`, which means the training
  will **not** be distributed. Make sure to either set `n_jobs`
  to a higher value or pass a `RayParams` object as outlined above
  in order to take advantage of XGBoost-Ray's functionality.
- After calling `fit`, additional evaluation results (e.g. training time,
  number of rows, callback results) will be available under
  the `additional_results_` attribute.
- XGBoost-Ray's scikit-learn API is based on XGBoost 1.4.
  While we try to support older XGBoost versions, please note that
  this library is only fully tested and supported for XGBoost >= 1.4.

For more information on the scikit-learn API, refer to the [XGBoost documentation](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn).

## Data loading

Data is passed to XGBoost-Ray via a `RayDMatrix` object.

The `RayDMatrix` lazily loads data and stores it sharded in the
Ray object store. The Ray XGBoost actors then access these
shards to run their training on.

A `RayDMatrix` supports various data and file types, such as
Pandas DataFrames, Numpy arrays, CSV files and Parquet files.

Example loading multiple parquet files:

```python
import glob
from xgboost_ray import RayDMatrix, RayFileType

# We can also pass a list of files
path = list(sorted(glob.glob("/data/nyc-taxi/*/*/*.parquet")))

# This argument will be passed to `pd.read_parquet()`
columns = [
    "passenger_count",
    "trip_distance", "pickup_longitude", "pickup_latitude",
    "dropoff_longitude", "dropoff_latitude",
    "fare_amount", "extra", "mta_tax", "tip_amount",
    "tolls_amount", "total_amount"
]

dtrain = RayDMatrix(
    path,
    label="passenger_count",  # Will select this column as the label
    columns=columns,
    # ignore=["total_amount"],  # Optional list of columns to ignore
    filetype=RayFileType.PARQUET)
```

<!--$UNCOMMENT(xgboost-ray-tuning)=-->

## Hyperparameter Tuning

XGBoost-Ray integrates with <!--$UNCOMMENT{ref}`Ray Tune <tune-main>`--><!--$REMOVE-->[Ray Tune](https://tune.io)<!--$END_REMOVE--> to provide distributed hyperparameter tuning for your
distributed XGBoost models. You can run multiple XGBoost-Ray training runs in parallel, each with a different
hyperparameter configuration, and each training run parallelized by itself. All you have to do is move your training
code to a function, and pass the function to `tune.run`. Internally, `train` will detect if `tune` is being used and will
automatically report results to tune.

Example using XGBoost-Ray with Ray Tune:

```python
from xgboost_ray import RayDMatrix, RayParams, train
from sklearn.datasets import load_breast_cancer

num_actors = 4
num_cpus_per_actor = 1

ray_params = RayParams(
    num_actors=num_actors,
    cpus_per_actor=num_cpus_per_actor)

def train_model(config):
    train_x, train_y = load_breast_cancer(return_X_y=True)
    train_set = RayDMatrix(train_x, train_y)

    evals_result = {}
    bst = train(
        params=config,
        dtrain=train_set,
        evals_result=evals_result,
        evals=[(train_set, "train")],
        verbose_eval=False,
        ray_params=ray_params)
    bst.save_model("model.xgb")

from ray import tune

# Specify the hyperparameter search space.
config = {
    "tree_method": "approx",
    "objective": "binary:logistic",
    "eval_metric": ["logloss", "error"],
    "eta": tune.loguniform(1e-4, 1e-1),
    "subsample": tune.uniform(0.5, 1.0),
    "max_depth": tune.randint(1, 9)
}

# Make sure to use the `get_tune_resources` method to set the `resources_per_trial`
analysis = tune.run(
    train_model,
    config=config,
    metric="train-error",
    mode="min",
    num_samples=4,
    resources_per_trial=ray_params.get_tune_resources())
print("Best hyperparameters", analysis.best_config)
```

Also see [examples/simple_tune.py](https://github.com/ray-project/xgboost_ray/blob/master/xgboost_ray/examples/simple_tune.py) for another example.

## Fault tolerance

XGBoost-Ray leverages the stateful Ray actor model to
enable fault tolerant training. There are currently
two modes implemented.

### Non-elastic training (warm restart)

When an actor or node dies, XGBoost-Ray will retain the
state of the remaining actors. In non-elastic training,
the failed actors will be replaced as soon as resources
are available again. Only these actors will reload their
parts of the data. Training will resume once all actors
are ready for training again.

You can set this mode in the `RayParams`:

```python
from xgboost_ray import RayParams

ray_params = RayParams(
    elastic_training=False,  # Use non-elastic training
    max_actor_restarts=2,    # How often are actors allowed to fail
)
```

### Elastic training

In elastic training, XGBoost-Ray will continue training
with fewer actors (and on less data) when a node or actor
dies. The missing actors are staged in the background,
and are reintegrated into training once they are back and
have loaded their data.

This mode will train on less data for a period of time,
which can impact accuracy. In practice, we found these
effects to be minor, especially for large shuffled datasets.
The immediate benefit is that training time is reduced
significantly to almost the same level as if no actors died.
Thus, especially when data loading takes a large part of
the total training time, this setting can dramatically speed
up training times for large distributed jobs.

You can configure this mode in the `RayParams`:

```python
from xgboost_ray import RayParams

ray_params = RayParams(
    elastic_training=True,  # Use elastic training
    max_failed_actors=3,    # Only allow at most 3 actors to die at the same time
    max_actor_restarts=2,   # How often are actors allowed to fail
)
```

## Resources

By default, XGBoost-Ray tries to determine the number of CPUs
available and distributes them evenly across actors.

In the case of very large clusters or clusters with many different
machine sizes, it makes sense to limit the number of CPUs per actor
by setting the `cpus_per_actor` argument. Consider always
setting this explicitly.

The number of XGBoost actors always has to be set manually with
the `num_actors` argument.

### Multi GPU training

XGBoost-Ray enables multi GPU training. The XGBoost core backend
will automatically leverage NCCL2 for cross-device communication.
All you have to do is start one actor per GPU and set XGBoost's
`tree_method` to a GPU-compatible option, e.g. `gpu_hist` (see the XGBoost
documentation for more details).

For instance, if you have 2 machines with 4 GPUs each, you will want
to start 8 remote actors, and set `gpus_per_actor=1`. There is usually
no benefit in allocating less (e.g. 0.5) or more than one GPU per actor.

You should divide the CPUs evenly across actors per machine, so if your
machines have 16 CPUs in addition to the 4 GPUs, each actor should have
4 CPUs to use.

```python
from xgboost_ray import RayParams

ray_params = RayParams(
    num_actors=8,
    gpus_per_actor=1,
    cpus_per_actor=4,   # Divide evenly across actors per machine
)
```
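
The arithmetic behind these values can be sketched as follows. This is only an illustration and not part of XGBoost-Ray: `gpu_actor_layout` is a hypothetical helper, and it assumes that all machines in the cluster are identical.

```python
def gpu_actor_layout(num_machines, gpus_per_machine, cpus_per_machine):
    """Return (num_actors, gpus_per_actor, cpus_per_actor) for one actor per GPU.

    Hypothetical helper for illustration; assumes identical machines.
    """
    num_actors = num_machines * gpus_per_machine
    # Split each machine's CPUs evenly across the actors placed on it.
    cpus_per_actor = cpus_per_machine // gpus_per_machine
    return num_actors, 1, cpus_per_actor


# Two machines with 4 GPUs and 16 CPUs each, as in the text above.
layout = gpu_actor_layout(num_machines=2, gpus_per_machine=4, cpus_per_machine=16)
print(layout)  # (8, 1, 4)
```

The three values map to the `num_actors`, `gpus_per_actor` and `cpus_per_actor` arguments of `RayParams`.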

### How many remote actors should I use?

This depends on your workload and your cluster setup.
Generally there is no inherent benefit of running more than
one remote actor per node for CPU-only training. This is because
XGBoost core can already leverage multiple CPUs via threading.

However, there are some cases when you should consider starting
more than one actor per node:

- For [multi GPU training](#multi-gpu-training), each GPU should have a separate
  remote actor. Thus, if your machine has 24 CPUs and 4 GPUs,
  you will want to start 4 remote actors with 6 CPUs and 1 GPU
  each.
- In a **heterogeneous cluster**, you might want to find the
  [greatest common divisor](https://en.wikipedia.org/wiki/Greatest_common_divisor)
  for the number of CPUs.
  E.g. for a cluster with three nodes of 4, 8, and 12 CPUs, respectively,
  you should set the number of actors to 6 and the CPUs per
  actor to 4.
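
The heterogeneous-cluster rule can be written down in a few lines. This is a minimal sketch using only the Python standard library; `cpu_actor_layout` is a hypothetical helper name and not an XGBoost-Ray function.

```python
from functools import reduce
from math import gcd


def cpu_actor_layout(cpus_per_node):
    """Return (num_actors, cpus_per_actor) for a list of per-node CPU counts.

    Hypothetical helper for illustration only.
    """
    # The greatest common divisor is the largest actor size that
    # tiles every node without leaving CPUs unused.
    cpus_per_actor = reduce(gcd, cpus_per_node)
    num_actors = sum(cpus_per_node) // cpus_per_actor
    return num_actors, cpus_per_actor


# Three nodes with 4, 8, and 12 CPUs, as in the example above.
print(cpu_actor_layout([4, 8, 12]))  # (6, 4)
```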

## Distributed data loading

XGBoost-Ray can leverage both centralized and distributed data loading.

In **centralized data loading**, the data is partitioned by the head node
and stored in the object store. Each remote actor then retrieves their
partitions by querying the Ray object store. Centralized loading is used
when you pass centralized in-memory dataframes, such as Pandas dataframes
or Numpy arrays, or when you pass a single source file, such as a single CSV
or Parquet file.

```python
from xgboost_ray import RayDMatrix

# This will use centralized data loading, as only one source file is specified
# `label_col` is a column in the CSV, used as the target label
dtrain = RayDMatrix("./source_file.csv", label="label_col")
```

In **distributed data loading**, each remote actor loads their data directly from
the source (e.g. local hard disk, NFS, HDFS, S3),
without a central bottleneck. The data is still stored in the
object store, but locally to each actor. This mode is used automatically
when loading data from multiple CSV or Parquet files. Please note that
we do not check or enforce partition sizes in this case - it is your job
to make sure the data is evenly distributed across the source files.

```python
from xgboost_ray import RayDMatrix

# This will use distributed data loading, as four source files are specified
# Please note that you cannot schedule more than four actors in this case.
# `label_col` is a column in the Parquet files, used as the target label
dtrain = RayDMatrix([
    "hdfs:///tmp/part1.parquet",
    "hdfs:///tmp/part2.parquet",
    "hdfs:///tmp/part3.parquet",
    "hdfs:///tmp/part4.parquet",
], label="label_col")
```
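
Because partition sizes are not checked for you, it can help to run a small sanity check before training. The sketch below is not part of XGBoost-Ray: `check_shards` is a hypothetical helper, the file sizes are made-up example values, and the `max_skew` threshold is arbitrary. For local files, the sizes could be collected with `os.path.getsize`.

```python
def check_shards(shard_sizes, num_actors, max_skew=1.5):
    """Validate a {path: size_in_bytes} mapping before distributed loading.

    Hypothetical pre-flight check, not an XGBoost-Ray API.
    Returns True if the largest file is at most `max_skew` times the smallest.
    """
    if num_actors > len(shard_sizes):
        # Each actor needs at least one source file to load.
        raise ValueError(
            f"{num_actors} actors requested, but only "
            f"{len(shard_sizes)} files given")
    smallest = min(shard_sizes.values())
    largest = max(shard_sizes.values())
    return largest <= max_skew * smallest


# Made-up sizes (in bytes) for the four files from the example above.
sizes = {
    "hdfs:///tmp/part1.parquet": 100_000_000,
    "hdfs:///tmp/part2.parquet": 110_000_000,
    "hdfs:///tmp/part3.parquet": 95_000_000,
    "hdfs:///tmp/part4.parquet": 105_000_000,
}
balanced = check_shards(sizes, num_actors=4)
print(balanced)
```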

Lastly, XGBoost-Ray supports **distributed dataframe** representations, such
as <!--$UNCOMMENT{ref}`Ray Datasets <datasets>`--><!--$REMOVE-->[Ray Datasets](https://docs.ray.io/en/latest/data/dataset.html)<!--$END_REMOVE-->,
[Modin](https://modin.readthedocs.io/en/latest/) and
[Dask dataframes](https://docs.dask.org/en/latest/dataframe.html)
(used with <!--$UNCOMMENT{ref}`Dask on Ray <dask-on-ray>`--><!--$REMOVE-->[Dask on Ray](https://docs.ray.io/en/master/dask-on-ray.html)<!--$END_REMOVE-->).
Here, XGBoost-Ray will check on which nodes the distributed partitions
are currently located, and will assign partitions to actors in order to
minimize cross-node data transfer. Please note that we also assume here
that partition sizes are uniform.

```python
from xgboost_ray import RayDMatrix

# This will try to allocate the existing Modin partitions
# to co-located Ray actors. If this is not possible, data will
# be transferred across nodes
dtrain = RayDMatrix(existing_modin_df)
```

### Data sources

The following data sources can be used with a `RayDMatrix` object.

| Type                                                             | Centralized loading | Distributed loading |
|------------------------------------------------------------------|---------------------|---------------------|
| Numpy array                                                      | Yes                 | No                  |
| Pandas dataframe                                                 | Yes                 | No                  |
| Single CSV                                                       | Yes                 | No                  |
| Multi CSV                                                        | Yes                 | Yes                 |
| Single Parquet                                                   | Yes                 | No                  |
| Multi Parquet                                                    | Yes                 | Yes                 |
| [Ray Dataset](https://docs.ray.io/en/latest/data/dataset.html)   | Yes                 | Yes                 |
| [Petastorm](https://github.com/uber/petastorm)                   | Yes                 | Yes                 |
| [Dask dataframe](https://docs.dask.org/en/latest/dataframe.html) | Yes                 | Yes                 |
| [Modin dataframe](https://modin.readthedocs.io/en/latest/)       | Yes                 | Yes                 |

## Memory usage

XGBoost uses a compute-optimized data structure, the `DMatrix`,
to hold training data. When converting a dataset to a `DMatrix`,
XGBoost creates intermediate copies and ends up
holding a complete copy of the full data. The data will be converted
into the local data format (on a 64-bit system these are 64-bit floats).
Depending on the system and original dataset dtype, this matrix can
thus occupy more memory than the original dataset.

The **peak memory usage** for CPU-based training is at least
**3x** the dataset size (assuming dtype `float32` on a 64-bit system)
plus about **400,000 KiB** for other resources,
such as operating system requirements and storage of intermediate
results.

**Example**

- Machine type: AWS m5.xlarge (4 vCPUs, 16 GiB RAM)
- Usable RAM: ~15,350,000 KiB
- Dataset: 1,250,000 rows with 1024 features, dtype float32.
  Total size: 5,000,000 KiB
- XGBoost DMatrix size: ~10,000,000 KiB

This dataset will fit exactly on this node for training.

Note that the DMatrix size might be lower on a 32 bit system.
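
The rule of thumb above can be turned into a quick back-of-the-envelope calculation. This is a sketch under the stated assumptions (`float32` data, the 3x factor as a lower bound, and the approximate 400,000 KiB overhead); `estimate_peak_kib` is a hypothetical helper, not an XGBoost-Ray function.

```python
OVERHEAD_KIB = 400_000  # approximate fixed overhead quoted in the text


def estimate_peak_kib(rows, features, bytes_per_value=4):
    """Return (dataset_kib, estimated_peak_kib) for a dense dataset.

    Hypothetical helper; uses the 3x lower bound for CPU training.
    """
    dataset_kib = rows * features * bytes_per_value // 1024
    return dataset_kib, 3 * dataset_kib + OVERHEAD_KIB


# The example dataset: 1,250,000 rows with 1024 float32 features.
dataset_kib, peak_kib = estimate_peak_kib(1_250_000, 1024)
print(dataset_kib)  # 5000000
print(peak_kib)     # 15400000
```

The estimate lands right around the usable RAM of the example node, which is why this dataset is a tight fit there.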

**GPUs**

Generally, the same memory requirements exist for GPU-based
training. Additionally, the GPU must have enough memory
to hold the dataset.

In the example above, the GPU must have at least
10,000,000 KiB (about 9.6 GiB) memory. However,
empirically we found that using a `DeviceQuantileDMatrix`
shows about 10% higher peak GPU memory usage, possibly
due to intermediate storage when loading data.

**Best practices**

In order to reduce peak memory usage, consider the following
suggestions:

- Store data as `float32` or less. More precision is often
  not needed, and keeping data in a smaller format will
  help reduce peak memory usage for initial data loading.
- Pass the `dtype` when loading data from CSV. Otherwise,
  floating point values will be loaded as `np.float64`
  by default, increasing peak memory usage by 33%.

## Placement Strategies

XGBoost-Ray leverages Ray's Placement Group API (<https://docs.ray.io/en/latest/ray-core/scheduling/placement-group.html>)
to implement placement strategies for better fault tolerance.

By default, a SPREAD strategy is used for training, which attempts to spread all of the training workers
across the nodes in a cluster on a best-effort basis. This improves fault tolerance since it minimizes the
number of worker failures when a node goes down, but comes at the cost of increased inter-node communication.
To disable this strategy, set the `RXGB_USE_SPREAD_STRATEGY` environment variable to 0. If disabled, no
particular placement strategy will be used.

Note that this strategy is used only when `elastic_training` is not used. If `elastic_training` is set to `True`,
no placement strategy is used.

When XGBoost-Ray is used with Ray Tune for hyperparameter tuning, a PACK strategy is used. This strategy
attempts to place all workers for each trial on the same node on a best-effort basis. This means that if a node
goes down, it will be less likely to impact multiple trials.

When placement strategies are used, XGBoost-Ray will wait for 100 seconds for the required resources
to become available, and will fail if the required resources cannot be reserved and the cluster cannot autoscale
to increase the number of resources. You can change the `RXGB_PLACEMENT_GROUP_TIMEOUT_S` environment variable to modify
how long this timeout should be.
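
As a minimal sketch, both environment variables can be set from Python before training starts. This assumes the variables are read from the environment of the process that launches training; exporting them in the shell before starting the script is equivalent. The timeout of 300 seconds is just an example value.

```python
import os

# Disable the default SPREAD strategy for non-elastic training.
os.environ["RXGB_USE_SPREAD_STRATEGY"] = "0"

# Wait up to 300 seconds (example value) for the required
# resources to become available.
os.environ["RXGB_PLACEMENT_GROUP_TIMEOUT_S"] = "300"
```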

## More examples

For complete end to end examples, please have a look at
the [examples folder](https://github.com/ray-project/xgboost_ray/tree/master/xgboost_ray/examples/):

- [Simple sklearn breast cancer dataset example](https://github.com/ray-project/xgboost_ray/blob/master/xgboost_ray/examples/simple.py) (requires `sklearn`)
- [HIGGS classification example](https://github.com/ray-project/xgboost_ray/blob/master/xgboost_ray/examples/higgs.py)
  ([download dataset (2.6 GB)](https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz))
- [HIGGS classification example with Parquet](https://github.com/ray-project/xgboost_ray/blob/master/xgboost_ray/examples/higgs_parquet.py) (uses the same dataset)
- [Test data classification](https://github.com/ray-project/xgboost_ray/blob/master/xgboost_ray/examples/train_on_test_data.py) (uses a self-generated dataset)
<!--$REMOVE-->
## Resources

* [XGBoost-Ray documentation](https://xgboost.readthedocs.io/en/stable/tutorials/ray.html)
* [Ray community slack](https://forms.gle/9TSdDYUgxYs8SA9e8)
<!--$END_REMOVE-->
<!--$UNCOMMENT## API reference

```{eval-rst}
.. autoclass:: xgboost_ray.RayParams
    :members:
```

```{eval-rst}
.. autoclass:: xgboost_ray.RayDMatrix
    :members:
```

```{eval-rst}
.. autofunction:: xgboost_ray.train
```

```{eval-rst}
.. autofunction:: xgboost_ray.predict
```

### scikit-learn API

```{eval-rst}
.. autoclass:: xgboost_ray.RayXGBClassifier
    :members:
```

```{eval-rst}
.. autoclass:: xgboost_ray.RayXGBRegressor
    :members:
```

```{eval-rst}
.. autoclass:: xgboost_ray.RayXGBRFClassifier
    :members:
```

```{eval-rst}
.. autoclass:: xgboost_ray.RayXGBRFRegressor
    :members:
```-->


================================================
FILE: format.sh
================================================
#!/usr/bin/env bash
# Black + Clang formatter (if installed). This script formats all changed files from the last mergebase.
# You are encouraged to run this locally before pushing changes for review.

# Cause the script to exit if a single command fails
set -euo pipefail

FLAKE8_VERSION_REQUIRED="3.9.1"
BLACK_VERSION_REQUIRED="22.10.0"
SHELLCHECK_VERSION_REQUIRED="0.7.1"
ISORT_VERSION_REQUIRED="5.10.1"

check_python_command_exist() {
    VERSION=""
    case "$1" in
        black)
            VERSION=$BLACK_VERSION_REQUIRED
            ;;
        flake8)
            VERSION=$FLAKE8_VERSION_REQUIRED
            ;;
        isort)
            VERSION=$ISORT_VERSION_REQUIRED
            ;;
        *)
            echo "$1 is not a required dependency"
            exit 1
    esac
    if ! [ -x "$(command -v "$1")" ]; then
        echo "$1 not installed. Install the python package with: pip install $1==$VERSION"
        exit 1
    fi
}

check_docstyle() {
    echo "Checking docstyle..."
    violations=$(git ls-files | grep '\.py$' | xargs grep -E '^[ ]+[a-z_]+ ?\([a-zA-Z]+\): ' | grep -v 'str(' | grep -v noqa || true)
    if [[ -n "$violations" ]]; then
        echo
        echo "=== Found Ray docstyle violations ==="
        echo "$violations"
        echo
        echo "Per the Google pydoc style, omit types from pydoc args as they are redundant: https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#code-style "
        echo "If this is a false positive, you can add a '# noqa' comment to the line to ignore."
        exit 1
    fi
    return 0
}

check_python_command_exist black
check_python_command_exist flake8
check_python_command_exist isort

# this stops git rev-parse from failing if we run this from the .git directory
builtin cd "$(dirname "${BASH_SOURCE:-$0}")"

ROOT="$(git rev-parse --show-toplevel)"
builtin cd "$ROOT" || exit 1

# NOTE(edoakes): black version differs based on installation method:
#   Option 1) 'black, 21.12b0 (compiled: no)'
#   Option 2) 'black, version 21.12b0'
#   For newer versions (at least 22.10.0), a second line is printed which must be dropped:
#
#     black, 22.10.0 (compiled: yes)
#     Python (CPython) 3.9.13
BLACK_VERSION_STR=$(black --version)
if [[ "$BLACK_VERSION_STR" == *"compiled"* ]]
then
    BLACK_VERSION=$(echo "$BLACK_VERSION_STR" | head -n 1 | awk '{print $2}')
else
    BLACK_VERSION=$(echo "$BLACK_VERSION_STR" | head -n 1 | awk '{print $3}')
fi
FLAKE8_VERSION=$(flake8 --version | head -n 1 | awk '{print $1}')
ISORT_VERSION=$(isort --version | grep VERSION | awk '{print $2}')

# params: tool name, tool version, required version
tool_version_check() {
    if [ "$2" != "$3" ]; then
        echo "WARNING: Ray uses $1 $3, you are currently using $2. This might generate different results."
    fi
}

tool_version_check "flake8" "$FLAKE8_VERSION" "$FLAKE8_VERSION_REQUIRED"
tool_version_check "black" "$BLACK_VERSION" "$BLACK_VERSION_REQUIRED"
tool_version_check "isort" "$ISORT_VERSION" "$ISORT_VERSION_REQUIRED"

if command -v shellcheck >/dev/null; then
    SHELLCHECK_VERSION=$(shellcheck --version | awk '/^version:/ {print $2}')
    tool_version_check "shellcheck" "$SHELLCHECK_VERSION" "$SHELLCHECK_VERSION_REQUIRED"
else
    echo "INFO: Ray uses shellcheck for shell scripts, which is not installed. You may install shellcheck=$SHELLCHECK_VERSION_REQUIRED with your system package manager."
fi

if command -v clang-format >/dev/null; then
  CLANG_FORMAT_VERSION=$(clang-format --version | awk '{print $3}')
  tool_version_check "clang-format" "$CLANG_FORMAT_VERSION" "12.0.0"
else
    echo "WARNING: clang-format is not installed!"
fi

if [[ $(flake8 --version) != *"flake8_quotes"* ]]; then
    echo "WARNING: Ray uses flake8 with flake8_quotes. Might error without it. Install with: pip install flake8-quotes"
fi

if [[ $(flake8 --version) != *"flake8-bugbear"* ]]; then
    echo "WARNING: Ray uses flake8 with flake8-bugbear. Might error without it. Install with: pip install flake8-bugbear"
fi

SHELLCHECK_FLAGS=(
  --exclude=1090  # "Can't follow non-constant source. Use a directive to specify location."
  --exclude=1091  # "Not following {file} due to some error"
  --exclude=2207  # "Prefer mapfile or read -a to split command output (or quote to avoid splitting)." -- these aren't compatible with macOS's old Bash
)


BLACK_EXCLUDES=(
    '--force-exclude'
    'python/ray/cloudpickle/*|'`
    `'python/build/*|'`
    `'python/ray/core/src/ray/gcs/*|'`
    `'python/ray/thirdparty_files/*|'`
    `'python/ray/_private/thirdparty/*|'`
    `'python/ray/serve/tests/test_config_files/syntax_error\.py'
)

GIT_LS_EXCLUDES=(
  ':(exclude)python/ray/cloudpickle/'
  ':(exclude)python/ray/_private/runtime_env/_clonevirtualenv.py'
)

# TODO(barakmich): This should be cleaned up. I've at least excised the copies
# of these arguments to this location, but the long-term answer is to actually
# make a flake8 config file
FLAKE8_PYX_IGNORES="--ignore=C408,E121,E123,E126,E211,E225,E226,E227,E24,E704,E999,W503,W504,W605"

shellcheck_scripts() {
  shellcheck "${SHELLCHECK_FLAGS[@]}" "$@"
}

# Format specified files
format_files() {
    local shell_files=() python_files=() bazel_files=()

    local name
    for name in "$@"; do
      local base="${name%.*}"
      local suffix="${name#"${base}"}"

      local shebang=""
      read -r shebang < "${name}" || true
      case "${shebang}" in
        '#!'*)
          shebang="${shebang#/usr/bin/env }"
          shebang="${shebang%% *}"
          shebang="${shebang##*/}"
          ;;
      esac

      if [ "${base}" = "WORKSPACE" ] || [ "${base}" = "BUILD" ] || [ "${suffix}" = ".BUILD" ] || [ "${suffix}" = ".bazel" ] || [ "${suffix}" = ".bzl" ]; then
        bazel_files+=("${name}")
      elif [ -z "${suffix}" ] && [ "${shebang}" != "${shebang#python}" ] || [ "${suffix}" != "${suffix#.py}" ]; then
        python_files+=("${name}")
      elif [ -z "${suffix}" ] && [ "${shebang}" != "${shebang%sh}" ] || [ "${suffix}" != "${suffix#.sh}" ]; then
        shell_files+=("${name}")
      else
        echo "error: failed to determine file type: ${name}" 1>&2
        return 1
      fi
    done

    if [ 0 -lt "${#python_files[@]}" ]; then
      isort "${python_files[@]}"
      black "${python_files[@]}"
    fi

    if command -v shellcheck >/dev/null; then
      if shellcheck --shell=sh --format=diff - < /dev/null; then
        if [ 0 -lt "${#shell_files[@]}" ]; then
          local difference
          difference="$(shellcheck_scripts --format=diff "${shell_files[@]}" || true && printf "-")"
          difference="${difference%-}"
          printf "%s" "${difference}" | patch -p1
        fi
      else
        echo "error: this version of shellcheck does not support diffs"
      fi
    fi
}

format_all_scripts() {
    # Record whether flake8 is available without aborting under `set -e`.
    HAS_FLAKE8=0
    command -v flake8 &> /dev/null || HAS_FLAKE8=$?

    # Run isort before black to fix imports and let black deal with file format.
    echo "$(date)" "isort...."
    git ls-files -- '*.py' "${GIT_LS_EXCLUDES[@]}" | xargs -P 10 \
      isort
    echo "$(date)" "Black...."
    git ls-files -- '*.py' "${GIT_LS_EXCLUDES[@]}" | xargs -P 10 \
      black "${BLACK_EXCLUDES[@]}"
    if [ "$HAS_FLAKE8" -eq 0 ]; then
      echo "$(date)" "Flake8...."
      git ls-files -- '*.py' "${GIT_LS_EXCLUDES[@]}" | xargs -P 5 \
        flake8 --config=.flake8
    fi

    if command -v shellcheck >/dev/null; then
      local shell_files non_shell_files
      non_shell_files=($(git ls-files -- ':(exclude)*.sh'))
      shell_files=($(git ls-files -- '*.sh'))
      if [ 0 -lt "${#non_shell_files[@]}" ]; then
        shell_files+=($(git --no-pager grep -l -- '^#!\(/usr\)\?/bin/\(env \+\)\?\(ba\)\?sh' "${non_shell_files[@]}" || true))
      fi
      if [ 0 -lt "${#shell_files[@]}" ]; then
        echo "$(date)" "shellcheck scripts...."
        shellcheck_scripts "${shell_files[@]}"
      fi
    fi
}

# Format files that differ from the upstream master branch. Ignores dirs that
# are not slated for autoformat yet.
format_changed() {
    # The `if` guard ensures that the list of filenames is not empty, which
    # could cause the formatter to receive 0 positional arguments, making
    # Black error.
    #
    # `diff-filter=ACRM` and $MERGEBASE is to ensure we only format files that
    # exist on both branches.
    MERGEBASE="$(git merge-base upstream/master HEAD)"

    if ! git diff --diff-filter=ACRM --quiet --exit-code "$MERGEBASE" -- '*.py' &>/dev/null; then
        git diff --name-only --diff-filter=ACRM "$MERGEBASE" -- '*.py' | xargs -P 5 \
            isort
    fi

    if ! git diff --diff-filter=ACRM --quiet --exit-code "$MERGEBASE" -- '*.py' &>/dev/null; then
        git diff --name-only --diff-filter=ACRM "$MERGEBASE" -- '*.py' | xargs -P 5 \
            black "${BLACK_EXCLUDES[@]}"
        if which flake8 >/dev/null; then
            git diff --name-only --diff-filter=ACRM "$MERGEBASE" -- '*.py' | xargs -P 5 \
                 flake8 --config=.flake8
        fi
    fi

    if ! git diff --diff-filter=ACRM --quiet --exit-code "$MERGEBASE" -- '*.pyx' '*.pxd' '*.pxi' &>/dev/null; then
        if which flake8 >/dev/null; then
            git diff --name-only --diff-filter=ACRM "$MERGEBASE" -- '*.pyx' '*.pxd' '*.pxi' | xargs -P 5 \
                 flake8 --config=.flake8 "$FLAKE8_PYX_IGNORES"
        fi
    fi

    if which clang-format >/dev/null; then
        if ! git diff --diff-filter=ACRM --quiet --exit-code "$MERGEBASE" -- '*.cc' '*.h' &>/dev/null; then
            git diff --name-only --diff-filter=ACRM "$MERGEBASE" -- '*.cc' '*.h' | xargs -P 5 \
                 clang-format -i
        fi
    fi

    if command -v shellcheck >/dev/null; then
        local shell_files non_shell_files
        non_shell_files=($(git diff --name-only --diff-filter=ACRM "$MERGEBASE" -- ':(exclude)*.sh'))
        shell_files=($(git diff --name-only --diff-filter=ACRM "$MERGEBASE" -- '*.sh'))
        if [ 0 -lt "${#non_shell_files[@]}" ]; then
            shell_files+=($(git --no-pager grep -l -- '^#!\(/usr\)\?/bin/\(env \+\)\?\(ba\)\?sh' "${non_shell_files[@]}" || true))
        fi
        if [ 0 -lt "${#shell_files[@]}" ]; then
            shellcheck_scripts "${shell_files[@]}"
        fi
    fi
}

# This flag formats individual files. --files *must* be the first command line
# arg to use this option.
if [ "${1-}" == '--files' ]; then
    format_files "${@:2}"
# If `--all` or `--all-scripts` is passed, then any further arguments are ignored.
# Format the entire python directory and other scripts.
elif [ "${1-}" == '--all-scripts' ]; then
    format_all_scripts "${@}"
    if [ -n "${FORMAT_SH_PRINT_DIFF-}" ]; then git --no-pager diff; fi
# Format all Python, C++, Java and other script files.
elif [ "${1-}" == '--all' ]; then
    format_all_scripts "${@}"
    if [ -n "${FORMAT_SH_PRINT_DIFF-}" ]; then git --no-pager diff; fi
else
    # Add the upstream remote if it doesn't exist
    if ! git remote -v | grep -q upstream; then
        git remote add 'upstream' 'https://github.com/ray-project/xgboost_ray.git'
    fi

    # Only fetch master since that's the branch we're diffing against.
    git fetch upstream master || true

    # Format only the files that changed since the merge base with upstream/master.
    format_changed
fi

check_docstyle

if ! git diff --quiet &>/dev/null; then
    echo 'Reformatted changed files. Please review and stage the changes.'
    echo 'Files updated:'
    echo

    git --no-pager diff --name-only

    exit 1
fi


================================================
FILE: requirements/lint-requirements.txt
================================================
flake8==3.9.1
flake8-comprehensions==3.10.1
flake8-quotes==2.0.0
flake8-bugbear==21.9.2
black==22.10.0
isort==5.10.1
importlib-metadata==4.13.0


================================================
FILE: requirements/test-requirements.txt
================================================
packaging
petastorm
pytest
pyarrow<15.0.0
ray[tune, data, default]
scikit-learn
# modin==0.23.1.post0 is not compatible with xgboost_ray py38
modin<=0.23.1; python_version == '3.8'
# modin==0.26.0 is not compatible with xgboost_ray py39+
modin<0.26.0; python_version > '3.8'
dask

# workaround for now
protobuf<4.0.0
tensorboardX==2.2


================================================
FILE: run_ci_examples.sh
================================================
#!/bin/bash
set -e

TUNE=1

for i in "$@"
do
echo "$i"
case "$i" in
    --no-tune)
    TUNE=0
    ;;
    *)
    echo "unknown arg, $i"
    exit 1
    ;;
esac
done

pushd xgboost_ray/examples/ || exit 1
ray stop || true
echo "================"
echo "Running examples"
echo "================"
echo "running readme.py" && python readme.py
echo "running readme_sklearn_api.py" && python readme_sklearn_api.py
echo "running simple.py" && python simple.py --smoke-test
echo "running simple_predict.py" && python simple_predict.py
echo "running simple_dask.py" && python simple_dask.py --smoke-test
echo "running simple_modin.py" && python simple_modin.py --smoke-test
echo "running simple_objectstore.py" && python simple_objectstore.py --smoke-test
echo "running simple_ray_dataset.py" && python simple_ray_dataset.py --smoke-test
echo "running simple_partitioned.py" && python simple_partitioned.py --smoke-test

if [ "$TUNE" = "1" ]; then
  echo "running simple_tune.py" && python simple_tune.py --smoke-test
else
  echo "skipping tune example"
fi

echo "running train_on_test_data.py" && python train_on_test_data.py --smoke-test
popd

pushd xgboost_ray/tests || exit 1
echo "running examples with Ray Client"
python -m pytest -v --durations=0 -x test_client.py
popd || exit 1


================================================
FILE: run_ci_tests.sh
================================================
#!/bin/bash
TUNE=1

for i in "$@"
do
echo "$i"
case "$i" in
    --no-tune)
    TUNE=0
    ;;
    *)
    echo "unknown arg, $i"
    exit 1
    ;;
esac
done

pushd xgboost_ray/tests || exit 1
echo "============="
echo "Running tests"
echo "============="
END_STATUS=0
if ! python -m pytest -vv -s --log-cli-level=DEBUG --durations=0 -x "test_colocation.py" ; then END_STATUS=1; fi
if ! python -m pytest -v --durations=0 -x "test_matrix.py" ; then END_STATUS=1; fi
if ! python -m pytest -v --durations=0 -x "test_data_source.py" ; then END_STATUS=1; fi
if ! python -m pytest -v --durations=0 -x "test_xgboost_api.py" ; then END_STATUS=1; fi
if ! python -m pytest -v --durations=0 -x "test_fault_tolerance.py" ; then END_STATUS=1; fi
if ! python -m pytest -v --durations=0 -x "test_end_to_end.py" ; then END_STATUS=1; fi
if ! python -m pytest -v --durations=0 -x "test_sklearn.py" ; then END_STATUS=1; fi
if ! python -m pytest -v --durations=0 -x "test_sklearn_matrix.py" ; then END_STATUS=1; fi

if [ "$TUNE" = "1" ]; then
  if ! python -m pytest -v --durations=0 -x "test_tune.py" ; then END_STATUS=1; fi
else
  echo "skipping tune tests"
fi

echo "running smoke test on benchmark_cpu_gpu.py" && if ! python release/benchmark_cpu_gpu.py 2 10 20 --smoke-test; then END_STATUS=1; fi
popd || exit 1

if [ "$END_STATUS" = "1" ]; then
  echo "At least one test has failed, exiting with code 1"
fi
exit "$END_STATUS"

================================================
FILE: setup.py
================================================
from setuptools import find_packages, setup

setup(
    name="xgboost_ray",
    # `include` expects a sequence of glob patterns, not a bare string.
    packages=find_packages(where=".", include=["xgboost_ray*"]),
    version="0.1.20",
    author="Ray Team",
    description="A Ray backend for distributed XGBoost",
    license="Apache 2.0",
    long_description="A distributed backend for XGBoost built on top of "
    "the distributed computing framework Ray.",
    url="https://github.com/ray-project/xgboost_ray",
    install_requires=[
        "ray>=2.7",
        "numpy>=1.16",
        "pandas",
        "wrapt>=1.12.1",
        "xgboost>=0.90",
        "packaging",
    ],
)


================================================
FILE: xgboost_ray/__init__.py
================================================
from xgboost_ray.main import RayParams, predict, train
from xgboost_ray.matrix import (
    Data,
    RayDeviceQuantileDMatrix,
    RayDMatrix,
    RayFileType,
    RayShardingMode,
    combine_data,
)

# workaround for legacy xgboost==0.90
try:
    from xgboost_ray.sklearn import (
        RayXGBClassifier,
        RayXGBRanker,
        RayXGBRegressor,
        RayXGBRFClassifier,
        RayXGBRFRegressor,
    )
except ImportError:
    pass

__version__ = "0.1.20"

__all__ = [
    "__version__",
    "RayParams",
    "RayDMatrix",
    "RayDeviceQuantileDMatrix",
    "RayFileType",
    "RayShardingMode",
    "Data",
    "combine_data",
    "train",
    "predict",
    "RayXGBClassifier",
    "RayXGBRegressor",
    "RayXGBRFClassifier",
    "RayXGBRFRegressor",
    "RayXGBRanker",
]


================================================
FILE: xgboost_ray/callback.py
================================================
import os
from abc import ABC
from typing import TYPE_CHECKING, Any, Dict, Sequence, Union

import pandas as pd
from ray.util.annotations import DeveloperAPI, PublicAPI

if TYPE_CHECKING:
    from xgboost_ray.main import RayXGBoostActor
    from xgboost_ray.matrix import RayDMatrix


@PublicAPI(stability="beta")
class DistributedCallback(ABC):
    """Distributed callbacks for RayXGBoostActors.

    The hooks of these callbacks are executed on the remote Ray actors
    at different points in time. They can be used to set environment
    variables or to prepare the training/prediction environment in other
    ways. Distributed callback objects are de-serialized on each actor
    and are then independent of each other - changing the state of one
    callback will not alter the state of the other copies on different actors.

    Callbacks can be passed to xgboost_ray via
    :class:`RayParams <xgboost_ray.main.RayParams>` using the
    ``distributed_callbacks`` parameter.
    """

    def on_init(self, actor: "RayXGBoostActor", *args, **kwargs):
        pass

    def before_data_loading(
        self, actor: "RayXGBoostActor", data: "RayDMatrix", *args, **kwargs
    ):
        pass

    def after_data_loading(
        self, actor: "RayXGBoostActor", data: "RayDMatrix", *args, **kwargs
    ):
        pass

    def before_train(self, actor: "RayXGBoostActor", *args, **kwargs):
        pass

    def after_train(self, actor: "RayXGBoostActor", result_dict: Dict, *args, **kwargs):
        pass

    def before_predict(self, actor: "RayXGBoostActor", *args, **kwargs):
        pass

    def after_predict(
        self,
        actor: "RayXGBoostActor",
        predictions: Union[pd.Series, pd.DataFrame],
        *args,
        **kwargs
    ):
        pass


@DeveloperAPI
class DistributedCallbackContainer:
    """Forward each hook to every registered callback, in order."""

    def __init__(self, callbacks: Sequence[DistributedCallback]):
        self.callbacks = callbacks or []

    def on_init(self, actor: "RayXGBoostActor", *args, **kwargs):
        for callback in self.callbacks:
            callback.on_init(actor, *args, **kwargs)

    def before_data_loading(
        self, actor: "RayXGBoostActor", data: "RayDMatrix", *args, **kwargs
    ):
        for callback in self.callbacks:
            callback.before_data_loading(actor, data, *args, **kwargs)

    def after_data_loading(
        self, actor: "RayXGBoostActor", data: "RayDMatrix", *args, **kwargs
    ):
        for callback in self.callbacks:
            callback.after_data_loading(actor, data, *args, **kwargs)

    def before_train(self, actor: "RayXGBoostActor", *args, **kwargs):
        for callback in self.callbacks:
            callback.before_train(actor, *args, **kwargs)

    def after_train(self, actor: "RayXGBoostActor", result_dict: Dict, *args, **kwargs):
        for callback in self.callbacks:
            callback.after_train(actor, result_dict, *args, **kwargs)

    def before_predict(self, actor: "RayXGBoostActor", *args, **kwargs):
        for callback in self.callbacks:
            callback.before_predict(actor, *args, **kwargs)

    def after_predict(
        self,
        actor: "RayXGBoostActor",
        predictions: Union[pd.Series, pd.DataFrame],
        *args,
        **kwargs
    ):
        for callback in self.callbacks:
            callback.after_predict(actor, predictions, *args, **kwargs)


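class EnvironmentCallback(DistributedCallback):
    """Set environment variables on each actor when it is initialized."""

    def __init__(self, env_dict: Dict[str, Any]):
        self.env_dict = env_dict

    def on_init(self, actor, *args, **kwargs):
        # `os.environ` only accepts string values, so cast explicitly.
        os.environ.update({k: str(v) for k, v in self.env_dict.items()})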


================================================
FILE: xgboost_ray/compat/__init__.py
================================================
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from xgboost_ray.xgb import xgboost as xgb

try:
    from xgboost.callback import TrainingCallback

    LEGACY_CALLBACK = False
except ImportError:

    class TrainingCallback:
        def __init__(self):
            if hasattr(self, "before_iteration"):
                # XGBoost < 1.0 is looking up __dict__ to see if a
                # callback should be called before or after an iteration.
                # So here we move this to self._before_iteration and
                # overwrite the dict.
                self._before_iteration = getattr(self, "before_iteration")
                self.__dict__["before_iteration"] = True

        def __call__(self, callback_env: "xgb.core.CallbackEnv"):
            if hasattr(self, "_before_iteration"):
                self._before_iteration(
                    model=callback_env.model,
                    epoch=callback_env.iteration,
                    evals_log=callback_env.evaluation_result_list,
                )

            if hasattr(self, "after_iteration"):
                self.after_iteration(
                    model=callback_env.model,
                    epoch=callback_env.iteration,
                    evals_log=callback_env.evaluation_result_list,
                )

        def before_training(self, model):
            pass

        def after_training(self, model):
            pass

    LEGACY_CALLBACK = True

try:
    from xgboost import RabitTracker
except ImportError:
    from xgboost_ray.compat.tracker import RabitTracker

__all__ = ["TrainingCallback", "RabitTracker"]


================================================
FILE: xgboost_ray/compat/tracker.py
================================================
# flake8: noqa

# Copyright 2021 by XGBoost Contributors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# File copied from:
# https://github.com/dmlc/xgboost/blob/8760ec48277b345aaaa895b82570c25566fc0503/python-package/xgboost/tracker.py
#
# License:
# https://github.com/dmlc/xgboost/blob/8760ec48277b345aaaa895b82570c25566fc0503/LICENSE

import logging
import socket
import struct
import time
from threading import Thread


class ExSocket(object):
    """
    Extension of socket to handle recv and send of special data
    """

    def __init__(self, sock):
        self.sock = sock

    def recvall(self, nbytes):
        res = []
        nread = 0
        while nread < nbytes:
            chunk = self.sock.recv(min(nbytes - nread, 1024))
            nread += len(chunk)
            res.append(chunk)
        return b"".join(res)

    def recvint(self):
        return struct.unpack("@i", self.recvall(4))[0]

    def sendint(self, n):
        self.sock.sendall(struct.pack("@i", n))

    def sendstr(self, s):
        self.sendint(len(s))
        self.sock.sendall(s.encode())

    def recvstr(self):
        slen = self.recvint()
        return self.recvall(slen).decode()


# magic number used to verify existence of data
kMagic = 0xFF99


def get_some_ip(host):
    return socket.getaddrinfo(host, None)[0][4][0]


def get_host_ip(hostIP=None):
    if hostIP is None or hostIP == "auto":
        hostIP = "ip"

    if hostIP == "dns":
        hostIP = socket.getfqdn()
    elif hostIP == "ip":
        from socket import gaierror

        try:
            hostIP = socket.gethostbyname(socket.getfqdn())
        except gaierror:
            logging.debug(
                "gethostbyname(socket.getfqdn()) failed... trying on hostname()"
            )
            hostIP = socket.gethostbyname(socket.gethostname())
        if hostIP.startswith("127."):
            s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
            # doesn't have to be reachable
            s.connect(("10.255.255.255", 1))
            hostIP = s.getsockname()[0]
    return hostIP


def get_family(addr):
    return socket.getaddrinfo(addr, None)[0][0]


class SlaveEntry(object):
    def __init__(self, sock, s_addr):
        slave = ExSocket(sock)
        self.sock = slave
        self.host = get_some_ip(s_addr[0])
        magic = slave.recvint()
        assert magic == kMagic, "invalid magic number=%d from %s" % (magic, self.host)
        slave.sendint(kMagic)
        self.rank = slave.recvint()
        self.world_size = slave.recvint()
        self.jobid = slave.recvstr()
        self.cmd = slave.recvstr()
        self.wait_accept = 0
        self.port = None

    def decide_rank(self, job_map):
        if self.rank >= 0:
            return self.rank
        if self.jobid != "NULL" and self.jobid in job_map:
            return job_map[self.jobid]
        return -1

    def assign_rank(self, rank, wait_conn, tree_map, parent_map, ring_map):
        self.rank = rank
        nnset = set(tree_map[rank])
        rprev, rnext = ring_map[rank]
        self.sock.sendint(rank)
        # send parent rank
        self.sock.sendint(parent_map[rank])
        # send world size
        self.sock.sendint(len(tree_map))
        self.sock.sendint(len(nnset))
        # send the ranks of the tree neighbors
        for r in nnset:
            self.sock.sendint(r)
        # send prev link
        if rprev not in (-1, rank):
            nnset.add(rprev)
            self.sock.sendint(rprev)
        else:
            self.sock.sendint(-1)
        # send next link
        if rnext not in (-1, rank):
            nnset.add(rnext)
            self.sock.sendint(rnext)
        else:
            self.sock.sendint(-1)
        while True:
            ngood = self.sock.recvint()
            goodset = set([])
            for _ in range(ngood):
                goodset.add(self.sock.recvint())
            assert goodset.issubset(nnset)
            badset = nnset - goodset
            conset = []
            for r in badset:
                if r in wait_conn:
                    conset.append(r)
            self.sock.sendint(len(conset))
            self.sock.sendint(len(badset) - len(conset))
            for r in conset:
                self.sock.sendstr(wait_conn[r].host)
                self.sock.sendint(wait_conn[r].port)
                self.sock.sendint(r)
            nerr = self.sock.recvint()
            if nerr != 0:
                continue
            self.port = self.sock.recvint()
            rmset = []
            # all connections were successfully set up
            for r in conset:
                wait_conn[r].wait_accept -= 1
                if wait_conn[r].wait_accept == 0:
                    rmset.append(r)
            for r in rmset:
                wait_conn.pop(r, None)
            self.wait_accept = len(badset) - len(conset)
            return rmset


class RabitTracker(object):
    """
    tracker for rabit
    """

    def __init__(self, hostIP, nslave, port=9091, port_end=9999):
        sock = socket.socket(get_family(hostIP), socket.SOCK_STREAM)
        for _port in range(port, port_end):
            try:
                sock.bind((hostIP, _port))
                self.port = _port
                break
            except socket.error as e:
                if e.errno in [98, 48]:
                    continue
                raise
        sock.listen(256)
        self.sock = sock
        self.hostIP = hostIP
        self.thread = None
        self.start_time = None
        self.end_time = None
        self.nslave = nslave
        logging.info("start listen on %s:%d", hostIP, self.port)

    def __del__(self):
        self.sock.close()

    @staticmethod
    def get_neighbor(rank, nslave):
        rank = rank + 1
        ret = []
        if rank > 1:
            ret.append(rank // 2 - 1)
        if rank * 2 - 1 < nslave:
            ret.append(rank * 2 - 1)
        if rank * 2 < nslave:
            ret.append(rank * 2)
        return ret

    def slave_envs(self):
        """
        get environment variables for slaves
        can be passed in as args or envs
        """
        return {"DMLC_TRACKER_URI": self.hostIP, "DMLC_TRACKER_PORT": self.port}

    def get_tree(self, nslave):
        tree_map = {}
        parent_map = {}
        for r in range(nslave):
            tree_map[r] = self.get_neighbor(r, nslave)
            parent_map[r] = (r + 1) // 2 - 1
        return tree_map, parent_map

    def find_share_ring(self, tree_map, parent_map, r):
        """
        get a ring structure that tends to share nodes with the tree
        return a list starting from r
        """
        nset = set(tree_map[r])
        cset = nset - set([parent_map[r]])
        if not cset:
            return [r]
        rlst = [r]
        cnt = 0
        for v in cset:
            vlst = self.find_share_ring(tree_map, parent_map, v)
            cnt += 1
            if cnt == len(cset):
                vlst.reverse()
            rlst += vlst
        return rlst

    def get_ring(self, tree_map, parent_map):
        """
        get a ring connection used to recover local data
        """
        assert parent_map[0] == -1
        rlst = self.find_share_ring(tree_map, parent_map, 0)
        assert len(rlst) == len(tree_map)
        ring_map = {}
        nslave = len(tree_map)
        for r in range(nslave):
            rprev = (r + nslave - 1) % nslave
            rnext = (r + 1) % nslave
            ring_map[rlst[r]] = (rlst[rprev], rlst[rnext])
        return ring_map

    def get_link_map(self, nslave):
        """
        get the link map, this is a bit hacky, call for better algorithm
        to place similar nodes together
        """
        tree_map, parent_map = self.get_tree(nslave)
        ring_map = self.get_ring(tree_map, parent_map)
        rmap = {0: 0}
        k = 0
        for i in range(nslave - 1):
            k = ring_map[k][1]
            rmap[k] = i + 1

        ring_map_ = {}
        tree_map_ = {}
        parent_map_ = {}
        for k, v in ring_map.items():
            ring_map_[rmap[k]] = (rmap[v[0]], rmap[v[1]])
        for k, v in tree_map.items():
            tree_map_[rmap[k]] = [rmap[x] for x in v]
        for k, v in parent_map.items():
            if k != 0:
                parent_map_[rmap[k]] = rmap[v]
            else:
                parent_map_[rmap[k]] = -1
        return tree_map_, parent_map_, ring_map_

    def accept_slaves(self, nslave):
        # set of nodes that have finished the job
        shutdown = {}
        # set of nodes that are waiting for connections
        wait_conn = {}
        # maps job id to rank
        job_map = {}
        # list of workers that are pending rank assignment
        pending = []
        # lazy initialize tree_map
        tree_map = None

        while len(shutdown) != nslave:
            fd, s_addr = self.sock.accept()
            s = SlaveEntry(fd, s_addr)
            if s.cmd == "print":
                msg = s.sock.recvstr()
                print(msg.strip(), flush=True)
                continue
            if s.cmd == "shutdown":
                assert s.rank >= 0 and s.rank not in shutdown
                assert s.rank not in wait_conn
                shutdown[s.rank] = s
                logging.debug("Received %s signal from %d", s.cmd, s.rank)
                continue
            assert s.cmd == "start" or s.cmd == "recover"
            # lazily initialize the slaves
            if tree_map is None:
                assert s.cmd == "start"
                if s.world_size > 0:
                    nslave = s.world_size
                tree_map, parent_map, ring_map = self.get_link_map(nslave)
                # list of ranks that are still pending startup
                todo_nodes = list(range(nslave))
            else:
                assert s.world_size == -1 or s.world_size == nslave
            if s.cmd == "recover":
                assert s.rank >= 0

            rank = s.decide_rank(job_map)
            # batch assignment of ranks
            if rank == -1:
                assert todo_nodes
                pending.append(s)
                if len(pending) == len(todo_nodes):
                    pending.sort(key=lambda x: x.host)
                    for s in pending:
                        rank = todo_nodes.pop(0)
                        if s.jobid != "NULL":
                            job_map[s.jobid] = rank
                        s.assign_rank(rank, wait_conn, tree_map, parent_map, ring_map)
                        if s.wait_accept > 0:
                            wait_conn[rank] = s
                        logging.debug(
                            "Received %s signal from %s; assign rank %d",
                            s.cmd,
                            s.host,
                            s.rank,
                        )
                if not todo_nodes:
                    logging.info("@tracker All of %d nodes getting started", nslave)
                    self.start_time = time.time()
            else:
                s.assign_rank(rank, wait_conn, tree_map, parent_map, ring_map)
                logging.debug("Received %s signal from %d", s.cmd, s.rank)
                if s.wait_accept > 0:
                    wait_conn[rank] = s
        logging.info("@tracker All nodes finishes job")
        self.end_time = time.time()
        logging.info(
            "@tracker %s secs between node start and job finish",
            str(self.end_time - self.start_time),
        )

    def start(self, nslave):
        def run():
            self.accept_slaves(nslave)

        self.thread = Thread(target=run, args=(), daemon=True)
        self.thread.start()

    def join(self):
        while self.thread.is_alive():
            self.thread.join(100)

    def alive(self):
        return self.thread.is_alive()


================================================
FILE: xgboost_ray/data_sources/__init__.py
================================================
from xgboost_ray.data_sources.csv import CSV
from xgboost_ray.data_sources.dask import Dask
from xgboost_ray.data_sources.data_source import DataSource, RayFileType
from xgboost_ray.data_sources.modin import Modin
from xgboost_ray.data_sources.numpy import Numpy
from xgboost_ray.data_sources.object_store import ObjectStore
from xgboost_ray.data_sources.pandas import Pandas
from xgboost_ray.data_sources.parquet import Parquet
from xgboost_ray.data_sources.partitioned import Partitioned
from xgboost_ray.data_sources.petastorm import Petastorm
from xgboost_ray.data_sources.ray_dataset import RayDataset

data_sources = [
    Numpy,
    Pandas,
    Partitioned,
    Modin,
    Dask,
    Petastorm,
    CSV,
    Parquet,
    ObjectStore,
    RayDataset,
]

__all__ = [
    "DataSource",
    "RayFileType",
    "Numpy",
    "Pandas",
    "Modin",
    "Dask",
    "Petastorm",
    "CSV",
    "Parquet",
    "ObjectStore",
    "RayDataset",
    "Partitioned",
]


================================================
FILE: xgboost_ray/data_sources/_distributed.py
================================================
import itertools
import math
from collections import defaultdict
from typing import Any, Dict, Sequence

import ray
from ray.actor import ActorHandle


def get_actor_rank_ips(actors: Sequence[ActorHandle]) -> Dict[int, str]:
    """Get a dict mapping from actor ranks to their IPs"""
    no_obj = ray.put(None)
    # Build a dict mapping actor ranks to their IP addresses
    actor_rank_ips: Dict[int, str] = dict(
        enumerate(
            ray.get(
                [actor.ip.remote() if actor is not None else no_obj for actor in actors]
            )
        )
    )
    return actor_rank_ips


def assign_partitions_to_actors(
    ip_to_parts: Dict[str, Any], actor_rank_ips: Dict[int, str]
) -> Dict[int, Sequence[Any]]:
    """Assign partitions from a distributed dataframe to actors.

    This function collects distributed partitions and evenly distributes
    them to actors, trying to minimize data transfer by respecting
    co-locality.

    This function currently does _not_ take partition sizes into account
    for distributing data. It assumes that all partitions have (more or less)
    the same length.

    Instead, partitions are evenly distributed. E.g. for 8 partitions and 3
    actors, each actor gets assigned 2 or 3 partitions. Which partitions are
    assigned depends on the data locality.

    The algorithm is as follows: For any number of data partitions, get the
    Ray object references to the shards and the IP addresses where they
    currently live.

    Calculate the minimum and maximum number of partitions per actor. These
    numbers should differ by at most 1. Also calculate how many actors will
    get more partitions assigned than the other actors.

    First, each actor gets assigned up to ``max_parts_per_actor`` co-located
    partitions. Only up to ``num_actors_with_max_parts`` actors get the
    maximum number of partitions, the rest try to fill the minimum.

    The rest of the partitions (all of which cannot be assigned to a
    co-located actor) are assigned to actors until there are none left.
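
    As a minimal worked sketch of the split described above (illustrative
    only; it reuses the 8 partitions / 3 actors example and the same formulas
    as the function body, while ``sizes`` is a name introduced just for this
    example), the per-actor partition counts can be computed as:

    ```python
    import math

    num_partitions, num_actors = 8, 3

    # Same bounds as computed at the start of the function body.
    min_parts_per_actor = max(0, math.floor(num_partitions / num_actors))
    max_parts_per_actor = max(1, math.ceil(num_partitions / num_actors))
    num_actors_with_max_parts = num_partitions % num_actors

    # Hypothetical helper list: how many partitions each actor ends up with.
    sizes = [max_parts_per_actor] * num_actors_with_max_parts
    sizes += [min_parts_per_actor] * (num_actors - num_actors_with_max_parts)
    ```

    Here two actors end up with 3 partitions each and one actor with 2. Which
    concrete partitions each actor receives still depends on data locality.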
    """
    num_partitions = sum(len(parts) for parts in ip_to_parts.values())
    num_actors = len(actor_rank_ips)
    min_parts_per_actor = max(0, math.floor(num_partitions / num_actors))
    max_parts_per_actor = max(1, math.ceil(num_partitions / num_actors))
    num_actors_with_max_parts = num_partitions % num_actors

    # This is our result dict that maps actor ranks to a list of partitions
    actor_to_partitions = defaultdict(list)

    # First we loop through the actors and assign them partitions from their
    # own IPs. Do this until no further co-located partition can be assigned.
    partition_assigned = True
    while partition_assigned:
        partition_assigned = False

        # Loop through each actor once, assigning at most one partition each
        for rank, actor_ip in actor_rank_ips.items():
            num_parts_left_on_ip = len(ip_to_parts[actor_ip])
            num_actor_parts = len(actor_to_partitions[rank])

            if num_parts_left_on_ip > 0 and num_actor_parts < max_parts_per_actor:
                if num_actor_parts >= min_parts_per_actor:
                    # Only allow up to `num_actors_with_max_parts` actors to
                    # have the maximum number of partitions assigned.
                    if num_actors_with_max_parts <= 0:
                        continue
                    num_actors_with_max_parts -= 1
                actor_to_partitions[rank].append(ip_to_parts[actor_ip].pop(0))
                partition_assigned = True

    # The rest of the partitions, no matter where they are located, could not
    # be assigned to co-located actors. Thus, we assign them
    # to actors who still need partitions.
    rest_parts = list(itertools.chain(*ip_to_parts.values()))
    partition_assigned = True
    while len(rest_parts) > 0 and partition_assigned:
        partition_assigned = False
        for rank in actor_rank_ips:
            num_actor_parts = len(actor_to_partitions[rank])
            if num_actor_parts < max_parts_per_actor:
                if num_actor_parts >= min_parts_per_actor:
                    if num_actors_with_max_parts <= 0:
                        continue
                    num_actors_with_max_parts -= 1
                actor_to_partitions[rank].append(rest_parts.pop(0))
                partition_assigned = True
            if len(rest_parts) <= 0:
                break

    if len(rest_parts) != 0:
        raise RuntimeError(
            "There are still partitions left to assign, but no actor "
            "has capacity for more. This is probably a bug. Please go "
            "to https://github.com/ray-project/xgboost_ray to report it."
        )

    return actor_to_partitions


================================================
FILE: xgboost_ray/data_sources/csv.py
================================================
from typing import Any, Iterable, Optional, Sequence, Union

import pandas as pd

from xgboost_ray.data_sources.data_source import DataSource, RayFileType
from xgboost_ray.data_sources.pandas import Pandas


class CSV(DataSource):
    """Read one or many CSV files."""

    supports_central_loading = True
    supports_distributed_loading = True

    @staticmethod
    def is_data_type(data: Any, filetype: Optional[RayFileType] = None) -> bool:
        return filetype == RayFileType.CSV

    @staticmethod
    def get_filetype(data: Any) -> Optional[RayFileType]:
        if data.endswith(".csv") or data.endswith("csv.gz"):
            return RayFileType.CSV
        return None

    @staticmethod
    def load_data(
        data: Union[str, Sequence[str]],
        ignore: Optional[Sequence[str]] = None,
        indices: Optional[Sequence[int]] = None,
        **kwargs
    ):
        if isinstance(data, Iterable) and not isinstance(data, str):
            shards = []

            for i, shard in enumerate(data):
                if indices and i not in indices:
                    continue
                shard_df = pd.read_csv(shard, **kwargs)
                shards.append(Pandas.load_data(shard_df, ignore=ignore))
            return pd.concat(shards, copy=False)
        else:
            local_df = pd.read_csv(data, **kwargs)
            return Pandas.load_data(local_df, ignore=ignore)

    @staticmethod
    def get_n(data: Any):
        return len(list(data))


================================================
FILE: xgboost_ray/data_sources/dask.py
================================================
from collections import defaultdict
from typing import Any, Dict, List, Optional, Sequence, Tuple, Union

import pandas as pd
import ray
import wrapt
from ray.actor import ActorHandle

from xgboost_ray.data_sources._distributed import (
    assign_partitions_to_actors,
    get_actor_rank_ips,
)
from xgboost_ray.data_sources.data_source import DataSource, RayFileType

try:
    import dask  # noqa: F401
    from ray.util.dask import ray_dask_get

    DASK_INSTALLED = True
except ImportError:
    DASK_INSTALLED = False


def _assert_dask_installed():
    if not DASK_INSTALLED:
        raise RuntimeError(
            "Tried to use Dask as a data source, but dask is not "
            "installed. This function shouldn't have been called. "
            "\nFIX THIS by installing dask: `pip install dask`. "
            "\nPlease also raise an issue on our GitHub: "
            "https://github.com/ray-project/xgboost_ray as this part of "
            "the code should not have been reached."
        )


@wrapt.decorator
def ensure_ray_dask_initialized(
    func: Any, instance: Any, args: List[Any], kwargs: Any
) -> Any:
    _assert_dask_installed()
    dask.config.set(scheduler=ray_dask_get)
    return func(*args, **kwargs)


class Dask(DataSource):
    """Read from distributed Dask dataframe.

    A `Dask dataframe <https://docs.dask.org/en/latest/dataframe.html>`_
    is a distributed drop-in replacement for pandas.

    Dask dataframes are split into partitions that can reside on multiple
    nodes, making them suitable for distributed loading.
    """

    supports_central_loading = True
    supports_distributed_loading = True

    @staticmethod
    def is_data_type(data: Any, filetype: Optional[RayFileType] = None) -> bool:
        if not DASK_INSTALLED:
            return False
        from dask.dataframe import DataFrame as DaskDataFrame
        from dask.dataframe import Series as DaskSeries

        return isinstance(data, (DaskDataFrame, DaskSeries))

    @ensure_ray_dask_initialized
    @staticmethod
    def load_data(
        data: Any,  # dask.dataframe.DataFrame
        ignore: Optional[Sequence[str]] = None,
        indices: Optional[Union[Sequence[int], Sequence[Tuple[int]]]] = None,
        **kwargs
    ) -> pd.DataFrame:
        _assert_dask_installed()

        import dask.dataframe as dd

        if indices is not None and len(indices) > 0 and isinstance(indices[0], tuple):
            # We got a list of partition IDs belonging to Dask partitions
            return dd.concat([data.partitions[i] for (i,) in indices]).compute()

        # Dask does not support positional row selection via ``iloc``, so we
        # have to compute a local pandas dataframe first
        local_df = data.compute()

        if indices:
            local_df = local_df.iloc[indices]

        if ignore:
            local_df = local_df[local_df.columns.difference(ignore)]

        return local_df

    @ensure_ray_dask_initialized
    @staticmethod
    def convert_to_series(data: Any) -> pd.Series:
        _assert_dask_installed()
        from dask.array import Array as DaskArray
        from dask.dataframe import DataFrame as DaskDataFrame
        from dask.dataframe import Series as DaskSeries

        if isinstance(data, DaskDataFrame):
            return pd.Series(data.compute().squeeze())
        elif isinstance(data, DaskSeries):
            return data.compute()
        elif isinstance(data, DaskArray):
            return pd.Series(data.compute())

        return DataSource.convert_to_series(data)

    @ensure_ray_dask_initialized
    @staticmethod
    def get_actor_shards(
        data: Any, actors: Sequence[ActorHandle]  # dask.dataframe.DataFrame
    ) -> Tuple[Any, Optional[Dict[int, Any]]]:
        _assert_dask_installed()

        actor_rank_ips = get_actor_rank_ips(actors)

        # Get IPs and partitions
        ip_to_parts = get_ip_to_parts(data)

        return data, assign_partitions_to_actors(ip_to_parts, actor_rank_ips)

    @ensure_ray_dask_initialized
    @staticmethod
    def get_n(data: Any):
        """
        For naive distributed loading we just return the number of rows
        here. Loading by shard is achieved via `get_actor_shards()`.
        """
        return len(data)


def get_ip_to_parts(data: Any) -> Dict[str, Sequence[Any]]:
    """Map node IP addresses to the Dask partition IDs located on them."""
    persisted = data.persist(scheduler=ray_dask_get)
    name = persisted._name

    node_ids_to_node = {node["NodeID"]: node for node in ray.state.nodes()}

    # This is a hacky way to get the partition node IDs, and it's not
    # 100% accurate as the map task could get scheduled on a different node
    # (though Ray tries to keep locality). We need to use that until
    # ray.state.objects() or something like it is available again.
    partition_locations_df = persisted.map_partitions(
        lambda df: pd.DataFrame([ray.get_runtime_context().get_node_id()])
    ).compute()
    partition_locations = [
        partition_locations_df[0].iloc[i] for i in range(partition_locations_df.size)
    ]

    ip_to_parts = defaultdict(list)
    for (obj_name, pid), obj_ref in dask.base.collections_to_dsk([persisted]).items():
        assert obj_name == name

        if isinstance(obj_ref, ray.ObjectRef):
            node_id = partition_locations[pid]
            node = node_ids_to_node.get(node_id, {})
            ip = node.get("NodeManagerAddress", "_no_ip")
        else:
            ip = "_no_ip"

        # Pass tuples here (integers can be misinterpreted as row numbers)
        ip_to_parts[ip].append((pid,))

    return ip_to_parts


================================================
FILE: xgboost_ray/data_sources/data_source.py
================================================
from enum import Enum
from typing import TYPE_CHECKING, Any, Dict, List, Optional, Sequence, Tuple, Union

import pandas as pd
from ray.actor import ActorHandle
from ray.util.annotations import PublicAPI

if TYPE_CHECKING:
    from xgboost_ray.xgb import xgboost as xgb


@PublicAPI(stability="beta")
class RayFileType(Enum):
    """Enum for different file types (used for overrides)."""

    CSV = 1
    PARQUET = 2
    PETASTORM = 3


@PublicAPI(stability="beta")
class DataSource:
    """Abstract class for data sources.

    xgboost_ray supports reading from various sources, such as files
    (e.g. CSV, Parquet) or distributed datasets (e.g. Modin, Dask).

    This abstract class defines an interface to read from these sources.
    New data sources can be added by implementing this interface.

    ``DataSource`` classes are not instantiated. Instead, static and
    class methods are called directly.
    """

    supports_central_loading = True
    supports_distributed_loading = False
    needs_partitions = True

    @staticmethod
    def is_data_type(data: Any, filetype: Optional[RayFileType] = None) -> bool:
        """Check if the supplied data matches this data source.

        Args:
            data: Dataset.
            filetype: RayFileType of the provided
                dataset. Some DataSource implementations might require
                that this is explicitly set (e.g. if multiple sources can
                read CSV files).

        Returns:
            Boolean indicating if this data source belongs to/is compatible
                with the data.
        """
        return False

    @staticmethod
    def get_filetype(data: Any) -> Optional[RayFileType]:
        """Method to help infer the filetype.

        Returns None if the supplied data type (usually a filename)
        is not covered by this data source, otherwise the filetype
        is returned.

        Args:
            data: Data set

        Returns:
            RayFileType or None.
        """
        return None

    @staticmethod
    def load_data(
        data: Any,
        ignore: Optional[Sequence[str]] = None,
        indices: Optional[Sequence[Any]] = None,
        **kwargs
    ) -> pd.DataFrame:
        """
        Load data into a pandas dataframe.

        Ignore specific columns, and optionally select specific indices.

        Args:
            data: Input data
            ignore: Column names to ignore
            indices: Indices to select. What an
                index indicates depends on the data source.

        Returns:
            Pandas DataFrame.
        """
        raise NotImplementedError

    @staticmethod
    def update_feature_names(matrix: "xgb.DMatrix", feature_names: Optional[List[str]]):
        """Optionally update feature names before training/prediction.

        Args:
            matrix: xgboost DMatrix object.
            feature_names: Feature names manually passed to the
                ``RayDMatrix`` object.

        """
        pass

    @staticmethod
    def convert_to_series(data: Any) -> pd.Series:
        """Convert data from the data source type to a pandas series"""
        if isinstance(data, pd.DataFrame):
            return pd.Series(data.squeeze())

        if not isinstance(data, pd.Series):
            return pd.Series(data)

        return data

    @classmethod
    def get_column(
        cls, data: pd.DataFrame, column: Any
    ) -> Tuple[pd.Series, Optional[Union[str, List]]]:
        """Helper method wrapping around ``convert_to_series()``.

        This method should usually not be overridden.
        """
        if isinstance(column, str) or isinstance(column, List):
            return data[column], column
        elif column is not None:
            return cls.convert_to_series(column), None
        return column, None

    @staticmethod
    def get_n(data: Any):
        """Get length of data source partitions for sharding."""
        return len(data)

    @staticmethod
    def get_actor_shards(
        data: Any, actors: Sequence[ActorHandle]
    ) -> Tuple[Any, Optional[Dict[int, Any]]]:
        """Get a dict mapping actor ranks to shards.

        Args:
            data: Data to shard.
            actors: Actors to assign the shards to.

        Returns:
            Returns a tuple of which the first element indicates the new
                data object that will overwrite the existing data object
                in the RayDMatrix (e.g. when the object is not serializable).
                The second element is a dict mapping actor ranks to shards.
                These objects are usually passed to the ``load_data()`` method
                for distributed loading, so that method needs to be able to
                deal with the respective data.
        """
        return data, None


================================================
FILE: xgboost_ray/data_sources/modin.py
================================================
from collections import defaultdict
from typing import Any, Dict, Optional, Sequence, Tuple, Union

import pandas as pd
import ray
from ray import ObjectRef
from ray.actor import ActorHandle

from xgboost_ray.data_sources._distributed import (
    assign_partitions_to_actors,
    get_actor_rank_ips,
)
from xgboost_ray.data_sources.data_source import DataSource, RayFileType
from xgboost_ray.data_sources.object_store import ObjectStore

try:
    import modin  # noqa: F401
    from modin.config.envvars import Engine
    from modin.distributed.dataframe.pandas import unwrap_partitions  # noqa: F401
    from modin.pandas import DataFrame as ModinDataFrame  # noqa: F401
    from modin.pandas import Series as ModinSeries  # noqa: F401
    from packaging.version import Version

    MODIN_INSTALLED = Version(modin.__version__) >= Version("0.9.0")

    # Check if importing the Ray engine leads to errors
    Engine().get()

except (ImportError, AttributeError):
    MODIN_INSTALLED = False


def _assert_modin_installed():
    if not MODIN_INSTALLED:
        raise RuntimeError(
            "Tried to use Modin as a data source, but modin is not "
            "installed or it conflicts with the pandas version. "
            "This function shouldn't have been called. "
            "\nFIX THIS by installing modin: `pip install modin` "
            "and making sure that the installed pandas version is "
            "supported by modin."
            "\nPlease also raise an issue on our GitHub: "
            "https://github.com/ray-project/xgboost_ray as this part of "
            "the code should not have been reached."
        )


class Modin(DataSource):
    """Read from distributed Modin dataframe.

    `Modin <https://github.com/modin-project/modin>`_ is a distributed
    drop-in replacement for pandas supporting Ray as a backend.

    Modin dataframes are split into partitions that can reside on multiple
    nodes, making them suitable for distributed loading.
    """

    supports_central_loading = True
    supports_distributed_loading = True

    @staticmethod
    def is_data_type(data: Any, filetype: Optional[RayFileType] = None) -> bool:
        if not MODIN_INSTALLED:
            return False
        # Has to be imported again.
        from modin.pandas import DataFrame as ModinDataFrame  # noqa: F811
        from modin.pandas import Series as ModinSeries  # noqa: F811

        return isinstance(data, (ModinDataFrame, ModinSeries))

    @staticmethod
    def load_data(
        data: Any,  # modin.pandas.DataFrame
        ignore: Optional[Sequence[str]] = None,
        indices: Optional[Union[Sequence[int], Sequence[ObjectRef]]] = None,
        **kwargs
    ) -> pd.DataFrame:
        _assert_modin_installed()

        if (
            indices is not None
            and len(indices) > 0
            and isinstance(indices[0], ObjectRef)
        ):
            # We got a list of ObjectRefs belonging to Modin partitions
            return ObjectStore.load_data(data=indices, indices=None, ignore=ignore)

        local_df = data
        if indices:
            local_df = local_df.iloc[indices]

        local_df = local_df._to_pandas()

        if ignore:
            local_df = local_df[local_df.columns.difference(ignore)]

        return local_df

    @staticmethod
    def convert_to_series(data: Any) -> pd.Series:
        _assert_modin_installed()
        # Has to be imported again.
        from modin.pandas import DataFrame as ModinDataFrame  # noqa: F811
        from modin.pandas import Series as ModinSeries  # noqa: F811

        if isinstance(data, ModinDataFrame):
            return pd.Series(data._to_pandas().squeeze())
        elif isinstance(data, ModinSeries):
            return data._to_pandas()

        return DataSource.convert_to_series(data)

    @staticmethod
    def get_actor_shards(
        data: Any, actors: Sequence[ActorHandle]  # modin.pandas.DataFrame
    ) -> Tuple[Any, Optional[Dict[int, Any]]]:
        _assert_modin_installed()

        # Has to be imported again.
        from modin.distributed.dataframe.pandas import unwrap_partitions  # noqa: F811

        actor_rank_ips = get_actor_rank_ips(actors)

        # Get IPs and partitions
        unwrapped = unwrap_partitions(data, axis=0, get_ip=True)
        ip_objs, part_objs = zip(*unwrapped)

        # Build a table mapping from IP to list of partitions
        ip_to_parts = defaultdict(list)
        for ip, part_obj in zip(ray.get(list(ip_objs)), part_objs):
            ip_to_parts[ip].append(part_obj)

        # Modin dataframes are not serializable, so pass None here
        # as the first return value
        return None, assign_partitions_to_actors(ip_to_parts, actor_rank_ips)

    @staticmethod
    def get_n(data: Any):
        """
        For naive distributed loading we just return the number of rows
        here. Loading by shard is achieved via `get_actor_shards()`.
        """
        return len(data)


================================================
FILE: xgboost_ray/data_sources/numpy.py
================================================
from typing import TYPE_CHECKING, Any, List, Optional, Sequence

import numpy as np
import pandas as pd

from xgboost_ray.data_sources.data_source import DataSource, RayFileType
from xgboost_ray.data_sources.pandas import Pandas

if TYPE_CHECKING:
    from xgboost_ray.xgb import xgboost as xgb


class Numpy(DataSource):
    """Read from numpy arrays."""

    @staticmethod
    def is_data_type(data: Any, filetype: Optional[RayFileType] = None) -> bool:
        return isinstance(data, np.ndarray)

    @staticmethod
    def update_feature_names(matrix: "xgb.DMatrix", feature_names: Optional[List[str]]):
        # Potentially unset feature names
        matrix.feature_names = feature_names

    @staticmethod
    def load_data(
        data: np.ndarray,
        ignore: Optional[Sequence[str]] = None,
        indices: Optional[Sequence[int]] = None,
        **kwargs,
    ) -> pd.DataFrame:
        local_df = pd.DataFrame(data, columns=[f"f{i}" for i in range(data.shape[1])])
        return Pandas.load_data(local_df, ignore=ignore, indices=indices)


================================================
FILE: xgboost_ray/data_sources/object_store.py
================================================
from typing import Any, Optional, Sequence

import pandas as pd
import ray
from ray import ObjectRef

from xgboost_ray.data_sources.data_source import DataSource, RayFileType
from xgboost_ray.data_sources.pandas import Pandas


class ObjectStore(DataSource):
    """Read pandas dataframes and series from the Ray object store."""

    @staticmethod
    def is_data_type(data: Any, filetype: Optional[RayFileType] = None) -> bool:
        if isinstance(data, Sequence):
            return all(isinstance(d, ObjectRef) for d in data)
        return isinstance(data, ObjectRef)

    @staticmethod
    def load_data(
        data: Sequence[ObjectRef],
        ignore: Optional[Sequence[str]] = None,
        indices: Optional[Sequence[int]] = None,
        **kwargs
    ) -> pd.DataFrame:
        if indices is not None:
            data = [data[i] for i in indices]

        local_df = ray.get(data)

        return Pandas.load_data(pd.concat(local_df, copy=False), ignore=ignore)

    @staticmethod
    def convert_to_series(data: Any) -> pd.Series:
        if isinstance(data, ObjectRef):
            data = ray.get(data)
        else:
            data = pd.concat(ray.get(data), copy=False)
        return DataSource.convert_to_series(data)


================================================
FILE: xgboost_ray/data_sources/pandas.py
================================================
from typing import Any, Optional, Sequence

import pandas as pd

from xgboost_ray.data_sources.data_source import DataSource, RayFileType


class Pandas(DataSource):
    """Read from pandas dataframes and series."""

    @staticmethod
    def is_data_type(data: Any, filetype: Optional[RayFileType] = None) -> bool:
        return isinstance(data, (pd.DataFrame, pd.Series))

    @staticmethod
    def load_data(
        data: Any,
        ignore: Optional[Sequence[str]] = None,
        indices: Optional[Sequence[int]] = None,
        **kwargs
    ) -> pd.DataFrame:
        local_df = data

        if ignore:
            local_df = local_df[local_df.columns.difference(ignore)]

        if indices:
            return local_df.iloc[indices]

        return local_df


================================================
FILE: xgboost_ray/data_sources/parquet.py
================================================
from typing import Any, Iterable, Optional, Sequence, Union

import pandas as pd

from xgboost_ray.data_sources.data_source import DataSource, RayFileType
from xgboost_ray.data_sources.pandas import Pandas


class Parquet(DataSource):
    """Read one or many Parquet files."""

    supports_central_loading = True
    supports_distributed_loading = True

    @staticmethod
    def is_data_type(data: Any, filetype: Optional[RayFileType] = None) -> bool:
        return filetype == RayFileType.PARQUET

    @staticmethod
    def get_filetype(data: Any) -> Optional[RayFileType]:
        if data.endswith(".parquet"):
            return RayFileType.PARQUET
        return None

    @staticmethod
    def load_data(
        data: Union[str, Sequence[str]],
        ignore: Optional[Sequence[str]] = None,
        indices: Optional[Sequence[int]] = None,
        **kwargs
    ) -> pd.DataFrame:
        if isinstance(data, Iterable) and not isinstance(data, str):
            shards = []

            for i, shard in enumerate(data):
                if indices and i not in indices:
                    continue

                shard_df = pd.read_parquet(shard, **kwargs)
                shards.append(Pandas.load_data(shard_df, ignore=ignore))
            return pd.concat(shards, copy=False)
        else:
            local_df = pd.read_parquet(data, **kwargs)
            return Pandas.load_data(local_df, ignore=ignore)

    @staticmethod
    def get_n(data: Any):
        return len(list(data))


================================================
FILE: xgboost_ray/data_sources/partitioned.py
================================================
from collections import defaultdict
from typing import Any, Dict, Optional, Sequence, Tuple

import numpy as np
import pandas as pd
from ray import ObjectRef
from ray.actor import ActorHandle

from xgboost_ray.data_sources._distributed import (
    assign_partitions_to_actors,
    get_actor_rank_ips,
)
from xgboost_ray.data_sources.data_source import DataSource, RayFileType
from xgboost_ray.data_sources.numpy import Numpy
from xgboost_ray.data_sources.pandas import Pandas


class Partitioned(DataSource):
    """Read from distributed data structure implementing __partitioned__.

    __partitioned__ provides metadata about how the data is partitioned and
    distributed across several compute nodes, making objects that support it
    suitable for distributed loading.

    Also see the __partitioned__ spec:
    https://github.com/IntelPython/DPPY-Spec/blob/draft/partitioned/Partitioned.md
    """

    supports_central_loading = True
    supports_distributed_loading = True

    @staticmethod
    def is_data_type(data: Any, filetype: Optional[RayFileType] = None) -> bool:
        return hasattr(data, "__partitioned__")

    @staticmethod
    def load_data(
        data: Any,  # __partitioned__ dict
        ignore: Optional[Sequence[str]] = None,
        indices: Optional[Sequence[ObjectRef]] = None,
        **kwargs
    ) -> pd.DataFrame:

        assert isinstance(data, dict), "Expected __partitioned__ dict"
        _get = data["get"]

        if indices is None or len(indices) == 0:
            tiling = data["partition_tiling"]
            ndims = len(tiling)
            # we need tuples to access partitions in the right order
            pos_suffix = (0,) * (ndims - 1)
            parts = data["partitions"]
            # get the full data, i.e. all shards/partitions
            local_df = [
                _get(parts[(i,) + pos_suffix]["data"]) for i in range(tiling[0])
            ]
        else:
            # here we got a list of futures for partitions
            local_df = _get(indices)

        if isinstance(local_df[0], pd.DataFrame):
            return Pandas.load_data(pd.concat(local_df, copy=False), ignore=ignore)
        else:
            return Numpy.load_data(np.concatenate(local_df), ignore=ignore)

    @staticmethod
    def get_actor_shards(
        data: Any, actors: Sequence[ActorHandle]  # partitioned.pandas.DataFrame
    ) -> Tuple[Any, Optional[Dict[int, Any]]]:
        assert hasattr(data, "__partitioned__")

        actor_rank_ips = get_actor_rank_ips(actors)

        # Get accessor func and partitions
        parted = data.__partitioned__
        parts = parted["partitions"]
        tiling = parted["partition_tiling"]
        ndims = len(tiling)
        if ndims < 1 or ndims > 2 or any(tiling[x] != 1 for x in range(1, ndims)):
            raise RuntimeError(
                "Only row-wise partitionings of 1d/2d structures supported."
            )

        # Now build a table mapping from IP to list of partitions
        ip_to_parts = defaultdict(list)
        # we need tuples to access partitions in the right order
        pos_suffix = (0,) * (ndims - 1)
        for i in range(tiling[0]):
            part = parts[(i,) + pos_suffix]  # this works for 1d and 2d
            ip_to_parts[part["location"][0]].append(part["data"])
        # __partitioned__ is serializable, so pass it here
        # as the first return value
        ret = parted, assign_partitions_to_actors(ip_to_parts, actor_rank_ips)
        return ret

    @staticmethod
    def get_n(data: Any):
        """Get length of data source partitions for sharding."""
        return data.__partitioned__["shape"][0]


================================================
FILE: xgboost_ray/data_sources/petastorm.py
================================================
from typing import Any, List, Optional, Sequence, Union

import pandas as pd

from xgboost_ray.data_sources.data_source import DataSource, RayFileType

try:
    import petastorm

    PETASTORM_INSTALLED = True
except ImportError:
    PETASTORM_INSTALLED = False


def _assert_petastorm_installed():
    if not PETASTORM_INSTALLED:
        raise RuntimeError(
            "Tried to use Petastorm as a data source, but petastorm is not "
            "installed. This function shouldn't have been called. "
            "\nFIX THIS by installing petastorm: `pip install petastorm`. "
            "\nPlease also raise an issue on our GitHub: "
            "https://github.com/ray-project/xgboost_ray as this part of "
            "the code should not have been reached."
        )


class Petastorm(DataSource):
    """Read with Petastorm.

    `Petastorm <https://github.com/uber/petastorm>`_ is a data access library
    for training and evaluating machine learning models directly from
    datasets in Apache Parquet format.

    This class accesses Petastorm's dataset loading interface for efficient
    loading of large datasets.
    """

    supports_central_loading = True
    supports_distributed_loading = True

    @staticmethod
    def is_data_type(data: Any, filetype: Optional[RayFileType] = None) -> bool:
        return PETASTORM_INSTALLED and filetype == RayFileType.PETASTORM

    @staticmethod
    def get_filetype(data: Any) -> Optional[RayFileType]:
        if not PETASTORM_INSTALLED:
            return None

        if not isinstance(data, List):
            data = [data]

        def _is_compatible(url: str):
            return url.endswith(".parquet") and (
                url.startswith("s3://")
                or url.startswith("gs://")
                or url.startswith("hdfs://")
                or url.startswith("file://")
            )

        if all(_is_compatible(url) for url in data):
            return RayFileType.PETASTORM

        return None

    @staticmethod
    def load_data(
        data: Union[str, Sequence[str]],
        ignore: Optional[Sequence[str]] = None,
        indices: Optional[Sequence[int]] = None,
        **kwargs
    ) -> pd.DataFrame:
        _assert_petastorm_installed()
        with petastorm.make_batch_reader(data) as reader:
            shards = [
                pd.DataFrame(batch._asdict())
                for i, batch in enumerate(reader)
                if not indices or i in indices
            ]

        local_df = pd.concat(shards, copy=False)

        if ignore:
            local_df = local_df[local_df.columns.difference(ignore)]

        return local_df

    @staticmethod
    def get_n(data: Any):
        return len(list(data))


================================================
FILE: xgboost_ray/data_sources/ray_dataset.py
================================================
from typing import Any, Dict, Optional, Sequence, Tuple, Union

import pandas as pd
import ray
from ray.actor import ActorHandle

from xgboost_ray.data_sources.data_source import DataSource, RayFileType
from xgboost_ray.data_sources.pandas import Pandas

try:
    import ray.data.dataset  # noqa: F401

    RAY_DATASET_AVAILABLE = True
except (ImportError, AttributeError):
    RAY_DATASET_AVAILABLE = False

DATASET_TO_PANDAS_LIMIT = float("inf")


def _assert_ray_data_available():
    if not RAY_DATASET_AVAILABLE:
        raise RuntimeError(
            "Tried to use Ray datasets as a data source, but your version "
            "of Ray does not support it. "
            "\nFIX THIS by upgrading Ray: `pip install -U ray`. "
            "\nPlease also raise an issue on our GitHub: "
            "https://github.com/ray-project/xgboost_ray as this part of "
            "the code should not have been reached."
        )


class RayDataset(DataSource):
    """Read from distributed Ray dataset."""

    supports_central_loading = True
    supports_distributed_loading = True
    needs_partitions = False

    @staticmethod
    def is_data_type(data: Any, filetype: Optional[RayFileType] = None) -> bool:
        if not RAY_DATASET_AVAILABLE:
            return False

        return isinstance(data, ray.data.dataset.Dataset)

    @staticmethod
    def load_data(
        data: "ray.data.dataset.Dataset",
        ignore: Optional[Sequence[str]] = None,
        indices: Optional[
            Union[Sequence[int], Sequence["ray.data.dataset.Dataset"]]
        ] = None,
        **kwargs
    ) -> pd.DataFrame:
        _assert_ray_data_available()

        if indices is not None:
            if len(indices) > 0 and isinstance(indices[0], ray.data.dataset.Dataset):
                # We got a list of Datasets belonging to a partition
                data = indices
            else:
                data = [data[i] for i in indices]

        if isinstance(data, ray.data.dataset.Dataset):
            local_df = data.to_pandas(limit=DATASET_TO_PANDAS_LIMIT)
        else:
            local_df = pd.concat(
                [ds.to_pandas(limit=DATASET_TO_PANDAS_LIMIT) for ds in data], copy=False
            )
        return Pandas.load_data(local_df, ignore=ignore)

    @staticmethod
    def convert_to_series(
        data: Union["ray.data.dataset.Dataset", Sequence["ray.data.dataset.Dataset"]]
    ) -> pd.Series:
        _assert_ray_data_available()

        if isinstance(data, ray.data.dataset.Dataset):
            data = data.to_pandas(limit=DATASET_TO_PANDAS_LIMIT)
        else:
            data = pd.concat(
                [ds.to_pandas(limit=DATASET_TO_PANDAS_LIMIT) for ds in data], copy=False
            )
        return DataSource.convert_to_series(data)

    @staticmethod
    def get_actor_shards(
        data: "ray.data.dataset.Dataset", actors: Sequence[ActorHandle]
    ) -> Tuple[Any, Optional[Dict[int, Any]]]:
        _assert_ray_data_available()

        # We do not use our assign_partitions_to_actors here, as the assignment
        # of splits to actors is handled by the locality_hints argument.

        dataset_splits = data.split(
            len(actors),
            equal=True,
            locality_hints=actors,
        )

        return None, {
            i: [dataset_split] for i, dataset_split in enumerate(dataset_splits)
        }

    @staticmethod
    def get_n(data: "ray.data.dataset.Dataset"):
        """
        Return number of distributed blocks.
        """
        return data._plan.initial_num_blocks()


================================================
FILE: xgboost_ray/elastic.py
================================================
import time
from typing import Callable, Dict, List, Optional, Tuple

import ray

from xgboost_ray.main import (
    ENV,
    ActorHandle,
    RayParams,
    RayXGBoostActorAvailable,
    _create_actor,
    _PrepareActorTask,
    _TrainingState,
    logger,
)
from xgboost_ray.matrix import RayDMatrix


def _maybe_schedule_new_actors(
    training_state: _TrainingState,
    num_cpus_per_actor: int,
    num_gpus_per_actor: int,
    resources_per_actor: Optional[Dict],
    ray_params: RayParams,
    load_data: List[RayDMatrix],
) -> bool:
    """Schedule new actors for elastic training if resources are available.

    Potentially starts new actors and triggers data loading."""

    # This is only enabled for elastic training.
    if not ray_params.elastic_training:
        return False

    missing_actor_ranks = [
        rank
        for rank, actor in enumerate(training_state.actors)
        if actor is None and rank not in training_state.pending_actors
    ]

    # If all actors are alive, there is nothing to do.
    if not missing_actor_ranks:
        return False

    now = time.time()

    # Only check for resources every ENV.ELASTIC_RESTART_RESOURCE_CHECK_S
    # seconds.
    if (
        now
        < training_state.last_resource_check_at + ENV.ELASTIC_RESTART_RESOURCE_CHECK_S
    ):
        return False

    training_state.last_resource_check_at = now

    new_pending_actors: Dict[int, Tuple[ActorHandle, _PrepareActorTask]] = {}
    for rank in missing_actor_ranks:
        # Actor rank should not be already pending
        if rank in training_state.pending_actors or rank in new_pending_actors:
            continue

        # Try to schedule this actor
        actor = _create_actor(
            rank=rank,
            num_actors=ray_params.num_actors,
            num_cpus_per_actor=num_cpus_per_actor,
            num_gpus_per_actor=num_gpus_per_actor,
            resources_per_actor=resources_per_actor,
            placement_group=training_state.placement_group,
            queue=training_state.queue,
            checkpoint_frequency=ray_params.checkpoint_frequency,
            distributed_callbacks=ray_params.distributed_callbacks,
        )

        task = _PrepareActorTask(
            actor,
            queue=training_state.queue,
            stop_event=training_state.stop_event,
            load_data=load_data,
        )

        new_pending_actors[rank] = (actor, task)
        logger.debug(
            f"Re-scheduled actor with rank {rank}. Waiting for "
            f"placement and data loading before promoting it "
            f"to training."
        )
    if new_pending_actors:
        training_state.pending_actors.update(new_pending_actors)
        logger.info(
            f"Re-scheduled {len(new_pending_actors)} actors for "
            f"training. Once data loading has finished, they will be "
            f"integrated into training again."
        )
    return bool(new_pending_actors)


def _update_scheduled_actor_states(training_state: _TrainingState):
    """Update status of scheduled actors in elastic training.

    If actors finished their preparation tasks, promote them to
    proper training actors (set the `training_state.actors` entry).

    Also schedule a `RayXGBoostActorAvailable` exception so that training
    is restarted with the new actors.

    """
    now = time.time()
    actor_became_ready = False

    # Wrap in list so we can alter the `training_state.pending_actors` dict
    for rank in list(training_state.pending_actors.keys()):
        actor, task = training_state.pending_actors[rank]
        if task.is_ready():
            # Promote to proper actor
            training_state.actors[rank] = actor
            del training_state.pending_actors[rank]
            actor_became_ready = True

    if actor_became_ready:
        if not training_state.pending_actors:
            # No other actors are pending, so let's restart right away.
            training_state.restart_training_at = now - 1.0

        # If an actor became ready but other actors are pending, we wait
        # for n seconds before restarting, as chances are that they become
        # ready as well (e.g. if a large node came up).
        grace_period = ENV.ELASTIC_RESTART_GRACE_PERIOD_S
        if training_state.restart_training_at is None:
            logger.debug(
                f"A RayXGBoostActor became ready for training. Waiting "
                f"{grace_period} seconds before triggering training restart."
            )
            training_state.restart_training_at = now + grace_period

    if training_state.restart_training_at is not None:
        if now > training_state.restart_training_at:
            training_state.restart_training_at = None
            raise RayXGBoostActorAvailable(
                "A new RayXGBoostActor became available for training. "
                "Triggering restart."
            )


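def _get_actor_alive_status(
    actors: List[Optional[ActorHandle]], callback: Callable[[ActorHandle], None]
) -> Tuple[int, int]:
    """Loop through all actors and count how many are alive and dead.

    Entries that are ``None`` count as dead. The callback is invoked only
    for actors whose liveness check (a remote ``pid`` call) fails.
    """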
    obj_to_rank = {}

    alive = 0
    dead = 0

    for rank, actor in enumerate(actors):
        if actor is None:
            dead += 1
            continue
        obj = actor.pid.remote()
        obj_to_rank[obj] = rank

    not_ready = list(obj_to_rank.keys())
    while not_ready:
        ready, not_ready = ray.wait(not_ready, timeout=0)

        for obj in ready:
            try:
                pid = ray.get(obj)
                rank = obj_to_rank[obj]
                logger.debug(f"Actor {actors[rank]} with PID {pid} is alive.")
                alive += 1
            except Exception:
                rank = obj_to_rank[obj]
                logger.debug(f"Actor {actors[rank]} is _not_ alive.")
                dead += 1
                callback(actors[rank])
    logger.info(f"Actor status: {alive} alive, {dead} dead ({alive + dead} total)")

    return alive, dead


================================================
FILE: xgboost_ray/examples/__init__.py
================================================


================================================
FILE: xgboost_ray/examples/create_test_data.py
================================================
from xgboost_ray.tests.utils import create_parquet


def main():
    create_parquet(
        "example.parquet",
        num_rows=1_000_000,
        num_partitions=100,
        num_features=8,
        num_classes=2,
    )


if __name__ == "__main__":
    main()


================================================
FILE: xgboost_ray/examples/higgs.py
================================================
import os
import time

from xgboost_ray import RayDMatrix, RayParams, train

FILENAME_CSV = "HIGGS.csv.gz"


def download_higgs(target_file):
    url = (
        "https://archive.ics.uci.edu/ml/machine-learning-databases/"
        "00280/HIGGS.csv.gz"
    )

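    try:
        import urllib.request
    except ImportError as e:
        # `urllib` ships with the Python standard library, so it cannot be
        # installed with pip; point the user to a manual download instead.
        raise ValueError(
            "Automatic downloading of the HIGGS dataset requires the "
            "standard-library module `urllib.request`, which could not be "
            f"imported.\nFIX THIS by manually downloading the dataset from {url}."
        ) from e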

    print(f"Downloading HIGGS dataset to {target_file}")
    urllib.request.urlretrieve(url, target_file)
    return os.path.exists(target_file)


def main():
    # Example adapted from this blog post:
    # https://medium.com/rapids-ai/a-new-official-dask-api-for-xgboost-e8b10f3d1eb7
    # This uses the HIGGS dataset. Download here:
    # https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz

    if not os.path.exists(FILENAME_CSV):
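        # Use an explicit check: `assert` statements are skipped under
        # `python -O`, which would also skip the download itself.
        if not download_higgs(FILENAME_CSV):
            raise RuntimeError("Downloading of HIGGS dataset failed.")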
        print("HIGGS dataset downloaded.")
    else:
        print("HIGGS dataset found locally.")

    colnames = ["label"] + ["feature-%02d" % i for i in range(1, 29)]

    dtrain = RayDMatrix(os.path.abspath(FILENAME_CSV), label="label", names=colnames)

    config = {
        "tree_method": "hist",
        "eval_metric": ["logloss", "error"],
    }

    evals_result = {}

    start = time.time()
    bst = train(
        config,
        dtrain,
        evals_result=evals_result,
        ray_params=RayParams(max_actor_restarts=1, num_actors=1),
        num_boost_round=100,
        evals=[(dtrain, "train")],
    )
    taken = time.time() - start
    print(f"TRAIN TIME TAKEN: {taken:.2f} seconds")

    bst.save_model("higgs.xgb")
    print("Final training error: {:.4f}".format(evals_result["train"]["error"][-1]))


if __name__ == "__main__":
    import ray

    ray.init()

    start = time.time()
    main()
    taken = time.time() - start
    print(f"TOTAL TIME TAKEN: {taken:.2f} seconds")


================================================
FILE: xgboost_ray/examples/higgs_parquet.py
================================================
import os
import time

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from higgs import download_higgs

from xgboost_ray import RayDMatrix, RayParams, train

FILENAME_CSV = "HIGGS.csv.gz"
FILENAME_PARQUET = "HIGGS.parquet"


def csv_to_parquet(in_file, out_file, chunksize=100_000, **csv_kwargs):
    if os.path.exists(out_file):
        return False

    print(f"Converting CSV {in_file} to PARQUET {out_file}")
    csv_stream = pd.read_csv(
        in_file, sep=",", chunksize=chunksize, low_memory=False, **csv_kwargs
    )

    parquet_schema = None
    parquet_writer = None
    for i, chunk in enumerate(csv_stream):
        print("Chunk", i)
        if parquet_schema is None:
            # Guess the schema of the CSV file from the first chunk
            parquet_schema = pa.Table.from_pandas(df=chunk).schema
            # Open a Parquet file for writing
            parquet_writer = pq.ParquetWriter(
                out_file, parquet_schema, compression="snappy"
            )
        # Write CSV chunk to the parquet file
        table = pa.Table.from_pandas(chunk, schema=parquet_schema)
        parquet_writer.write_table(table)

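    if parquet_writer is None:
        # No chunk was read (empty CSV), so no Parquet file was written.
        return False
    parquet_writer.close()
    return True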


def main():
    # Example adapted from this blog post:
    # https://medium.com/rapids-ai/a-new-official-dask-api-for-xgboost-e8b10f3d1eb7
    # This uses the HIGGS dataset. Download here:
    # https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz

    if not os.path.exists(FILENAME_PARQUET):
        if not os.path.exists(FILENAME_CSV):
            download_higgs(FILENAME_CSV)
            print("Downloaded HIGGS csv dataset")
        print("Converting HIGGS csv dataset to parquet")
        csv_to_parquet(
            FILENAME_CSV,
            FILENAME_PARQUET,
            names=["label"] + ["feature-%02d" % i for i in range(1, 29)],
        )

    colnames = ["label"] + ["feature-%02d" % i for i in range(1, 29)]

    # Here we load the Parquet file
    dtrain = RayDMatrix(
        os.path.abspath(FILENAME_PARQUET), label="label", columns=colnames
    )

    config = {
        "tree_method": "hist",
        "eval_metric": ["logloss", "error"],
    }

    evals_result = {}

    start = time.time()
    bst = train(
        config,
        dtrain,
        evals_result=evals_result,
        ray_params=RayParams(max_actor_restarts=1, num_actors=1),
        num_boost_round=100,
        evals=[(dtrain, "train")],
    )
    taken = time.time() - start
    print(f"TRAIN TIME TAKEN: {taken:.2f} seconds")

    bst.save_model("higgs.xgb")
    print("Final training error: {:.4f}".format(evals_result["train"]["error"][-1]))


if __name__ == "__main__":
    import ray

    ray.init()

    start = time.time()
    main()
    taken = time.time() - start
    print(f"TOTAL TIME TAKEN: {taken:.2f} seconds")


================================================
FILE: xgboost_ray/examples/readme.py
================================================
# flake8: noqa E501


def readme_simple():
    from sklearn.datasets import load_breast_cancer

    from xgboost_ray import RayDMatrix, RayParams, train

    train_x, train_y = load_breast_cancer(return_X_y=True)
    train_set = RayDMatrix(train_x, train_y)

    evals_result = {}
    bst = train(
        {
            "objective": "binary:logistic",
            "eval_metric": ["logloss", "error"],
        },
        train_set,
        evals_result=evals_result,
        evals=[(train_set, "train")],
        verbose_eval=False,
        ray_params=RayParams(num_actors=2, cpus_per_actor=1),
    )

    bst.save_model("model.xgb")
    print("Final training error: {:.4f}".format(evals_result["train"]["error"][-1]))


def readme_predict():
    import xgboost as xgb
    from sklearn.datasets import load_breast_cancer

    from xgboost_ray import RayDMatrix, RayParams, predict

    data, labels = load_breast_cancer(return_X_y=True)

    dpred = RayDMatrix(data, labels)

    bst = xgb.Booster(model_file="model.xgb")
    pred_ray = predict(bst, dpred, ray_params=RayParams(num_actors=2))

    print(pred_ray)


def readme_tune():
    from sklearn.datasets import load_breast_cancer

    from xgboost_ray import RayDMatrix, RayParams, train

    num_actors = 4
    num_cpus_per_actor = 1

    ray_params = RayParams(num_actors=num_actors, cpus_per_actor=num_cpus_per_actor)

    def train_model(config):
        train_x, train_y = load_breast_cancer(return_X_y=True)
        train_set = RayDMatrix(train_x, train_y)

        evals_result = {}
        bst = train(
            params=config,
            dtrain=train_set,
            evals_result=evals_result,
            evals=[(train_set, "train")],
            verbose_eval=False,
            ray_params=ray_params,
        )
        bst.save_model("model.xgb")

    from ray import tune

    # Specify the hyperparameter search space.
    config = {
        "tree_method": "approx",
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
        "eta": tune.loguniform(1e-4, 1e-1),
        "subsample": tune.uniform(0.5, 1.0),
        "max_depth": tune.randint(1, 9),
    }

    # Make sure to use the `get_tune_resources` method to set the `resources_per_trial`
    analysis = tune.run(
        train_model,
        config=config,
        metric="train-error",
        mode="min",
        num_samples=4,
        resources_per_trial=ray_params.get_tune_resources(),
    )
    print("Best hyperparameters", analysis.best_config)


if __name__ == "__main__":
    import ray

    ray.init(num_cpus=5)

    print("Readme: Simple example")
    readme_simple()
    readme_predict()
    try:
        print("Readme: Ray Tune example")
        readme_tune()
    except ImportError:
        print("Ray Tune not installed.")


================================================
FILE: xgboost_ray/examples/readme_sklearn_api.py
================================================
def readme_sklearn_api():
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split

    from xgboost_ray import RayParams, RayXGBClassifier

    seed = 42

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.25, random_state=seed
    )

    clf = RayXGBClassifier(
        n_jobs=4, random_state=seed  # In XGBoost-Ray, n_jobs sets the number of actors
    )

    # The scikit-learn API automatically converts the data
    # to RayDMatrix format as needed.
    # You can also pass X as a RayDMatrix, in which case
    # y will be ignored.

    clf.fit(X_train, y_train)

    pred_ray = clf.predict(X_test)
    print(pred_ray)

    pred_proba_ray = clf.predict_proba(X_test)
    print(pred_proba_ray)

    # You can also pass a RayParams object to the
    # fit/predict/predict_proba methods. It overrides
    # the n_jobs value set during initialization.

    clf.fit(X_train, y_train, ray_params=RayParams(num_actors=2))

    pred_ray = clf.predict(X_test, ray_params=RayParams(num_actors=2))
    print(pred_ray)


if __name__ == "__main__":
    import ray

    ray.init(num_cpus=5)

    print("Readme: scikit-learn API example")
    readme_sklearn_api()


================================================
FILE: xgboost_ray/examples/simple.py
================================================
import argparse

import ray
from sklearn import datasets
from sklearn.model_selection import train_test_split

from xgboost_ray import RayDMatrix, RayParams, train


def main(cpus_per_actor, num_actors):
    # Load dataset
    data, labels = datasets.load_breast_cancer(return_X_y=True)
    # Split into train and test set
    train_x, test_x, train_y, test_y = train_test_split(data, labels, test_size=0.25)

    train_set = RayDMatrix(train_x, train_y)
    test_set = RayDMatrix(test_x, test_y)

    evals_result = {}

    # Set XGBoost config.
    xgboost_params = {
        "tree_method": "approx",
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
    }

    # Train the classifier
    bst = train(
        params=xgboost_params,
        dtrain=train_set,
        evals=[(test_set, "eval")],
        evals_result=evals_result,
        ray_params=RayParams(
            max_actor_restarts=0,
            gpus_per_actor=0,
            cpus_per_actor=cpus_per_actor,
            num_actors=num_actors,
        ),
        verbose_eval=False,
        num_boost_round=10,
    )

    model_path = "simple.xgb"
    bst.save_model(model_path)
    print("Final validation error: {:.4f}".format(evals_result["eval"]["error"][-1]))


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--address", required=False, type=str, help="the address to use for Ray"
    )
    parser.add_argument(
        "--server-address",
        required=False,
        type=str,
        help="Address of the remote server if using Ray Client.",
    )
    parser.add_argument(
        "--cpus-per-actor",
        type=int,
        default=1,
        help="Sets number of CPUs per xgboost training worker.",
    )
    parser.add_argument(
        "--num-actors",
        type=int,
        default=4,
        help="Sets number of xgboost workers to use.",
    )
    parser.add_argument(
        "--smoke-test",
        action="store_true",
        default=False,
        help="Start a local Ray instance for a quick smoke test.",
    )

    args, _ = parser.parse_known_args()

    if args.smoke_test:
        ray.init(num_cpus=args.num_actors)
    elif args.server_address:
        ray.util.connect(args.server_address)
    else:
        ray.init(address=args.address)

    main(args.cpus_per_actor, args.num_actors)


================================================
FILE: xgboost_ray/examples/simple_dask.py
================================================
import argparse

import numpy as np
import pandas as pd
import ray

from xgboost_ray import RayDMatrix, RayParams, train
from xgboost_ray.data_sources.dask import DASK_INSTALLED


def main(cpus_per_actor, num_actors):
    if not DASK_INSTALLED:
        print("Dask is not installed. Install with `pip install dask`")
        return

    # Local import so the installation check comes first
    import dask
    import dask.dataframe as dd
    from ray.util.dask import ray_dask_get

    dask.config.set(scheduler=ray_dask_get)

    # Generate dataset
    x = np.repeat(range(8), 16).reshape((32, 4))
    # Even numbers --> 0, odd numbers --> 1
    y = np.tile(np.repeat(range(2), 4), 4)

    # Flip some bits to reduce max accuracy
    bits_to_flip = np.random.choice(32, size=6, replace=False)
    y[bits_to_flip] = 1 - y[bits_to_flip]

    data = pd.DataFrame(x)
    data["label"] = y

    # Split into 4 partitions
    dask_df = dd.from_pandas(data, npartitions=4)

    train_set = RayDMatrix(dask_df, "label")

    evals_result = {}
    # Set XGBoost config.
    xgboost_params = {
        "tree_method": "approx",
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
    }

    # Train the classifier
    bst = train(
        params=xgboost_params,
        dtrain=train_set,
        evals=[(train_set, "train")],
        evals_result=evals_result,
        ray_params=RayParams(
            max_actor_restarts=0,
            gpus_per_actor=0,
            cpus_per_actor=cpus_per_actor,
            num_actors=num_actors,
        ),
        verbose_eval=False,
        num_boost_round=10,
    )

    model_path = "dask.xgb"
    bst.save_model(model_path)
    print("Final training error: {:.4f}".format(evals_result["train"]["error"][-1]))


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--address", required=False, type=str, help="the address to use for Ray"
    )
    parser.add_argument(
        "--server-address",
        required=False,
        type=str,
        help="Address of the remote server if using Ray Client.",
    )
    parser.add_argument(
        "--cpus-per-actor",
        type=int,
        default=1,
        help="Sets number of CPUs per xgboost training worker.",
    )
    parser.add_argument(
        "--num-actors",
        type=int,
        default=4,
        help="Sets number of xgboost workers to use.",
    )
    parser.add_argument(
        "--smoke-test",
        action="store_true",
        default=False,
        help="Start a local Ray instance for a quick smoke test.",
    )

    args, _ = parser.parse_known_args()

    if args.smoke_test:
        ray.init(num_cpus=args.num_actors + 1)
    elif args.server_address:
        ray.util.connect(args.server_address)
    else:
        ray.init(address=args.address)

    main(args.cpus_per_actor, args.num_actors)


================================================
FILE: xgboost_ray/examples/simple_modin.py
================================================
import argparse

import numpy as np
import pandas as pd
import ray

from xgboost_ray import RayDMatrix, RayParams, train
from xgboost_ray.data_sources.modin import MODIN_INSTALLED


def main(cpus_per_actor, num_actors):
    if not MODIN_INSTALLED:
        print(
            "Modin is not installed or installed in a version that is not "
            "compatible with xgboost_ray (< 0.9.0)."
        )
        return

    # Import modin after initializing Ray
    from modin.distributed.dataframe.pandas import from_partitions

    # Generate dataset
    x = np.repeat(range(8), 16).reshape((32, 4))
    # Even numbers --> 0, odd numbers --> 1
    y = np.tile(np.repeat(range(2), 4), 4)

    # Flip some bits to reduce max accuracy
    bits_to_flip = np.random.choice(32, size=6, replace=False)
    y[bits_to_flip] = 1 - y[bits_to_flip]

    data = pd.DataFrame(x)
    data["label"] = y

    # Split into 4 partitions
    partitions = [ray.put(part) for part in np.split(data, 4)]

    # Create modin df here
    modin_df = from_partitions(partitions, axis=0)

    train_set = RayDMatrix(modin_df, "label")

    evals_result = {}
    # Set XGBoost config.
    xgboost_params = {
        "tree_method": "approx",
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
    }

    # Train the classifier
    bst = train(
        params=xgboost_params,
        dtrain=train_set,
        evals=[(train_set, "train")],
        evals_result=evals_result,
        ray_params=RayParams(
            max_actor_restarts=0,
            gpus_per_actor=0,
            cpus_per_actor=cpus_per_actor,
            num_actors=num_actors,
        ),
        verbose_eval=False,
        num_boost_round=10,
    )

    model_path = "modin.xgb"
    bst.save_model(model_path)
    print("Final training error: {:.4f}".format(evals_result["train"]["error"][-1]))


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--address", required=False, type=str, help="the address to use for Ray"
    )
    parser.add_argument(
        "--server-address",
        required=False,
        type=str,
        help="Address of the remote server if using Ray Client.",
    )
    parser.add_argument(
        "--cpus-per-actor",
        type=int,
        default=1,
        help="Sets number of CPUs per xgboost training worker.",
    )
    parser.add_argument(
        "--num-actors",
        type=int,
        default=4,
        help="Sets number of xgboost workers to use.",
    )
    parser.add_argument(
        "--smoke-test",
        action="store_true",
        default=False,
        help="Start a local Ray instance for a quick smoke test.",
    )

    args, _ = parser.parse_known_args()

    if args.smoke_test:
        ray.init(num_cpus=args.num_actors + 1)
    elif args.server_address:
        ray.util.connect(args.server_address)
    else:
        ray.init(address=args.address)

    main(args.cpus_per_actor, args.num_actors)


================================================
FILE: xgboost_ray/examples/simple_objectstore.py
================================================
import argparse

import numpy as np
import pandas as pd
import ray

from xgboost_ray import RayDMatrix, RayParams, train


def main(cpus_per_actor, num_actors):
    # Generate dataset
    x = np.repeat(range(8), 16).reshape((32, 4))
    # Even numbers --> 0, odd numbers --> 1
    y = np.tile(np.repeat(range(2), 4), 4)

    # Flip some bits to reduce max accuracy
    bits_to_flip = np.random.choice(32, size=6, replace=False)
    y[bits_to_flip] = 1 - y[bits_to_flip]

    data = pd.DataFrame(x)
    data["label"] = y

    # Split into 4 partitions
    partitions = [ray.put(part) for part in np.split(data, 4)]

    train_set = RayDMatrix(partitions, "label")

    evals_result = {}
    # Set XGBoost config.
    xgboost_params = {
        "tree_method": "approx",
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
    }

    # Train the classifier
    bst = train(
        params=xgboost_params,
        dtrain=train_set,
        evals=[(train_set, "train")],
        evals_result=evals_result,
        ray_params=RayParams(
            max_actor_restarts=0,
            gpus_per_actor=0,
            cpus_per_actor=cpus_per_actor,
            num_actors=num_actors,
        ),
        verbose_eval=False,
        num_boost_round=10,
    )

    model_path = "objectstore.xgb"
    bst.save_model(model_path)
    print("Final training error: {:.4f}".format(evals_result["train"]["error"][-1]))


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--address", required=False, type=str, help="the address to use for Ray"
    )
    parser.add_argument(
        "--server-address",
        required=False,
        type=str,
        help="Address of the remote server if using Ray Client.",
    )
    parser.add_argument(
        "--cpus-per-actor",
        type=int,
        default=1,
        help="Sets number of CPUs per xgboost training worker.",
    )
    parser.add_argument(
        "--num-actors",
        type=int,
        default=4,
        help="Sets number of xgboost workers to use.",
    )
    parser.add_argument(
        "--smoke-test",
        action="store_true",
        default=False,
        help="Start a local Ray instance for a quick smoke test.",
    )

    args, _ = parser.parse_known_args()

    if args.smoke_test:
        ray.init(num_cpus=args.num_actors + 1)
    elif args.server_address:
        ray.util.connect(args.server_address)
    else:
        ray.init(address=args.address)

    main(args.cpus_per_actor, args.num_actors)


================================================
FILE: xgboost_ray/examples/simple_partitioned.py
================================================
import argparse

import numpy as np
import ray
from sklearn import datasets
from sklearn.model_selection import train_test_split

from xgboost_ray import RayDMatrix, RayParams, train

nc = 31


@ray.remote
class AnActor:
    """We mimic a distributed DF by having several actors create
    data which form the global DF.
    """

    @ray.method(num_returns=2)
    def genData(self, rank, nranks, nrows):
        """Generate global dataset and cut out local piece.
        In real life each actor would of course directly create local data.
        """
        # Load dataset
        data, labels = datasets.load_breast_cancer(return_X_y=True)
        # Split into train and test set
        train_x, _, train_y, _ = train_test_split(data, labels, test_size=0.25)
        train_y = train_y.reshape((train_y.shape[0], 1))
        train = np.hstack([train_x, train_y])
        assert nrows <= train.shape[0]
        assert nc == train.shape[1]
        sz = nrows // nranks
        return train[sz * rank : sz * (rank + 1)], ray.util.get_node_ip_address()


class Parted:
    """Class exposing __partitioned__"""

    def __init__(self, parted):
        self.__partitioned__ = parted


def main(cpus_per_actor, num_actors):
    nr = 424
    actors = [AnActor.remote() for _ in range(num_actors)]
    parts = [actors[i].genData.remote(i, num_actors, nr) for i in range(num_actors)]
    rowsperpart = nr // num_actors
    nr = rowsperpart * num_actors
    parted = Parted(
        {
            "shape": (nr, nc),
            "partition_tiling": (num_actors, 1),
            "get": lambda x: ray.get(x),
            "partitions": {
                (i, 0): {
                    "start": (i * rowsperpart, 0),
                    "shape": (rowsperpart, nc),
                    "data": parts[i][0],
                    "location": [ray.get(parts[i][1])],
                }
                for i in range(num_actors)
            },
        }
    )

    yl = nc - 1
    # Let's create DMatrix from our __partitioned__ structure
    train_set = RayDMatrix(parted, f"f{yl}")

    evals_result = {}
    # Set XGBoost config.
    xgboost_params = {
        "tree_method": "approx",
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
    }

    # Train the classifier
    bst = train(
        params=xgboost_params,
        dtrain=train_set,
        evals=[(train_set, "train")],
        evals_result=evals_result,
        ray_params=RayParams(
            max_actor_restarts=0,
            gpus_per_actor=0,
            cpus_per_actor=cpus_per_actor,
            num_actors=num_actors,
        ),
        verbose_eval=False,
        num_boost_round=10,
    )

    model_path = "partitioned.xgb"
    bst.save_model(model_path)
    print("Final training error: {:.4f}".format(evals_result["train"]["error"][-1]))


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--address", required=False, type=str, help="the address to use for Ray"
    )
    parser.add_argument(
        "--server-address",
        required=False,
        type=str,
        help="Address of the remote server if using Ray Client.",
    )
    parser.add_argument(
        "--cpus-per-actor",
        type=int,
        default=1,
        help="Sets number of CPUs per xgboost training worker.",
    )
    parser.add_argument(
        "--num-actors",
        type=int,
        default=4,
        help="Sets number of xgboost workers to use.",
    )
    parser.add_argument(
        "--smoke-test",
        action="store_true",
        default=False,
        help="Start a local Ray instance for a quick smoke test.",
    )

    args, _ = parser.parse_known_args()

    if not ray.is_initialized():
        if args.smoke_test:
            ray.init(num_cpus=args.num_actors + 1)
        elif args.server_address:
            ray.util.connect(args.server_address)
        else:
            ray.init(address=args.address)

    main(args.cpus_per_actor, args.num_actors)


================================================
FILE: xgboost_ray/examples/simple_predict.py
================================================
import os

import numpy as np
import xgboost as xgb
from sklearn import datasets

from xgboost_ray import RayDMatrix, RayParams, predict


def main():
    if not os.path.exists("simple.xgb"):
        raise ValueError(
            "Model file not found: `simple.xgb`"
            "\nFIX THIS by running `python simple.py` first to "
            "train the model."
        )

    # Load dataset
    data, labels = datasets.load_breast_cancer(return_X_y=True)

    dmat_xgb = xgb.DMatrix(data, labels)
    dmat_ray = RayDMatrix(data, labels)

    bst = xgb.Booster(model_file="simple.xgb")

    pred_xgb = bst.predict(dmat_xgb)
    pred_ray = predict(bst, dmat_ray, ray_params=RayParams(num_actors=2))

    np.testing.assert_array_equal(pred_xgb, pred_ray)
    print(pred_ray)


if __name__ == "__main__":
    main()


================================================
FILE: xgboost_ray/examples/simple_ray_dataset.py
================================================
import argparse

import numpy as np
import pandas as pd
import ray
from xgboost import DMatrix

from xgboost_ray import RayDMatrix, RayParams, train


def main(cpus_per_actor, num_actors):
    np.random.seed(1234)
    # Generate dataset
    x = np.repeat(range(8), 16).reshape((32, 4))
    # Even numbers --> 0, odd numbers --> 1
    y = np.tile(np.repeat(range(2), 4), 4)

    # Flip some bits to reduce max accuracy
    bits_to_flip = np.random.choice(32, size=6, replace=False)
    y[bits_to_flip] = 1 - y[bits_to_flip]

    data = pd.DataFrame(x)
    # Ray Datasets requires all column names to be strings
    data.columns = [str(c) for c in data.columns]
    data["label"] = y

    ray_ds = ray.data.from_pandas(data)
    train_set = RayDMatrix(ray_ds, "label")

    evals_result = {}
    # Set XGBoost config.
    xgboost_params = {
        "tree_method": "approx",
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
    }

    # Train the classifier
    bst = train(
        params=xgboost_params,
        dtrain=train_set,
        evals=[(train_set, "train")],
        evals_result=evals_result,
        ray_params=RayParams(
            max_actor_restarts=0,
            gpus_per_actor=0,
            cpus_per_actor=cpus_per_actor,
            num_actors=num_actors,
        ),
        verbose_eval=False,
        num_boost_round=10,
    )

    model_path = "ray_datasets.xgb"
    bst.save_model(model_path)
    print("Final training error: {:.4f}".format(evals_result["train"]["error"][-1]))

    # Distributed prediction
    scored = ray_ds.drop_columns(["label"]).map_batches(
        lambda batch: {"pred": bst.predict(DMatrix(batch))}, batch_format="pandas"
    )
    print(scored.to_pandas())


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--address", required=False, type=str, help="the address to use for Ray"
    )
    parser.add_argument(
        "--server-address",
        required=False,
        type=str,
        help="Address of the remote server if using Ray Client.",
    )
    parser.add_argument(
        "--cpus-per-actor",
        type=int,
        default=1,
        help="Sets number of CPUs per xgboost training worker.",
    )
    parser.add_argument(
        "--num-actors",
        type=int,
        default=4,
        help="Sets number of xgboost workers to use.",
    )
    parser.add_argument(
        "--smoke-test",
        action="store_true",
        default=False,
        help="Finish quickly for testing",
    )

    args, _ = parser.parse_known_args()

    if args.smoke_test:
        ray.init(num_cpus=args.num_actors + 1)
    elif args.server_address:
        ray.util.connect(args.server_address)
    else:
        ray.init(address=args.address)

    main(args.cpus_per_actor, args.num_actors)


================================================
FILE: xgboost_ray/examples/simple_tune.py
================================================
import argparse
import os

import ray
from ray import tune
from sklearn import datasets
from sklearn.model_selection import train_test_split

import xgboost_ray
from xgboost_ray import RayDMatrix, RayParams, train


def train_breast_cancer(config, ray_params):
    # Load dataset
    data, labels = datasets.load_breast_cancer(return_X_y=True)
    # Split into train and test set
    train_x, test_x, train_y, test_y = train_test_split(data, labels, test_size=0.25)

    train_set = RayDMatrix(train_x, train_y)
    test_set = RayDMatrix(test_x, test_y)

    evals_result = {}

    bst = train(
        params=config,
        dtrain=train_set,
        evals=[(test_set, "eval")],
        evals_result=evals_result,
        ray_params=ray_params,
        verbose_eval=False,
        num_boost_round=10,
    )

    model_path = "tuned.xgb"
    bst.save_model(model_path)
    print("Final validation error: {:.4f}".format(evals_result["eval"]["error"][-1]))


def main(cpus_per_actor, num_actors, num_samples):
    # Set XGBoost config.
    config = {
        "tree_method": "approx",
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
        "eta": tune.loguniform(1e-4, 1e-1),
        "subsample": tune.uniform(0.5, 1.0),
        "max_depth": tune.randint(1, 9),
    }

    ray_params = RayParams(
        max_actor_restarts=1,
        gpus_per_actor=0,
        cpus_per_actor=cpus_per_actor,
        num_actors=num_actors,
    )

    analysis = tune.run(
        tune.with_parameters(train_breast_cancer, ray_params=ray_params),
        # Use the `get_tune_resources` helper function to set the resources.
        resources_per_trial=ray_params.get_tune_resources(),
        config=config,
        num_samples=num_samples,
        metric="eval-error",
        mode="min",
    )

    # Load the best model checkpoint.
    best_bst = xgboost_ray.tune.load_model(
        os.path.join(analysis.best_trial.local_path, "tuned.xgb")
    )

    best_bst.save_model("best_model.xgb")

    accuracy = 1.0 - analysis.best_result["eval-error"]
    print(f"Best model parameters: {analysis.best_config}")
    print(f"Best model total accuracy: {accuracy:.4f}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--address", required=False, type=str, help="the address to use for Ray"
    )
    parser.add_argument(
        "--server-address",
        required=False,
        type=str,
        help="Address of the remote server if using Ray Client.",
    )
    parser.add_argument(
        "--cpus-per-actor",
        type=int,
        default=1,
        help="Sets number of CPUs per XGBoost training worker.",
    )
    parser.add_argument(
        "--num-actors",
        type=int,
        default=1,
        help="Sets number of XGBoost workers to use.",
    )
    parser.add_argument(
        "--num-samples", type=int, default=4, help="Number of samples to use for Tune."
    )
    parser.add_argument("--smoke-test", action="store_true", default=False)

    args, _ = parser.parse_known_args()

    if args.smoke_test:
        ray.init(num_cpus=args.num_actors * args.num_samples)
    elif args.server_address:
        ray.util.connect(args.server_address)
    else:
        ray.init(address=args.address)

    main(args.cpus_per_actor, args.num_actors, args.num_samples)


================================================
FILE: xgboost_ray/examples/train_on_test_data.py
================================================
import argparse
import os
import shutil
import time

from xgboost_ray import RayDMatrix, RayParams, train
from xgboost_ray.tests.utils import create_parquet_in_tempdir

####
# Run `create_test_data.py` first to create a large fake data set.
# Alternatively, run with `--smoke-test` to create an ephemeral small fake
# data set.
####


def main(fname, num_actors=2):
    dtrain = RayDMatrix(os.path.abspath(fname), label="labels", ignore=["partition"])

    config = {
        "tree_method": "hist",
        "eval_metric": ["logloss", "error"],
    }

    evals_result = {}

    start = time.time()
    bst = train(
        config,
        dtrain,
        evals_result=evals_result,
        ray_params=RayParams(max_actor_restarts=1, num_actors=num_actors),
        num_boost_round=10,
        evals=[(dtrain, "train")],
    )
    taken = time.time() - start
    print(f"TRAIN TIME TAKEN: {taken:.2f} seconds")

    bst.save_model("test_data.xgb")
    print("Final training error: {:.4f}".format(evals_result["train"]["error"][-1]))


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--smoke-test",
        action="store_true",
        default=False,
        help="Finish quickly for testing",
    )
    args = parser.parse_args()

    temp_dir, path = None, None
    if args.smoke_test:
        temp_dir, path = create_parquet_in_tempdir(
            "smoketest.parquet",
            num_rows=1_000,
            num_features=4,
            num_classes=2,
            num_partitions=2,
        )
    else:
        path = os.path.join(os.path.dirname(__file__), "parted.parquet")

    import ray

    ray.init()

    start = time.time()
    main(path)
    taken = time.time() - start
    print(f"TOTAL TIME TAKEN: {taken:.2f} seconds")

    if args.smoke_test:
        shutil.rmtree(temp_dir)


================================================
FILE: xgboost_ray/examples/train_with_ml_dataset.py
================================================
import argparse
import os
import shutil
import time

from ray.util.data import read_parquet

from xgboost_ray import RayDMatrix, RayParams, train
from xgboost_ray.tests.utils import create_parquet_in_tempdir

####
# Run `create_test_data.py` first to create a large fake data set.
# Alternatively, run with `--smoke-test` to create an ephemeral small fake
# data set.
####


def main(fname, num_actors=2):
    ml_dataset = read_parquet(fname, num_shards=num_actors)

    dtrain = RayDMatrix(ml_dataset, label="labels", ignore=["partition"])

    config = {
        "tree_method": "hist",
        "eval_metric": ["logloss", "error"],
    }

    evals_result = {}

    start = time.time()
    bst = train(
        config,
        dtrain,
        evals_result=evals_result,
        ray_params=RayParams(max_actor_restarts=1, num_actors=num_actors),
        num_boost_round=10,
        evals=[(dtrain, "train")],
    )
    taken = time.time() - start
    print(f"TRAIN TIME TAKEN: {taken:.2f} seconds")

    bst.save_model("test_data.xgb")
    print("Final training error: {:.4f}".format(evals_result["train"]["error"][-1]))


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--smoke-test",
        action="store_true",
        default=False,
        help="Finish quickly for testing",
    )
    args = parser.parse_args()

    temp_dir, path = None, None
    if args.smoke_test:
        temp_dir, path = create_parquet_in_tempdir(
            "smoketest.parquet",
            num_rows=1_000,
            num_features=4,
            num_classes=2,
            num_partitions=2,
        )
    else:
        path = os.path.join(os.path.dirname(__file__), "parted.parquet")

    import ray

    ray.init()

    start = time.time()
    main(path)
    taken = time.time() - start
    print(f"TOTAL TIME TAKEN: {taken:.2f} seconds")

    if args.smoke_test:
        shutil.rmtree(temp_dir)


================================================
FILE: xgboost_ray/main.py
================================================
import functools
import inspect
import multiprocessing
import os
import pickle
import platform
import threading
import time
import warnings
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Sequence, Tuple, Union

import numpy as np
import pandas as pd
from packaging.version import Version
from xgboost.core import XGBoostError

from xgboost_ray.xgb import xgboost as xgb

try:
    from xgboost.core import EarlyStopException
except ImportError:

    class EarlyStopException(XGBoostError):
        pass


# From xgboost>=1.7.0, rabit is replaced by a collective communicator.
try:
    from xgboost.collective import CommunicatorContext

    rabit = None
    HAS_COLLECTIVE = True
except ImportError:
    from xgboost import rabit  # noqa

    CommunicatorContext = None
    HAS_COLLECTIVE = False

from xgboost_ray.callback import DistributedCallback, DistributedCallbackContainer
from xgboost_ray.compat import LEGACY_CALLBACK, RabitTracker, TrainingCallback

try:
    import ray
    from ray import logger
    from ray.actor import ActorHandle
    from ray.exceptions import RayActorError, RayTaskError
    from ray.util import get_node_ip_address, placement_group
    from ray.util.annotations import DeveloperAPI, PublicAPI
    from ray.util.placement_group import (
        PlacementGroup,
        get_current_placement_group,
        remove_placement_group,
    )
    from ray.util.queue import Queue
    from ray.util.scheduling_strategies import (
        NodeAffinitySchedulingStrategy,
        PlacementGroupSchedulingStrategy,
    )

    from xgboost_ray.util import Event, MultiActorTask, force_on_current_node

    DEFAULT_PG = "default"

    RAY_INSTALLED = True
except ImportError:
    ray = get_node_ip_address = Queue = Event = ActorHandle = logger = None

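    def PublicAPI(*args, **kwargs):
        # Fallback when Ray is missing. Supports both the bare `@PublicAPI`
        # form and the parametrized `@PublicAPI(stability="beta")` form.
        def decorator(f):
            @functools.wraps(f)
            def inner_f(*f_args, **f_kwargs):
                return f(*f_args, **f_kwargs)

            return inner_f

        if len(args) == 1 and callable(args[0]) and not kwargs:
            return decorator(args[0])
        return decorator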

    DeveloperAPI = PublicAPI
    RAY_INSTALLED = False

from xgboost_ray.matrix import (
    LEGACY_MATRIX,
    QUANTILE_AVAILABLE,
    RayDataIter,
    RayDeviceQuantileDMatrix,
    RayDMatrix,
    RayQuantileDMatrix,
    combine_data,
    concat_dataframes,
)
from xgboost_ray.session import (
    get_rabit_rank,
    init_session,
    put_queue,
    set_session_queue,
)

RAY_TUNE_INSTALLED = True

try:
    import ray.train
    import ray.tune
except (ImportError, ModuleNotFoundError):
    RAY_TUNE_INSTALLED = False

if RAY_TUNE_INSTALLED:
    from xgboost_ray.tune import _get_tune_resources, _try_add_tune_callback
else:
    _get_tune_resources = _try_add_tune_callback = None


def _get_environ(item: str, old_val: Any):
    env_var = f"RXGB_{item}"
    new_val = old_val
    if env_var in os.environ:
        new_val_str = os.environ.get(env_var)

        if isinstance(old_val, bool):
            new_val = bool(int(new_val_str))
        elif isinstance(old_val, int):
            new_val = int(new_val_str)
        elif isinstance(old_val, float):
            new_val = float(new_val_str)
        else:
            new_val = new_val_str

    return new_val
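

# Illustrative sketch (an addition, not part of the original module): the
# hypothetical helper below mirrors the coercion rules of `_get_environ`.
# It shows why the `bool` check has to come before the `int` check: `bool`
# is a subclass of `int`, so a reversed order would turn "0" into the
# integer 0 instead of False. The name `_example_coerce_env_value` is an
# assumption made for illustration only.
def _example_coerce_env_value(raw: str, default):
    """Coerce the string ``raw`` to the type of ``default``."""
    if isinstance(default, bool):
        # "0" -> False, "1" -> True (any other integer string -> True)
        return bool(int(raw))
    if isinstance(default, int):
        return int(raw)
    if isinstance(default, float):
        return float(raw)
    # Unknown default types keep the raw string value
    return raw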


@dataclass
class _XGBoostEnv:
    # Whether to use SPREAD placement group strategy for training.
    USE_SPREAD_STRATEGY: bool = True

    # How long to wait for placement group creation before failing.
    PLACEMENT_GROUP_TIMEOUT_S: int = 100

    # Status report frequency when waiting for initial actors
    # and during training
    STATUS_FREQUENCY_S: int = 30

    # If restarting failed actors is disabled
    ELASTIC_RESTART_DISABLED: bool = False

    # How often to check for new available resources
    ELASTIC_RESTART_RESOURCE_CHECK_S: int = 30

    # How long to wait before triggering a new start of the training loop
    # when new actors become available
    ELASTIC_RESTART_GRACE_PERIOD_S: int = 10

    # Whether to allow soft-placement of communication processes. If True,
    # the Queue and Event actors may be scheduled on non-driver nodes.
    COMMUNICATION_SOFT_PLACEMENT: bool = True

    def __getattribute__(self, item):
        old_val = super(_XGBoostEnv, self).__getattribute__(item)
        new_val = _get_environ(item, old_val)
        if new_val != old_val:
            setattr(self, item, new_val)
        return super(_XGBoostEnv, self).__getattribute__(item)


ENV = _XGBoostEnv()

xgboost_version = xgb.__version__ if xgb else "0.0.0"

LEGACY_WARNING = (
    f"You are using `xgboost_ray` with a legacy XGBoost version "
    f"(version {xgboost_version}). While we try to support "
    f"older XGBoost versions, please note that this library is only "
    f"fully tested and supported for XGBoost >= 1.4. Please consider "
    f"upgrading your XGBoost version (`pip install -U xgboost`)."
)

# XGBoost version for comparisons
XGBOOST_VERSION = Version(xgboost_version)


class RayXGBoostTrainingError(RuntimeError):
    """Raised from RayXGBoostActor.train() when the local xgb.train function
    did not complete."""

    pass


class RayXGBoostTrainingStopped(RuntimeError):
    """Raised from RayXGBoostActor.train() when training was deliberately
    stopped."""

    pass


class RayXGBoostActorAvailable(RuntimeError):
    """Raised from `_update_scheduled_actor_states()` when new actors become
    available in elastic training."""

    pass


def _assert_ray_support():
    if not RAY_INSTALLED:
        raise ImportError(
            "Ray needs to be installed in order to use this module. "
            "Try: `pip install ray`"
        )


def _in_ray_tune_session() -> bool:
    return (
        RAY_TUNE_INSTALLED and ray.train.get_context().get_trial_resources() is not None
    )


def _maybe_print_legacy_warning():
    if LEGACY_MATRIX or LEGACY_CALLBACK:
        warnings.warn(LEGACY_WARNING)


def _is_client_connected() -> bool:
    try:
        return ray.util.client.ray.is_connected()
    except Exception:
        return False


class _RabitTrackerCompatMixin:
    """Fallback calls to legacy terminology"""

    def accept_workers(self, n_workers: int):
        return self.accept_slaves(n_workers)

    def worker_envs(self):
        return self.slave_envs()


class _RabitTracker(RabitTracker, _RabitTrackerCompatMixin):
    """
    This class overrides the xgboost-provided RabitTracker to switch
    from a daemon thread to a multiprocessing Process. This is so that
    we are able to terminate/kill the tracking process at will.
    """

    def start(self, nworker):
        # TODO: refactor RabitTracker to support spawn process creation.
        # In python 3.8, spawn is used as default process creation on macOS.
        # But spawn doesn't work because `run` is not pickleable.
        # For now we force the start method to use fork.
        multiprocessing.set_start_method("fork", force=True)

        def run():
            self.accept_workers(nworker)

        self.thread = multiprocessing.Process(target=run, args=())
        self.thread.start()


def _start_rabit_tracker(num_workers: int):
    """Start Rabit tracker. The workers connect to this tracker to share
    their results.

    The Rabit tracker is the main process that all local workers connect to
    to share their weights. When one or more actors die, we want to
    restart the Rabit tracker, too, for two reasons: First we don't want to
    be potentially stuck with stale connections from old training processes.
    Second, we might restart training with a different number of actors, and
    for that we would have to restart the tracker anyway.

    To do this we start the Tracker in its own subprocess with its own PID.
    We can use this process then to specifically kill/terminate the tracker
    process in `_stop_rabit_tracker` without touching other functionality.
    """
    host = get_node_ip_address()

    env = {"DMLC_NUM_WORKER": num_workers}

    rabit_tracker = _RabitTracker(host, num_workers)

    # Get tracker host + port
    env.update(rabit_tracker.worker_envs())
    rabit_tracker.start(num_workers)

    logger.debug(f"Started Rabit tracker process with PID {rabit_tracker.thread.pid}")

    return rabit_tracker.thread, env


def _stop_rabit_tracker(rabit_process: multiprocessing.Process):
    logger.debug(f"Stopping Rabit process with PID {rabit_process.pid}")
    rabit_process.join(timeout=5)
    rabit_process.terminate()


class _RabitContextBase:
    """This context is used by local training actors to connect to the
    Rabit tracker.

    Args:
        actor_id: Unique actor ID
        args: Arguments for Rabit initialisation. These are
            environment variables to configure Rabit clients.
    """

    def __init__(self, actor_id: int, args: dict):
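        # Callers pass `str(id(self))`; format instead of concatenating so
        # an int `actor_id` (as annotated) does not raise a TypeError.
        args["DMLC_TASK_ID"] = f"[xgboost.ray]:{actor_id}"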
        self.args = args


# From xgboost>=1.7.0, rabit is replaced by a collective communicator
if HAS_COLLECTIVE:

    class _RabitContext(_RabitContextBase, CommunicatorContext):
        pass

else:

    class _RabitContext(_RabitContextBase):
        def __init__(self, actor_id: int, args: dict):
            super().__init__(actor_id, args)
            self._list_args = [("%s=%s" % item).encode() for item in self.args.items()]

        def __enter__(self):
            xgb.rabit.init(self._list_args)

        def __exit__(self, *args):
            xgb.rabit.finalize()
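

# Illustrative sketch (an addition, not part of the original module): the
# hypothetical helper below reproduces the encoding used by the legacy
# `_RabitContext` branch above, where every tracker environment entry is
# turned into one `KEY=VALUE` byte string before being handed to
# `rabit.init`. The helper name is an assumption made for illustration.
def _example_encode_rabit_args(args: dict) -> list:
    """Encode an environment dict as a list of ``b"KEY=VALUE"`` items."""
    return [("%s=%s" % item).encode() for item in args.items()]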


def _ray_get_actor_cpus():
    # Get through resource IDs
    if Version(ray.__version__) < Version("2.0.0"):
        # Remove after 2.2?
        resource_ids = ray.worker.get_resource_ids()
        if "CPU" in resource_ids:
            return sum(cpu[1] for cpu in resource_ids["CPU"])
    else:
        resource_ids = ray.get_runtime_context().get_assigned_resources()
        for key in resource_ids.keys():
            if key.startswith("CPU"):
                return resource_ids[key]
    # Fall back to a single CPU if no CPU resource was found in either branch
    return 1


def _ray_get_cluster_cpus():
    return ray.cluster_resources().get("CPU", None)


def _get_min_node_cpus():
    # Smallest CPU count among all alive nodes
    min_node_cpus = min(
        node.get("Resources", {}).get("CPU", 0.0)
        for node in ray.nodes()
        if node.get("Alive", False)
    )
    return min_node_cpus if min_node_cpus > 0.0 else 1.0


def _set_omp_num_threads():
    ray_cpus = _ray_get_actor_cpus()
    if ray_cpus:
        os.environ["OMP_NUM_THREADS"] = str(int(ray_cpus))
    else:
        if "OMP_NUM_THREADS" in os.environ:
            del os.environ["OMP_NUM_THREADS"]
    return int(float(os.environ.get("OMP_NUM_THREADS", "0.0")))


def _prepare_dmatrix_params(param: Dict) -> Dict:
    dm_param = {
        "data": concat_dataframes(param["data"]),
        "label": concat_dataframes(param["label"]),
        "weight": concat_dataframes(param["weight"]),
        "feature_weights": concat_dataframes(param["feature_weights"]),
        "qid": concat_dataframes(param["qid"]),
        "base_margin": concat_dataframes(param["base_margin"]),
        "label_lower_bound": concat_dataframes(param["label_lower_bound"]),
        "label_upper_bound": concat_dataframes(param["label_upper_bound"]),
    }
    return dm_param


def _get_dmatrix(data: RayDMatrix, param: Dict) -> xgb.DMatrix:
    if QUANTILE_AVAILABLE and isinstance(data, RayQuantileDMatrix):
        if isinstance(param["data"], list):
            qdm_param = _prepare_dmatrix_params(param)
            param.update(qdm_param)
        if data.enable_categorical is not None:
            param["enable_categorical"] = data.enable_categorical
        matrix = xgb.QuantileDMatrix(**param)
    elif not LEGACY_MATRIX and isinstance(data, RayDeviceQuantileDMatrix):
        # If we only got a single data shard, create a list so we can
        # iterate over it
        if not isinstance(param["data"], list):
            param["data"] = [param["data"]]

            if not isinstance(param["label"], list):
                param["label"] = [param["label"]]
            if not isinstance(param["weight"], list):
                param["weight"] = [param["weight"]]
            if not isinstance(param["feature_weights"], list):
                param["feature_weights"] = [param["feature_weights"]]
            if not isinstance(param["qid"], list):
                param["qid"] = [param["qid"]]
            if not isinstance(param["base_margin"], list):
                param["base_margin"] = [param["base_margin"]]

        param["label_lower_bound"] = [None]
        param["label_upper_bound"] = [None]

        dm_param = {
            "feature_names": data.feature_names,
            "feature_types": data.feature_types,
            "missing": data.missing,
        }

        if data.enable_categorical is not None:
            dm_param["enable_categorical"] = data.enable_categorical

        param.update(dm_param)
        it = RayDataIter(**param)
        matrix = xgb.DeviceQuantileDMatrix(it, **dm_param)
    else:
        if isinstance(param["data"], list):
            dm_param = _prepare_dmatrix_params(param)
            param.update(dm_param)

        ll = param.pop("label_lower_bound", None)
        lu = param.pop("label_upper_bound", None)
        fw = param.pop("feature_weights", None)

        if LEGACY_MATRIX:
            param.pop("base_margin", None)

        if "qid" not in inspect.signature(xgb.DMatrix).parameters:
            param.pop("qid", None)

        if data.enable_categorical is not None:
            param["enable_categorical"] = data.enable_categorical

        matrix = xgb.DMatrix(**param)

        if not LEGACY_MATRIX:
            matrix.set_info(
                label_lower_bound=ll, label_upper_bound=lu, feature_weights=fw
            )

    data.update_matrix_properties(matrix)
    return matrix


@PublicAPI(stability="beta")
@dataclass
class RayParams:
    """Parameters to configure Ray-specific behavior.

    Args:
        num_actors: Number of parallel Ray actors.
        cpus_per_actor: Number of CPUs to be used per Ray actor.
        gpus_per_actor: Number of GPUs to be used per Ray actor.
        resources_per_actor: Dict of additional resources
            required per Ray actor.
        elastic_training: If True, training will continue with
            fewer actors if an actor fails. Default False.
        max_failed_actors: If `elastic_training` is True, this
            specifies the maximum number of failed actors with which
            we still continue training.
        max_actor_restarts: Number of retries when Ray actors fail.
            Defaults to 0 (no retries). Set to -1 for unlimited retries.
        checkpoint_frequency: How often to save checkpoints. Defaults
            to ``5`` (every 5th iteration).
        verbose: Whether to output Ray-specific info messages
            during training/prediction.
        placement_options: Optional kwargs to pass to
            ``PlacementGroupFactory`` in ``get_tune_resources()``.
    """

    # Actor scheduling
    num_actors: int = 0
    cpus_per_actor: int = 0
    gpus_per_actor: int = -1
    resources_per_actor: Optional[Dict] = None

    # Fault tolerance
    elastic_training: bool = False
    max_failed_actors: int = 0
    max_actor_restarts: int = 0
    checkpoint_frequency: int = 5

    # Distributed callbacks
    distributed_callbacks: Optional[List[DistributedCallback]] = None

    verbose: Optional[bool] = None
    placement_options: Optional[Dict[str, Any]] = None

    def get_tune_resources(self):
        """Return the resources to use for xgboost_ray training with Tune."""
        if self.cpus_per_actor <= 0 or self.num_actors <= 0:
            raise ValueError(
                "num_actors and cpus_per_actor both must be greater than 0."
            )
        return _get_tune_resources(
            num_actors=self.num_actors,
            cpus_per_actor=self.cpus_per_actor,
            gpus_per_actor=max(0, self.gpus_per_actor),
            resources_per_actor=self.resources_per_actor,
            placement_options=self.placement_options,
        )


@dataclass
class _Checkpoint:
    iteration: int = 0
    value: Optional[bytes] = None
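

# Illustrative sketch (an addition, not part of the original module): the
# hypothetical helper below lists the `iteration` values of the
# `_Checkpoint` objects that the rank 0 actor would send for a run that
# completes without early stopping. It assumes 0-based epochs, a periodic
# checkpoint whenever `epoch % frequency == 0`, and a final checkpoint
# tagged with iteration -1, as in `_save_checkpoint_callback` further down.
def _example_checkpoint_iterations(num_boost_round: int, frequency: int) -> list:
    """Return the checkpoint iteration tags for a full training run."""
    periodic = [e for e in range(num_boost_round) if e % frequency == 0]
    # The final model is always reported with iteration -1
    return periodic + [-1]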


def _validate_ray_params(ray_params: Union[None, RayParams, dict]) -> RayParams:
    if ray_params is None:
        ray_params = RayParams()
    elif isinstance(ray_params, dict):
        ray_params = RayParams(**ray_params)
    elif not isinstance(ray_params, RayParams):
        raise ValueError(
            f"`ray_params` must be a `RayParams` instance, a dict, or None, "
            f"but it was {type(ray_params)}."
            f"\nFIX THIS preferably by passing a `RayParams` instance as "
            f"the `ray_params` parameter."
        )
    if ray_params.num_actors <= 0:
        raise ValueError(
            "The `num_actors` parameter is set to 0. Please always specify "
            "the number of distributed actors you want to use."
            "\nFIX THIS by passing a `RayParams(num_actors=X)` argument "
            "to your call to xgboost_ray."
        )
    elif ray_params.num_actors < 2:
        warnings.warn(
            f"`num_actors` in `ray_params` is smaller than 2 "
            f"({ray_params.num_actors}). XGBoost will NOT be distributed!"
        )
    if ray_params.verbose is None and RAY_TUNE_INSTALLED:
        # In Tune/Train sessions, reduce verbosity
        ray_params.verbose = not _in_ray_tune_session()
    return ray_params


@DeveloperAPI
class RayXGBoostActor:
    """Remote Ray XGBoost actor class.

    This remote actor handles local training and prediction of one data
    shard. It initializes a Rabit context, thus connecting to the Rabit
    all-reduce ring, and initializes local training, sending updates
    to other workers.

    The actor with rank 0 also checkpoints the model periodically and
    sends the checkpoint back to the driver.

    Args:
        rank: Rank of the actor. Must be ``0 <= rank < num_actors``.
        num_actors: Total number of actors.
        queue: Ray queue to communicate with main process.
        checkpoint_frequency: How often to store checkpoints. Defaults
            to ``5``, saving checkpoints every 5 boosting rounds.

    """

    def __init__(
        self,
        rank: int,
        num_actors: int,
        queue: Optional[Queue] = None,
        stop_event: Optional[Event] = None,
        checkpoint_frequency: int = 5,
        distributed_callbacks: Optional[List[DistributedCallback]] = None,
    ):
        self.queue = queue
        init_session(rank, self.queue)

        self.rank = rank
        self.num_actors = num_actors

        self.checkpoint_frequency = checkpoint_frequency

        self._data: Dict[RayDMatrix, dict] = {}
        self._local_n: Dict[RayDMatrix, int] = {}

        self._stop_event = stop_event

        self._distributed_callbacks = DistributedCallbackContainer(
            distributed_callbacks
        )

        self._distributed_callbacks.on_init(self)
        _set_omp_num_threads()
        logger.debug(f"Initialized remote XGBoost actor with rank {self.rank}")

    def set_queue(self, queue: Queue):
        self.queue = queue
        set_session_queue(self.queue)

    def set_stop_event(self, stop_event: Event):
        self._stop_event = stop_event

    def _get_stop_event(self):
        return self._stop_event

    def pid(self):
        """Get process PID. Used for checking if still alive"""
        return os.getpid()

    def ip(self):
        """Get node IP address."""
        return get_node_ip_address()

    def _save_checkpoint_callback(self):
        """Send checkpoints to driver"""
        this = self

        class _SaveInternalCheckpointCallback(TrainingCallback):
            def after_iteration(self, model, epoch, evals_log):
                if get_rabit_rank() == 0 and epoch % this.checkpoint_frequency == 0:
                    put_queue(_Checkpoint(epoch, pickle.dumps(model)))

            def after_training(self, model):
                if get_rabit_rank() == 0:
                    put_queue(_Checkpoint(-1, pickle.dumps(model)))
                return model

        return _SaveInternalCheckpointCallback()

    def _stop_callback(self):
        """Stop if event is set"""
        this = self
        # Keep track of initial stop event. Since we're training in a thread,
        # the stop event might be overwritten, which should be handled
        # as if the previous stop event was set.
        initial_stop_event = self._stop_event

        class _StopCallback(TrainingCallback):
            def after_iteration(self, model, epoch, evals_log):
                try:
                    if (
                        this._stop_event.is_set()
                        or this._get_stop_event() is not initial_stop_event
                    ):
                        if LEGACY_CALLBACK:
                            raise EarlyStopException(epoch)
                        # Returning True stops training
                        return True
                except RayActorError:
                    if LEGACY_CALLBACK:
                        raise EarlyStopException(epoch)
                    return True

        return _StopCallback()

    def load_data(self, data: RayDMatrix):
        if data in self._data:
            return

        self._distributed_callbacks.before_data_loading(self, data)

        param = data.get_data(self.rank, self.num_actors)
        if isinstance(param["data"], list):
            self._local_n[data] = sum(len(a) for a in param["data"])
        else:
            self._local_n[data] = len(param["data"])

        # set nthread for dmatrix conversion
        param["nthread"] = int(_ray_get_actor_cpus())
        self._data[data] = param

        self._distributed_callbacks.after_data_loading(self, data)

    def train(
        self,
        rabit_args: List[str],
        return_bst: bool,
        params: Dict[str, Any],
        dtrain: RayDMatrix,
        evals: Sequence[Tuple[RayDMatrix, str]],
        *args,
        **kwargs,
    ) -> Dict[str, Any]:
        self._distributed_callbacks.before_train(self)

        num_threads = _set_omp_num_threads()

        local_params = params.copy()

        if "xgb_model" in kwargs:
            if isinstance(kwargs["xgb_model"], bytes):
                # bytearray type gets lost in remote actor call
                kwargs["xgb_model"] = bytearray(kwargs["xgb_model"])

        if "nthread" not in local_params and "n_jobs" not in local_params:
            if num_threads > 0:
                local_params["nthread"] = num_threads
                local_params["n_jobs"] = num_threads
            else:
                local_params["nthread"] = _ray_get_actor_cpus()
                local_params["n_jobs"] = local_params["nthread"]

        if dtrain not in self._data:
            self.load_data(dtrain)

        for deval, _name in evals:
            if deval not in self._data:
                self.load_data(deval)

        evals_result = dict()

        if "callbacks" in kwargs:
            callbacks = kwargs["callbacks"] or []
        else:
            callbacks = []
        callbacks.append(self._save_checkpoint_callback())
        callbacks.append(self._stop_callback())
        kwargs["callbacks"] = callbacks

        result_dict = {}
        error_dict = {}

        # We run xgb.train in a thread to be able to react to the stop event.
        def _train():
            try:
                with _RabitContext(str(id(self)), rabit_args):
                    local_dtrain = _get_dmatrix(dtrain, self._data[dtrain])

                    if not local_dtrain.get_label().size:
                        raise RuntimeError(
                            "Training data has no label set. Please make sure "
                            "to set the `label` argument when initializing "
                            "`RayDMatrix()` for data you would like "
                            "to train on."
                        )

                    local_evals = []
                    for deval, name in evals:
                        local_evals.append(
                            (_get_dmatrix(deval, self._data[deval]), name)
                        )
                    if LEGACY_CALLBACK:
                        for xgb_callback in kwargs.get("callbacks", []):
                            if isinstance(xgb_callback, TrainingCallback):
                                xgb_callback.before_training(None)

                    bst = xgb.train(
                        local_params,
                        local_dtrain,
                        *args,
                        evals=local_evals,
                        evals_result=evals_result,
                        **kwargs,
                    )

                    if LEGACY_CALLBACK:
                        for xgb_callback in kwargs.get("callbacks", []):
                            if isinstance(xgb_callback, TrainingCallback):
                                xgb_callback.after_training(bst)

                    result_dict.update(
                        {
                            "bst": bst,
                            "evals_result": evals_result,
                            "train_n": self._local_n[dtrain],
                        }
                    )
            except EarlyStopException:
                # Usually this should be caught by XGBoost core.
                # Silent fail, will be raised as RayXGBoostTrainingStopped.
                return
            except XGBoostError as e:
                error_dict.update({"exception": e})
                return

        thread = threading.Thread(target=_train)
        thread.daemon = True
        thread.start()
        while thread.is_alive():
            thread.join(timeout=0)
            if self._stop_event.is_set():
                raise RayXGBoostTrainingStopped("Training was interrupted.")
            time.sleep(0.1)

        if not result_dict:
            raise_from = error_dict.get("exception", None)
            raise RayXGBoostTrainingError("Training failed.") from raise_from

        thread.join()
        self._distributed_callbacks.after_train(self, result_dict)

        if not return_bst:
            result_dict.pop("bst", None)

        return result_dict

    def predict(self, model: xgb.Booster, data: RayDMatrix, **kwargs):
        self._distributed_callbacks.before_predict(self)

        _set_omp_num_threads()

        if data not in self._data:
            self.load_data(data)
        local_data = _get_dmatrix(data, self._data[data])

        predictions = model.predict(local_data, **kwargs)
        if predictions.ndim == 1:
            callback_predictions = pd.Series(predictions)
        else:
            callback_predictions = pd.DataFrame(predictions)
        self._distributed_callbacks.after_predict(self, callback_predictions)
        return predictions
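

# Illustrative sketch (not part of xgboost_ray): the pattern used by
# ``RayXGBoostActor.train()`` above, reduced to the standard library. The
# work runs in a daemon thread while the caller polls a stop event. The name
# ``_sketch_run_interruptible`` and the exception types are hypothetical;
# the real method raises ``RayXGBoostTrainingStopped`` and
# ``RayXGBoostTrainingError`` and polls ``self._stop_event``.
import threading as _sketch_threading


def _sketch_run_interruptible(work, stop_event, poll_s=0.01):
    result = {}
    error = {}

    def _target():
        try:
            result["value"] = work()
        except Exception as e:  # keep the cause for the caller
            error["exception"] = e

    thread = _sketch_threading.Thread(target=_target, daemon=True)
    thread.start()
    while thread.is_alive():
        # Wake up regularly so a stop request is noticed promptly.
        thread.join(timeout=poll_s)
        if stop_event.is_set():
            raise InterruptedError("Work was interrupted.")
    if "value" not in result:
        raise RuntimeError("Work failed.") from error.get("exception")
    return result["value"]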


@ray.remote
class _RemoteRayXGBoostActor(RayXGBoostActor):
    pass


class _PrepareActorTask(MultiActorTask):
    def __init__(
        self,
        actor: ActorHandle,
        queue: Queue,
        stop_event: Event,
        load_data: List[RayDMatrix],
    ):
        futures = []
        futures.append(actor.set_queue.remote(queue))
        futures.append(actor.set_stop_event.remote(stop_event))
        for data in load_data:
            futures.append(actor.load_data.remote(data))

        super(_PrepareActorTask, self).__init__(futures)


def _autodetect_resources(
    ray_params: RayParams, use_tree_method: bool = False
) -> Tuple[int, int]:
    gpus_per_actor = ray_params.gpus_per_actor
    cpus_per_actor = ray_params.cpus_per_actor

    # Automatically set gpus_per_actor if left at the default value
    if gpus_per_actor == -1:
        gpus_per_actor = 0
        if use_tree_method:
            gpus_per_actor = 1

    # Automatically set cpus_per_actor if left at the default value.
    # It is set to the number of cluster CPUs divided by the number of
    # actors, capped at the minimum number of CPUs on any node, and is
    # never lower than 1.
    if cpus_per_actor <= 0:
        cluster_cpus = _ray_get_cluster_cpus() or 1
        cpus_per_actor = max(
            1,
            min(
                int(_get_min_node_cpus() or 1),
                int(cluster_cpus // ray_params.num_actors),
            ),
        )
    return cpus_per_actor, gpus_per_actor
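

# Illustrative sketch (not part of xgboost_ray): the CPU autodetection
# arithmetic above, restated as a pure function so it can be checked without
# a running Ray cluster. The name ``_sketch_cpus_per_actor`` and its
# parameters are hypothetical stand-ins for the values returned by
# ``_ray_get_cluster_cpus()`` and ``_get_min_node_cpus()``.
def _sketch_cpus_per_actor(cluster_cpus, min_node_cpus, num_actors):
    # Even share of the cluster, capped by the smallest node, never below 1.
    even_share = int((cluster_cpus or 1) // num_actors)
    return max(1, min(int(min_node_cpus or 1), even_share))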


def _create_actor(
    rank: int,
    num_actors: int,
    num_cpus_per_actor: int,
    num_gpus_per_actor: int,
    resources_per_actor: Optional[Dict] = None,
    placement_group: Optional[PlacementGroup] = None,
    queue: Optional[Queue] = None,
    checkpoint_frequency: int = 5,
    distributed_callbacks: Optional[Sequence[DistributedCallback]] = None,
) -> ActorHandle:
    # Send DEFAULT_PG here, which changed in Ray >= 1.5.0
    # If we send `None`, this will ignore the parent placement group and
    # lead to errors e.g. when used within Ray Tune
    actor_cls = _RemoteRayXGBoostActor.options(
        num_cpus=num_cpus_per_actor,
        num_gpus=num_gpus_per_actor,
        resources=resources_per_actor,
        scheduling_strategy=PlacementGroupSchedulingStrategy(
            placement_group=placement_group or DEFAULT_PG,
            placement_group_capture_child_tasks=True,
        ),
    )

    return actor_cls.remote(
        rank=rank,
        num_actors=num_actors,
        queue=queue,
        checkpoint_frequency=checkpoint_frequency,
        distributed_callbacks=distributed_callbacks,
    )


def _trigger_data_load(actor, dtrain, evals):
    wait_load = [actor.load_data.remote(dtrain)]
    for deval, _name in evals:
        wait_load.append(actor.load_data.remote(deval))
    return wait_load


def _handle_queue(queue: Queue, checkpoint: _Checkpoint, callback_returns: Dict):
    """Handle results obtained from workers through the remote Queue object.

    Remote actors supply these results via the
    ``xgboost_ray.session.put_queue()`` function. These can be:

    - Callables. These will be called immediately with no arguments.
    - ``_Checkpoint`` objects. These will update the latest checkpoint
      object on the driver.
    - Any other type. These will be appended to an actor rank-specific
      ``callback_returns`` dict that will be written to the
      ``additional_returns`` dict of the :func:`train() <train>` method.
    """
    while not queue.empty():
        (actor_rank, item) = queue.get()
        if callable(item):
            item()
        elif isinstance(item, _Checkpoint):
            checkpoint.__dict__.update(item.__dict__)
        else:
            callback_returns[actor_rank].append(item)
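

# Illustrative sketch (not part of xgboost_ray): the dispatch rules from the
# ``_handle_queue`` docstring, using the standard-library ``queue.Queue`` in
# place of the remote ``Queue`` and a hypothetical ``_SketchCheckpoint``
# class in place of ``_Checkpoint``, so it runs without Ray. The attribute
# names on ``_SketchCheckpoint`` are assumptions made for the example.
import queue as _sketch_queue


class _SketchCheckpoint:
    def __init__(self, iteration=0, value=None):
        self.iteration = iteration
        self.value = value


def _sketch_handle_queue(q, checkpoint, callback_returns):
    while not q.empty():
        actor_rank, item = q.get()
        if callable(item):
            # Callables are invoked immediately with no arguments.
            item()
        elif isinstance(item, _SketchCheckpoint):
            # Checkpoints overwrite the driver-side checkpoint in place.
            checkpoint.__dict__.update(item.__dict__)
        else:
            # Everything else is collected per actor rank.
            callback_returns[actor_rank].append(item)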


def _shutdown(
    actors: List[ActorHandle],
    pending_actors: Optional[Dict[int, Tuple[ActorHandle, _PrepareActorTask]]] = None,
    queue: Optional[Queue] = None,
    event: Optional[Event] = None,
    placement_group: Optional[PlacementGroup] = None,
    force: bool = False,
):
    alive_actors = [a for a in actors if a is not None]
    if pending_actors:
        alive_actors += [a for (a, _) in pending_actors.values()]

    if force:
        for actor in alive_actors:
            ray.kill(actor)
    else:
        done_refs = [a.__ray_terminate__.remote() for a in alive_actors]
        # Wait 5 seconds for actors to die gracefully.
        done, not_done = ray.wait(done_refs, timeout=5)
        if not_done:
            # If any actor did not terminate gracefully in time, kill
            # all of them.
            for actor in alive_actors:
                ray.kill(actor)
    for i in range(len(actors)):
        actors[i] = None
    if queue:
        queue.shutdown()
    if event:
        event.shutdown()
    if placement_group:
        remove_placement_group(placement_group)


def _create_placement_group(
    cpus_per_actor, gpus_per_actor, resources_per_actor, num_actors, strategy
):
    resources_per_bundle = {"CPU": cpus_per_actor, "GPU": gpus_per_actor}
    extra_resources_per_bundle = (
        {} if resources_per_actor is None else resources_per_actor
    )
    # Create placement group for training worker colocation.
    bundles = [
        {**resources_per_bundle, **extra_resources_per_bundle}
        for _ in range(num_actors)
    ]
    pg = placement_group(bundles, strategy=strategy)
    # Wait for placement group to get created.
    logger.debug("Waiting for placement group to start.")
    timeout = ENV.PLACEMENT_GROUP_TIMEOUT_S
    ready, _ = ray.wait([pg.ready()], timeout=timeout)
    if ready:
        logger.debug("Placement group has started.")
    else:
        raise TimeoutError(
            f"Placement group creation timed out after {timeout} seconds. "
            "Make sure your cluster either has enough resources or use "
            "an autoscaling cluster. Current resources "
            f"available: {ray.available_resources()}, resources requested "
            f"by the placement group: {pg.bundle_specs}. "
            "You can change the timeout by setting the "
            "RXGB_PLACEMENT_GROUP_TIMEOUT_S environment variable."
        )
    return pg
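

# Illustrative sketch (not part of xgboost_ray): how the per-actor bundles
# passed to ``placement_group()`` above are assembled, as a pure function
# that needs no Ray cluster. The name ``_sketch_build_bundles`` is
# hypothetical.
def _sketch_build_bundles(
    cpus_per_actor, gpus_per_actor, resources_per_actor, num_actors
):
    base = {"CPU": cpus_per_actor, "GPU": gpus_per_actor}
    extra = {} if resources_per_actor is None else resources_per_actor
    # One independent dict per actor. Keys in ``extra`` win over ``base``
    # because they are unpacked last.
    return [{**base, **extra} for _ in range(num_actors)]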


def _create_communication_processes(added_tune_callback: bool = False):
    # Have to explicitly set num_cpus to 0.
    placement_option = {"num_cpus": 0}
    current_pg = get_current_placement_group()
    if current_pg is not None:
        # If we are already in a placement group, use it.
        # If we are running within Tune, also force the Queue and
        # StopEvent onto the same bundle as the Trainable.
        placement_option.update(
            {
                "placement_group": current_pg,
                "placement_group_bundle_index": 0 if added_tune_callback else -1,
            }
        )
    else:
        # Create Queue and Event actors and make sure to colocate with
        # driver node.
        node_id = ray.get_runtime_context().get_node_id()
        placement_option.update(
            {

SYMBOL INDEX (564 symbols across 54 files)

FILE: xgboost_ray/callback.py
  class DistributedCallback (line 14) | class DistributedCallback(ABC):
    method on_init (line 29) | def on_init(self, actor: "RayXGBoostActor", *args, **kwargs):
    method before_data_loading (line 32) | def before_data_loading(
    method after_data_loading (line 37) | def after_data_loading(
    method before_train (line 42) | def before_train(self, actor: "RayXGBoostActor", *args, **kwargs):
    method after_train (line 45) | def after_train(self, actor: "RayXGBoostActor", result_dict: Dict, *ar...
    method before_predict (line 48) | def before_predict(self, actor: "RayXGBoostActor", *args, **kwargs):
    method after_predict (line 51) | def after_predict(
  class DistributedCallbackContainer (line 62) | class DistributedCallbackContainer:
    method __init__ (line 63) | def __init__(self, callbacks: Sequence[DistributedCallback]):
    method on_init (line 66) | def on_init(self, actor: "RayXGBoostActor", *args, **kwargs):
    method before_data_loading (line 70) | def before_data_loading(
    method after_data_loading (line 76) | def after_data_loading(
    method before_train (line 82) | def before_train(self, actor: "RayXGBoostActor", *args, **kwargs):
    method after_train (line 86) | def after_train(self, actor: "RayXGBoostActor", result_dict: Dict, *ar...
    method before_predict (line 90) | def before_predict(self, actor: "RayXGBoostActor", *args, **kwargs):
    method after_predict (line 94) | def after_predict(
  class EnvironmentCallback (line 105) | class EnvironmentCallback(DistributedCallback):
    method __init__ (line 106) | def __init__(self, env_dict: Dict[str, Any]):
    method on_init (line 109) | def on_init(self, actor, *args, **kwargs):

FILE: xgboost_ray/compat/__init__.py
  class TrainingCallback (line 12) | class TrainingCallback:
    method __init__ (line 13) | def __init__(self):
    method __call__ (line 22) | def __call__(self, callback_env: "xgb.core.CallbackEnv"):
    method before_training (line 37) | def before_training(self, model):
    method after_training (line 40) | def after_training(self, model):

FILE: xgboost_ray/compat/tracker.py
  class ExSocket (line 30) | class ExSocket(object):
    method __init__ (line 35) | def __init__(self, sock):
    method recvall (line 38) | def recvall(self, nbytes):
    method recvint (line 47) | def recvint(self):
    method sendint (line 50) | def sendint(self, n):
    method sendstr (line 53) | def sendstr(self, s):
    method recvstr (line 57) | def recvstr(self):
  function get_some_ip (line 66) | def get_some_ip(host):
  function get_host_ip (line 70) | def get_host_ip(hostIP=None):
  function get_family (line 94) | def get_family(addr):
  class SlaveEntry (line 98) | class SlaveEntry(object):
    method __init__ (line 99) | def __init__(self, sock, s_addr):
    method decide_rank (line 113) | def decide_rank(self, job_map):
    method assign_rank (line 120) | def assign_rank(self, rank, wait_conn, tree_map, parent_map, ring_map):
  class RabitTracker (line 178) | class RabitTracker(object):
    method __init__ (line 183) | def __init__(self, hostIP, nslave, port=9091, port_end=9999):
    method __del__ (line 203) | def __del__(self):
    method get_neighbor (line 207) | def get_neighbor(rank, nslave):
    method slave_envs (line 218) | def slave_envs(self):
    method get_tree (line 225) | def get_tree(self, nslave):
    method find_share_ring (line 233) | def find_share_ring(self, tree_map, parent_map, r):
    method get_ring (line 252) | def get_ring(self, tree_map, parent_map):
    method get_link_map (line 267) | def get_link_map(self, nslave):
    method accept_slaves (line 294) | def accept_slaves(self, nslave):
    method start (line 368) | def start(self, nslave):
    method join (line 375) | def join(self):
    method alive (line 379) | def alive(self):

FILE: xgboost_ray/data_sources/_distributed.py
  function get_actor_rank_ips (line 10) | def get_actor_rank_ips(actors: Sequence[ActorHandle]) -> Dict[int, str]:
  function assign_partitions_to_actors (line 24) | def assign_partitions_to_actors(

FILE: xgboost_ray/data_sources/csv.py
  class CSV (line 9) | class CSV(DataSource):
    method is_data_type (line 16) | def is_data_type(data: Any, filetype: Optional[RayFileType] = None) ->...
    method get_filetype (line 20) | def get_filetype(data: Any) -> Optional[RayFileType]:
    method load_data (line 26) | def load_data(
    method get_n (line 46) | def get_n(data: Any):

FILE: xgboost_ray/data_sources/dask.py
  function _assert_dask_installed (line 24) | def _assert_dask_installed():
  function ensure_ray_dask_initialized (line 37) | def ensure_ray_dask_initialized(
  class Dask (line 45) | class Dask(DataSource):
    method is_data_type (line 59) | def is_data_type(data: Any, filetype: Optional[RayFileType] = None) ->...
    method load_data (line 69) | def load_data(
    method convert_to_series (line 97) | def convert_to_series(data: Any) -> pd.Series:
    method get_actor_shards (line 114) | def get_actor_shards(
    method get_n (line 128) | def get_n(data: Any):
  function get_ip_to_parts (line 136) | def get_ip_to_parts(data: Any) -> Dict[int, Sequence[Any]]:

FILE: xgboost_ray/data_sources/data_source.py
  class RayFileType (line 13) | class RayFileType(Enum):
  class DataSource (line 22) | class DataSource:
    method is_data_type (line 40) | def is_data_type(data: Any, filetype: Optional[RayFileType] = None) ->...
    method get_filetype (line 57) | def get_filetype(data: Any) -> Optional[RayFileType]:
    method load_data (line 73) | def load_data(
    method update_feature_names (line 96) | def update_feature_names(matrix: "xgb.DMatrix", feature_names: Optiona...
    method convert_to_series (line 108) | def convert_to_series(data: Any) -> pd.Series:
    method get_column (line 119) | def get_column(
    method get_n (line 133) | def get_n(data: Any):
    method get_actor_shards (line 138) | def get_actor_shards(

FILE: xgboost_ray/data_sources/modin.py
  function _assert_modin_installed (line 33) | def _assert_modin_installed():
  class Modin (line 48) | class Modin(DataSource):
    method is_data_type (line 62) | def is_data_type(data: Any, filetype: Optional[RayFileType] = None) ->...
    method load_data (line 72) | def load_data(
    method convert_to_series (line 100) | def convert_to_series(data: Any) -> pd.Series:
    method get_actor_shards (line 114) | def get_actor_shards(
    method get_n (line 138) | def get_n(data: Any):

FILE: xgboost_ray/data_sources/numpy.py
  class Numpy (line 13) | class Numpy(DataSource):
    method is_data_type (line 17) | def is_data_type(data: Any, filetype: Optional[RayFileType] = None) ->...
    method update_feature_names (line 21) | def update_feature_names(matrix: "xgb.DMatrix", feature_names: Optiona...
    method load_data (line 26) | def load_data(

FILE: xgboost_ray/data_sources/object_store.py
  class ObjectStore (line 11) | class ObjectStore(DataSource):
    method is_data_type (line 15) | def is_data_type(data: Any, filetype: Optional[RayFileType] = None) ->...
    method load_data (line 21) | def load_data(
    method convert_to_series (line 35) | def convert_to_series(data: Any) -> pd.Series:

FILE: xgboost_ray/data_sources/pandas.py
  class Pandas (line 8) | class Pandas(DataSource):
    method is_data_type (line 12) | def is_data_type(data: Any, filetype: Optional[RayFileType] = None) ->...
    method load_data (line 16) | def load_data(

FILE: xgboost_ray/data_sources/parquet.py
  class Parquet (line 9) | class Parquet(DataSource):
    method is_data_type (line 16) | def is_data_type(data: Any, filetype: Optional[RayFileType] = None) ->...
    method get_filetype (line 20) | def get_filetype(data: Any) -> Optional[RayFileType]:
    method load_data (line 26) | def load_data(
    method get_n (line 47) | def get_n(data: Any):

FILE: xgboost_ray/data_sources/partitioned.py
  class Partitioned (line 18) | class Partitioned(DataSource):
    method is_data_type (line 33) | def is_data_type(data: Any, filetype: Optional[RayFileType] = None) ->...
    method load_data (line 37) | def load_data(
    method get_actor_shards (line 67) | def get_actor_shards(
    method get_n (line 97) | def get_n(data: Any):

FILE: xgboost_ray/data_sources/petastorm.py
  function _assert_petastorm_installed (line 15) | def _assert_petastorm_installed():
  class Petastorm (line 27) | class Petastorm(DataSource):
    method is_data_type (line 41) | def is_data_type(data: Any, filetype: Optional[RayFileType] = None) ->...
    method get_filetype (line 45) | def get_filetype(data: Any) -> Optional[RayFileType]:
    method load_data (line 66) | def load_data(
    method get_n (line 88) | def get_n(data: Any):

FILE: xgboost_ray/data_sources/ray_dataset.py
  function _assert_ray_data_available (line 20) | def _assert_ray_data_available():
  class RayDataset (line 32) | class RayDataset(DataSource):
    method is_data_type (line 40) | def is_data_type(data: Any, filetype: Optional[RayFileType] = None) ->...
    method load_data (line 47) | def load_data(
    method convert_to_series (line 73) | def convert_to_series(
    method get_actor_shards (line 87) | def get_actor_shards(
    method get_n (line 106) | def get_n(data: "ray.data.dataset.Dataset"):

FILE: xgboost_ray/elastic.py
  function _maybe_schedule_new_actors (line 19) | def _maybe_schedule_new_actors(
  function _update_scheduled_actor_states (line 98) | def _update_scheduled_actor_states(training_state: _TrainingState):
  function _get_actor_alive_status (line 145) | def _get_actor_alive_status(

FILE: xgboost_ray/examples/create_test_data.py
  function main (line 4) | def main():

FILE: xgboost_ray/examples/higgs.py
  function download_higgs (line 9) | def download_higgs(target_file):
  function main (line 29) | def main():

FILE: xgboost_ray/examples/higgs_parquet.py
  function csv_to_parquet (line 15) | def csv_to_parquet(in_file, out_file, chunksize=100_000, **csv_kwargs):
  function main (line 43) | def main():

FILE: xgboost_ray/examples/readme.py
  function readme_simple (line 4) | def readme_simple():
  function readme_predict (line 29) | def readme_predict():
  function readme_tune (line 45) | def readme_tune():

FILE: xgboost_ray/examples/readme_sklearn_api.py
  function readme_sklearn_api (line 1) | def readme_sklearn_api():

FILE: xgboost_ray/examples/simple.py
  function main (line 10) | def main(cpus_per_actor, num_actors):

FILE: xgboost_ray/examples/simple_dask.py
  function main (line 11) | def main(cpus_per_actor, num_actors):

FILE: xgboost_ray/examples/simple_modin.py
  function main (line 11) | def main(cpus_per_actor, num_actors):

FILE: xgboost_ray/examples/simple_objectstore.py
  function main (line 10) | def main(cpus_per_actor, num_actors):

FILE: xgboost_ray/examples/simple_partitioned.py
  class AnActor (line 14) | class AnActor:
    method genData (line 20) | def genData(self, rank, nranks, nrows):
  class Parted (line 36) | class Parted:
    method __init__ (line 39) | def __init__(self, parted):
  function main (line 43) | def main(cpus_per_actor, num_actors):

FILE: xgboost_ray/examples/simple_predict.py
  function main (line 10) | def main():

FILE: xgboost_ray/examples/simple_ray_dataset.py
  function main (line 11) | def main(cpus_per_actor, num_actors):

FILE: xgboost_ray/examples/simple_tune.py
  function train_breast_cancer (line 13) | def train_breast_cancer(config, ray_params):
  function main (line 39) | def main(cpus_per_actor, num_actors, num_samples):

FILE: xgboost_ray/examples/train_on_test_data.py
  function main (line 16) | def main(fname, num_actors=2):

FILE: xgboost_ray/examples/train_with_ml_dataset.py
  function main (line 18) | def main(fname, num_actors=2):

FILE: xgboost_ray/main.py
  class EarlyStopException (line 24) | class EarlyStopException(XGBoostError):
  function PublicAPI (line 69) | def PublicAPI(f):
  function _get_environ (line 110) | def _get_environ(item: str, old_val: Any):
  class _XGBoostEnv (line 129) | class _XGBoostEnv:
    method __getattribute__ (line 154) | def __getattribute__(self, item):
  class RayXGBoostTrainingError (line 178) | class RayXGBoostTrainingError(RuntimeError):
  class RayXGBoostTrainingStopped (line 185) | class RayXGBoostTrainingStopped(RuntimeError):
  class RayXGBoostActorAvailable (line 192) | class RayXGBoostActorAvailable(RuntimeError):
  function _assert_ray_support (line 199) | def _assert_ray_support():
  function _in_ray_tune_session (line 207) | def _in_ray_tune_session() -> bool:
  function _maybe_print_legacy_warning (line 213) | def _maybe_print_legacy_warning():
  function _is_client_connected (line 218) | def _is_client_connected() -> bool:
  class _RabitTrackerCompatMixin (line 225) | class _RabitTrackerCompatMixin:
    method accept_workers (line 228) | def accept_workers(self, n_workers: int):
    method worker_envs (line 231) | def worker_envs(self):
  class _RabitTracker (line 235) | class _RabitTracker(RabitTracker, _RabitTrackerCompatMixin):
    method start (line 242) | def start(self, nworker):
  function _start_rabit_tracker (line 256) | def _start_rabit_tracker(num_workers: int):
  function _stop_rabit_tracker (line 286) | def _stop_rabit_tracker(rabit_process: multiprocessing.Process):
  class _RabitContextBase (line 292) | class _RabitContextBase:
    method __init__ (line 302) | def __init__(self, actor_id: int, args: dict):
  class _RabitContext (line 310) | class _RabitContext(_RabitContextBase, CommunicatorContext):
    method __init__ (line 316) | def __init__(self, actor_id: int, args: dict):
    method __enter__ (line 320) | def __enter__(self):
    method __exit__ (line 323) | def __exit__(self, *args):
  class _RabitContext (line 315) | class _RabitContext(_RabitContextBase):
    method __init__ (line 316) | def __init__(self, actor_id: int, args: dict):
    method __enter__ (line 320) | def __enter__(self):
    method __exit__ (line 323) | def __exit__(self, *args):
  function _ray_get_actor_cpus (line 327) | def _ray_get_actor_cpus():
  function _ray_get_cluster_cpus (line 342) | def _ray_get_cluster_cpus():
  function _get_min_node_cpus (line 346) | def _get_min_node_cpus():
  function _set_omp_num_threads (line 355) | def _set_omp_num_threads():
  function _prepare_dmatrix_params (line 365) | def _prepare_dmatrix_params(param: Dict) -> Dict:
  function _get_dmatrix (line 379) | def _get_dmatrix(data: RayDMatrix, param: Dict) -> xgb.DMatrix:
  class RayParams (line 450) | class RayParams:
    method get_tune_resources (line 492) | def get_tune_resources(self):
  class _Checkpoint (line 508) | class _Checkpoint:
  function _validate_ray_params (line 513) | def _validate_ray_params(ray_params: Union[None, RayParams, dict]) -> Ra...
  class RayXGBoostActor (line 544) | class RayXGBoostActor:
    method __init__ (line 564) | def __init__(
    method set_queue (line 594) | def set_queue(self, queue: Queue):
    method set_stop_event (line 598) | def set_stop_event(self, stop_event: Event):
    method _get_stop_event (line 601) | def _get_stop_event(self):
    method pid (line 604) | def pid(self):
    method ip (line 608) | def ip(self):
    method _save_checkpoint_callback (line 612) | def _save_checkpoint_callback(self):
    method _stop_callback (line 628) | def _stop_callback(self):
    method load_data (line 654) | def load_data(self, data: RayDMatrix):
    method train (line 672) | def train(
    method predict (line 795) | def predict(self, model: xgb.Booster, data: RayDMatrix, **kwargs):
  class _RemoteRayXGBoostActor (line 814) | class _RemoteRayXGBoostActor(RayXGBoostActor):
  class _PrepareActorTask (line 818) | class _PrepareActorTask(MultiActorTask):
    method __init__ (line 819) | def __init__(
  function _autodetect_resources (line 835) | def _autodetect_resources(
  function _create_actor (line 862) | def _create_actor(
  function _trigger_data_load (line 895) | def _trigger_data_load(actor, dtrain, evals):
  function _handle_queue (line 902) | def _handle_queue(queue: Queue, checkpoint: _Checkpoint, callback_return...
  function _shutdown (line 925) | def _shutdown(
  function _create_placement_group (line 958) | def _create_placement_group(
  function _create_communication_processes (line 990) | def _create_communication_processes(added_tune_callback: bool = False):
  function _validate_kwargs_for_func (line 1022) | def _validate_kwargs_for_func(kwargs: Dict[str, Any], func: Callable, fu...
  class _TrainingState (line 1039) | class _TrainingState:
  function _train (line 1061) | def _train(
  function train (line 1341) | def train(
  function _predict (line 1750) | def _predict(model: xgb.Booster, data: RayDMatrix, ray_params: RayParams...
  function predict (line 1810) | def predict(

FILE: xgboost_ray/matrix.py
  class RayDataset (line 39) | class RayDataset:
  function concat_dataframes (line 65) | def concat_dataframes(dfs: List[Optional[pd.DataFrame]]):
  function ensure_sorted_by_qid (line 70) | def ensure_sorted_by_qid(
  class RayShardingMode (line 106) | class RayShardingMode(Enum):
  class RayDataIter (line 128) | class RayDataIter(DataIter):
    method __init__ (line 129) | def __init__(
    method __len__ (line 163) | def __len__(self):
    method reset (line 166) | def reset(self):
    method _prop (line 169) | def _prop(self, ref):
    method next (line 179) | def next(self, input_data: Callable):
  class _RayDMatrixLoader (line 199) | class _RayDMatrixLoader:
    method __init__ (line 200) | def __init__(
    method get_data_source (line 262) | def get_data_source(self) -> Type[DataSource]:
    method assert_enough_shards_for_actors (line 265) | def assert_enough_shards_for_actors(self, num_actors: int):
    method update_matrix_properties (line 270) | def update_matrix_properties(self, matrix: "xgb.DMatrix"):
    method assign_shards_to_actors (line 274) | def assign_shards_to_actors(self, actors: Sequence[ActorHandle]) -> bool:
    method _split_dataframe (line 283) | def _split_dataframe(
    method load_data (line 360) | def load_data(
  class _CentralRayDMatrixLoader (line 366) | class _CentralRayDMatrixLoader(_RayDMatrixLoader):
    method get_data_source (line 369) | def get_data_source(self) -> Type[DataSource]:
    method load_data (line 431) | def load_data(
  class _DistributedRayDMatrixLoader (line 490) | class _DistributedRayDMatrixLoader(_RayDMatrixLoader):
    method get_data_source (line 493) | def get_data_source(self) -> Type[DataSource]:
    method assert_enough_shards_for_actors (line 576) | def assert_enough_shards_for_actors(self, num_actors: int):
    method assign_shards_to_actors (line 595) | def assign_shards_to_actors(self, actors: Sequence[ActorHandle]) -> bool:
    method load_data (line 614) | def load_data(
  class RayDMatrix (line 697) | class RayDMatrix:
    method __init__ (line 787) | def __init__(
    method has_label (line 891) | def has_label(self):
    method assign_shards_to_actors (line 894) | def assign_shards_to_actors(self, actors: Sequence[ActorHandle]) -> bool:
    method assert_enough_shards_for_actors (line 900) | def assert_enough_shards_for_actors(self, num_actors: int):
    method load_data (line 903) | def load_data(self, num_actors: Optional[int] = None, rank: Optional[i...
    method get_data (line 936) | def get_data(
    method unload_data (line 954) | def unload_data(self):
    method update_matrix_properties (line 961) | def update_matrix_properties(self, matrix: "xgb.DMatrix"):
    method __hash__ (line 964) | def __hash__(self):
    method __eq__ (line 967) | def __eq__(self, other):
  class RayQuantileDMatrix (line 971) | class RayQuantileDMatrix(RayDMatrix):
  class RayDeviceQuantileDMatrix (line 977) | class RayDeviceQuantileDMatrix(RayDMatrix):
    method __init__ (line 980) | def __init__(
    method get_data (line 1024) | def get_data(
  function _can_load_distributed (line 1036) | def _can_load_distributed(source: Data) -> bool:
  function _detect_distributed (line 1063) | def _detect_distributed(source: Data) -> bool:
  function _get_sharding_indices (line 1088) | def _get_sharding_indices(
  function combine_data (line 1114) | def combine_data(sharding: RayShardingMode, data: Iterable) -> np.ndarray:

FILE: xgboost_ray/session.py
  class RayXGBoostSession (line 8) | class RayXGBoostSession:
    method __init__ (line 9) | def __init__(self, rank: int, queue: Optional[Queue]):
    method get_actor_rank (line 13) | def get_actor_rank(self):
    method set_queue (line 16) | def set_queue(self, queue):
    method put_queue (line 19) | def put_queue(self, item):
  function init_session (line 33) | def init_session(*args, **kwargs):
  function get_session (line 44) | def get_session() -> RayXGBoostSession:
  function set_session_queue (line 56) | def set_session_queue(queue: Queue):
  function get_actor_rank (line 62) | def get_actor_rank() -> int:
  function get_rabit_rank (line 68) | def get_rabit_rank() -> int:
  function put_queue (line 79) | def put_queue(*args, **kwargs):

FILE: xgboost_ray/sklearn.py
  function _deprecate_positional_args (line 56) | def _deprecate_positional_args(f):
  function _wrap_evaluation_matrices (line 72) | def _wrap_evaluation_matrices(
  function _cls_predict_proba (line 176) | def _cls_predict_proba(n_classes: int, prediction, vstack: Callable):
  function _get_doc (line 223) | def _get_doc(object: Any) -> Optional[str]:
  function _treat_estimator_doc (line 239) | def _treat_estimator_doc(doc: Optional[str]) -> Optional[str]:
  function _treat_X_doc (line 253) | def _treat_X_doc(doc: Optional[str]) -> Optional[str]:
  function _xgboost_version_warn (line 268) | def _xgboost_version_warn(f):
  function _check_if_params_are_ray_dmatrix (line 280) | def _check_if_params_are_ray_dmatrix(
  class RayXGBMixin (line 338) | class RayXGBMixin:
    method _ray_set_ray_params_n_jobs (line 341) | def _ray_set_ray_params_n_jobs(
    method _ray_predict (line 357) | def _ray_predict(
    method _ray_get_wrap_evaluation_matrices_compat_kwargs (line 392) | def _ray_get_wrap_evaluation_matrices_compat_kwargs(
    method _configure_fit (line 418) | def _configure_fit(
    method _set_evaluation_result (line 442) | def _set_evaluation_result(self, evals_result) -> None:
  class RayXGBRegressor (line 451) | class RayXGBRegressor(XGBRegressor, RayXGBMixin):
    method fit (line 455) | def fit(
    method _can_use_inplace_predict (line 564) | def _can_use_inplace_predict(self) -> bool:
    method predict (line 567) | def predict(
    method load_model (line 593) | def load_model(self, fname):
  class RayXGBRFRegressor (line 602) | class RayXGBRFRegressor(RayXGBRegressor):
    method __init__ (line 607) | def __init__(self, *args, **kwargs):
    method __init__ (line 614) | def __init__(
    method get_xgb_params (line 631) | def get_xgb_params(self):
    method get_num_boosting_rounds (line 636) | def get_num_boosting_rounds(self):
  class RayXGBClassifier (line 644) | class RayXGBClassifier(XGBClassifier, RayXGBMixin):
    method fit (line 648) | def fit(
    method _can_use_inplace_predict (line 795) | def _can_use_inplace_predict(self) -> bool:
    method predict (line 798) | def predict(
    method predict_proba (line 839) | def predict_proba(
    method load_model (line 867) | def load_model(self, fname):
  class RayXGBRFClassifier (line 880) | class RayXGBRFClassifier(RayXGBClassifier):
    method __init__ (line 884) | def __init__(self, *args, **kwargs):
    method __init__ (line 891) | def __init__(
    method get_xgb_params (line 908) | def get_xgb_params(self):
    method get_num_boosting_rounds (line 913) | def get_num_boosting_rounds(self):
  class RayXGBRanker (line 921) | class RayXGBRanker(XGBRanker, RayXGBMixin):
    method fit (line 925) | def fit(
    method _can_use_inplace_predict (line 1048) | def _can_use_inplace_predict(self) -> bool:
    method predict (line 1051) | def predict(
    method load_model (line 1077) | def load_model(self, fname):

FILE: xgboost_ray/tests/conftest.py
  function get_default_fixure_system_config (line 14) | def get_default_fixure_system_config():
  function get_default_fixture_ray_kwargs (line 24) | def get_default_fixture_ray_kwargs():
  function _ray_start_cluster (line 37) | def _ray_start_cluster(**kwargs):
  function ray_start_cluster (line 69) | def ray_start_cluster(request):

FILE: xgboost_ray/tests/fault_tolerance.py
  class FaultToleranceManager (line 15) | class FaultToleranceManager:
    method __init__ (line 16) | def __init__(self, start_boost_round: int = 0):
    method schedule_kill (line 29) | def schedule_kill(self, rank: int, boost_round: int):
    method delay_return (line 33) | def delay_return(self, rank: int, start_boost_round: int, end_boost_ro...
    method inc_boost_round (line 37) | def inc_boost_round(self, rank: int):
    method log_iteration (line 42) | def log_iteration(self, rank: int, boost_round: int):
    method should_die (line 46) | def should_die(self, rank: int):
    method should_sleep (line 56) | def should_sleep(self, rank: int):
    method get_logs (line 64) | def get_logs(self):
  class DelayedLoadingCallback (line 68) | class DelayedLoadingCallback(DistributedCallback):
    method __init__ (line 71) | def __init__(self, ft_manager: ActorHandle, reload_data=True, sleep_ti...
    method after_data_loading (line 76) | def after_data_loading(self, actor, data, *args, **kwargs):
  class DieCallback (line 83) | class DieCallback(TrainingCallback):
    method __init__ (line 89) | def __init__(self, ft_manager: ActorHandle, training_delay: float = 0):
    method before_iteration (line 94) | def before_iteration(self, model, epoch, evals_log):
    method after_iteration (line 103) | def after_iteration(self, model, epoch, evals_log):

FILE: xgboost_ray/tests/release/benchmark_cpu_gpu.py
  function train_ray (line 22) | def train_ray(

FILE: xgboost_ray/tests/release/benchmark_ft.py
  function train_ray (line 32) | def train_ray(
  function ft_setup (line 160) | def ft_setup(
  function run_experiments (line 191) | def run_experiments(config, files, aws):

FILE: xgboost_ray/tests/release/custom_objective_metric.py
  class XGBoostDistributedAPITest (line 6) | class XGBoostDistributedAPITest(XGBoostAPITest):
    method _init_ray (line 7) | def _init_ray(self):

FILE: xgboost_ray/tests/release/tune_placement.py
  class PlacementCallback (line 47) | class PlacementCallback(TrainingCallback):
    method before_training (line 50) | def before_training(self, model):
    method after_iteration (line 55) | def after_iteration(self, model, epoch, evals_log):
  function tune_test (line 62) | def tune_test(

FILE: xgboost_ray/tests/test_client.py
  function start_client_server_4_cpus (line 11) | def start_client_server_4_cpus():
  function start_client_server_5_cpus (line 18) | def start_client_server_5_cpus():
  function start_client_server_5_cpus_modin (line 25) | def start_client_server_5_cpus_modin(monkeypatch):
  function test_simple_train (line 32) | def test_simple_train(start_client_server_4_cpus):
  function test_simple_tune (line 40) | def test_simple_tune(start_client_server_4_cpus):
  function test_simple_dask (line 47) | def test_simple_dask(start_client_server_5_cpus):
  function test_simple_modin (line 54) | def test_simple_modin(start_client_server_5_cpus_modin):
  function test_client_actor_cpus (line 61) | def test_client_actor_cpus(start_client_server_5_cpus):
  function test_simple_ray_dataset (line 88) | def test_simple_ray_dataset(start_client_server_5_cpus):

FILE: xgboost_ray/tests/test_colocation.py
  class _MockQueueActor (line 17) | class _MockQueueActor(_QueueActor):
    method get_node_id (line 18) | def get_node_id(self):
  class _MockEventActor (line 22) | class _MockEventActor(_EventActor):
    method get_node_id (line 23) | def get_node_id(self):
  class TestColocation (line 28) | class TestColocation(unittest.TestCase):
    method setUp (line 29) | def setUp(self) -> None:
    method tearDown (line 59) | def tearDown(self) -> None:
    method test_communication_colocation (line 66) | def test_communication_colocation(self):
    method test_no_tune_spread (line 104) | def test_no_tune_spread(self):
    method test_tune_pack (line 139) | def test_tune_pack(self):
    method test_timeout (line 191) | def test_timeout(self):

FILE: xgboost_ray/tests/test_data_source.py
  class _DistributedDataSourceTest (line 16) | class _DistributedDataSourceTest:
    method setUp (line 17) | def setUp(self):
    method tearDown (line 24) | def tearDown(self) -> None:
    method _init_ray (line 28) | def _init_ray(self):
    method _testAssignPartitions (line 32) | def _testAssignPartitions(self, part_nodes, actor_nodes, expected_acto...
    method _testDataSourceAssignment (line 35) | def _testDataSourceAssignment(self, part_nodes, actor_nodes, expected_...
    method testAssignEvenTrivial (line 38) | def testAssignEvenTrivial(self):
    method testAssignEvenRedistributeOne (line 56) | def testAssignEvenRedistributeOne(self):
    method testAssignEvenRedistributeMost (line 75) | def testAssignEvenRedistributeMost(self):
    method testAssignUnevenTrivial (line 96) | def testAssignUnevenTrivial(self):
    method testAssignUnevenRedistribute (line 112) | def testAssignUnevenRedistribute(self):
    method testAssignUnevenRedistributeColocated (line 129) | def testAssignUnevenRedistributeColocated(self):
    method testAssignUnevenRedistributeAll (line 146) | def testAssignUnevenRedistributeAll(self):
  class ModinDataSourceTest (line 167) | class ModinDataSourceTest(_DistributedDataSourceTest, unittest.TestCase):
    method _testAssignPartitions (line 170) | def _testAssignPartitions(self, part_nodes, actor_nodes, expected_acto...
    method _getActorToParts (line 190) | def _getActorToParts(self, actors_to_node, node_to_part):
    method _testDataSourceAssignment (line 209) | def _testDataSourceAssignment(self, part_nodes, actor_nodes, expected_...
  class DaskDataSourceTest (line 295) | class DaskDataSourceTest(_DistributedDataSourceTest, unittest.TestCase):
    method _testAssignPartitions (line 298) | def _testAssignPartitions(self, part_nodes, actor_nodes, expected_acto...
    method _getActorToParts (line 320) | def _getActorToParts(self, actors_to_node, node_to_part):
    method _testDataSourceAssignment (line 344) | def _testDataSourceAssignment(self, part_nodes, actor_nodes, expected_...
  class PartitionedSourceTest (line 439) | class PartitionedSourceTest(_DistributedDataSourceTest, unittest.TestCase):
    method _testAssignPartitions (line 440) | def _testAssignPartitions(self, part_nodes, actor_nodes, expected_acto...
    method _mk_partitioned (line 463) | def _mk_partitioned(self, part_to_node, nr, nc, shapes):
    method _getActorToParts (line 490) | def _getActorToParts(self, actors_to_node, partitions, part_to_node, p...
    method _testDataSourceAssignment (line 508) | def _testDataSourceAssignment(self, part_nodes, actor_nodes, expected_...

FILE: xgboost_ray/tests/test_end_to_end.py
  function _make_callback (line 19) | def _make_callback(tmpdir: str) -> DistributedCallback:
  class XGBoostRayEndToEndTest (line 56) | class XGBoostRayEndToEndTest(unittest.TestCase):
    method setUp (line 71) | def setUp(self):
    method tearDown (line 92) | def tearDown(self):
    method testSingleTraining (line 96) | def testSingleTraining(self):
    method testHalfTraining (line 105) | def testHalfTraining(self):
    method test_client_actor_cpus (line 141) | def test_client_actor_cpus(self):
    method _testJointTraining (line 162) | def _testJointTraining(self, sharding=RayShardingMode.INTERLEAVED, sof...
    method testJointTrainingInterleaved (line 203) | def testJointTrainingInterleaved(self):
    method testJointTrainingBatch (line 208) | def testJointTrainingBatch(self):
    method testTrainPredict (line 213) | def testTrainPredict(
    method testTrainPredictSoftprob (line 256) | def testTrainPredictSoftprob(self):
    method testTrainPredictRemote (line 262) | def testTrainPredictRemote(self):
    method testTrainPredictClient (line 266) | def testTrainPredictClient(self):
    method testDistributedCallbacksTrainPredict (line 279) | def testDistributedCallbacksTrainPredict(self, init=True, remote=False):
    method testDistributedCallbacksTrainPredictClient (line 307) | def testDistributedCallbacksTrainPredictClient(self):
    method testFailPrintErrors (line 321) | def testFailPrintErrors(self):
    method testKwargsValidation (line 355) | def testKwargsValidation(self):
    method testRanking (line 374) | def testRanking(self):
    method testFeatureWeightsParam (line 429) | def testFeatureWeightsParam(self):

FILE: xgboost_ray/tests/test_fault_tolerance.py
  class _FakeTask (line 30) | class _FakeTask(MagicMock):
    method is_ready (line 33) | def is_ready(self):
  class XGBoostRayFaultToleranceTest (line 37) | class XGBoostRayFaultToleranceTest(unittest.TestCase):
    method setUp (line 43) | def setUp(self):
    method tearDown (line 79) | def tearDown(self) -> None:
    method testTrainingContinuationKilled (line 90) | def testTrainingContinuationKilled(self):
    method testTrainingContinuationElasticKilled (line 125) | def testTrainingContinuationElasticKilled(self):
    method testTrainingContinuationElasticKilledRestarted (line 169) | def testTrainingContinuationElasticKilledRestarted(self):
    method testTrainingContinuationElasticMultiKilled (line 223) | def testTrainingContinuationElasticMultiKilled(self):
    method testTrainingContinuationElasticFailed (line 255) | def testTrainingContinuationElasticFailed(self):
    method testTrainingStop (line 297) | def testTrainingStop(self):
    method testTrainingStopElastic (line 309) | def testTrainingStopElastic(self):
    method testCheckpointContinuationValidity (line 340) | def testCheckpointContinuationValidity(self):
    method testSameResultWithAndWithoutError (line 401) | def testSameResultWithAndWithoutError(self):
    method testMaybeScheduleNewActors (line 454) | def testMaybeScheduleNewActors(self):
    method testFaultToleranceManager (line 587) | def testFaultToleranceManager(self):

FILE: xgboost_ray/tests/test_matrix.py
  class XGBoostRayDMatrixTest (line 21) | class XGBoostRayDMatrixTest(unittest.TestCase):
    method setUp (line 24) | def setUp(self):
    method setUpClass (line 47) | def setUpClass(cls):
    method tearDownClass (line 51) | def tearDownClass(cls):
    method testSameObject (line 54) | def testSameObject(self):
    method testColumnOrdering (line 64) | def testColumnOrdering(self):
    method _testMatrixCreation (line 74) | def _testMatrixCreation(self, in_x, in_y, multi_label=False, **kwargs):
    method testFromNumpy (line 115) | def testFromNumpy(self):
    method testFromPandasDfDf (line 120) | def testFromPandasDfDf(self):
    method testFromPandasDfSeries (line 125) | def testFromPandasDfSeries(self):
    method testFromPandasDfString (line 130) | def testFromPandasDfString(self):
    method testFromModinDfDf (line 135) | def testFromModinDfDf(self):
    method testFromModinDfSeries (line 148) | def testFromModinDfSeries(self):
    method testFromModinDfString (line 161) | def testFromModinDfString(self):
    method testFromDaskDfSeries (line 175) | def testFromDaskDfSeries(self):
    method testFromDaskDfArray (line 189) | def testFromDaskDfArray(self):
    method testFromDaskDfString (line 204) | def testFromDaskDfString(self):
    method testFromPetastormParquetString (line 219) | def testFromPetastormParquetString(self):
    method testFromPetastormMultiParquetString (line 236) | def testFromPetastormMultiParquetString(self):
    method testFromCSVString (line 261) | def testFromCSVString(self):
    method testFromMultiCSVString (line 273) | def testFromMultiCSVString(self):
    method testFromParquetStringMultiLabel (line 294) | def testFromParquetStringMultiLabel(self):
    method testFromParquetString (line 310) | def testFromParquetString(self):
    method testFromMultiParquetStringMultiLabel (line 321) | def testFromMultiParquetStringMultiLabel(self):
    method testFromMultiParquetString (line 343) | def testFromMultiParquetString(self):
    method testDetectDistributed (line 364) | def testDetectDistributed(self):
    method testTooManyActorsDistributed (line 393) | def testTooManyActorsDistributed(self):
    method testTooManyActorsCentral (line 399) | def testTooManyActorsCentral(self):
    method testBatchShardingAllActorsGetIndices (line 406) | def testBatchShardingAllActorsGetIndices(self):
    method testLegacyParams (line 411) | def testLegacyParams(self):
    method testFeatureWeightsParam (line 440) | def testFeatureWeightsParam(self):
    method testQidSortedBehaviorXGBoost (line 451) | def testQidSortedBehaviorXGBoost(self):
    method testQidSortedParquet (line 473) | def testQidSortedParquet(self):

FILE: xgboost_ray/tests/test_sklearn.py
  function softmax (line 53) | def softmax(x):
  function softprob_obj (line 58) | def softprob_obj(classes):
  function get_basescore (line 81) | def get_basescore(model: xgb.XGBModel) -> float:
  class TemporaryDirectory (line 91) | class TemporaryDirectory(object):
    method __enter__ (line 94) | def __enter__(self):
    method __exit__ (line 98) | def __exit__(self, exc_type, exc_value, traceback):
  class XGBoostRaySklearnTest (line 102) | class XGBoostRaySklearnTest(unittest.TestCase):
    method setUp (line 103) | def setUp(self):
    method tearDown (line 107) | def tearDown(self) -> None:
    method _init_ray (line 111) | def _init_ray(self):
    method run_binary_classification (line 115) | def run_binary_classification(self, cls, ray_dmatrix_params=None):
    method test_binary_classification (line 143) | def test_binary_classification(self):
    method test_binary_classification_dmatrix_params (line 146) | def test_binary_classification_dmatrix_params(self):
    method test_binary_rf_classification (line 156) | def test_binary_rf_classification(self):
    method test_multiclass_classification (line 159) | def test_multiclass_classification(self):
    method test_stacking_regression (line 210) | def test_stacking_regression(self):
    method test_stacking_classification (line 231) | def test_stacking_classification(self):
    method test_select_feature (line 262) | def test_select_feature(self):
    method test_num_parallel_tree (line 277) | def test_num_parallel_tree(self):
    method test_california_housing_regression (line 315) | def test_california_housing_regression(self):
    method run_california_housing_rf_regression (line 343) | def run_california_housing_rf_regression(self, tree_method):
    method test_california_housing_rf_regression (line 358) | def test_california_housing_rf_regression(self):
    method test_parameter_tuning (line 363) | def test_parameter_tuning(self):
    method test_regression_with_custom_objective (line 383) | def test_regression_with_custom_objective(self):
    method test_classification_with_custom_objective (line 419) | def test_classification_with_custom_objective(self):
    method test_sklearn_api (line 472) | def test_sklearn_api(self):
    method test_sklearn_api_gblinear (line 493) | def test_sklearn_api_gblinear(self):
    method test_sklearn_random_state (line 518) | def test_sklearn_random_state(self):
    method test_sklearn_n_jobs (line 535) | def test_sklearn_n_jobs(self):
    method test_parameters_access (line 548) | def test_parameters_access(self):
    method test_kwargs_error (line 574) | def test_kwargs_error(self):
    method test_kwargs_grid_search (line 582) | def test_kwargs_grid_search(self):
    method test_sklearn_clone (line 603) | def test_sklearn_clone(self):
    method test_sklearn_get_default_params (line 616) | def test_sklearn_get_default_params(self):
    method test_validation_weights_xgbmodel (line 634) | def test_validation_weights_xgbmodel(self):
    method test_validation_weights_xgbclassifier (line 711) | def test_validation_weights_xgbclassifier(self):
    method save_load_model (line 768) | def save_load_model(self, model_path):
    method test_save_load_model (line 808) | def test_save_load_model(self):
    method test_XGBClassifier_resume (line 913) | def test_XGBClassifier_resume(self):
    method test_constraint_parameters (line 957) | def test_constraint_parameters(self):
    method test_pandas_input (line 1057) | def test_pandas_input(self):
    method run_boost_from_prediction (line 1155) | def run_boost_from_prediction(self, tree_method):
    method boost_from_prediction (line 1187) | def boost_from_prediction(self, tree_method):
    method test_boost_from_prediction_hist (line 1196) | def test_boost_from_prediction_hist(self):
    method test_boost_from_prediction_approx (line 1203) | def test_boost_from_prediction_approx(self):
    method test_boost_from_prediction_exact (line 1208) | def test_boost_from_prediction_exact(self):
    method test_estimator_type (line 1216) | def test_estimator_type(self):
    method test_ranking (line 1240) | def test_ranking(self):

FILE: xgboost_ray/tests/test_sklearn_matrix.py
  class XGBoostRaySklearnMatrixTest (line 17) | class XGBoostRaySklearnMatrixTest(unittest.TestCase):
    method setUp (line 18) | def setUp(self):
    method tearDown (line 23) | def tearDown(self) -> None:
    method _init_ray (line 27) | def _init_ray(self):
    method testClassifierNoLabelEncoder (line 34) | def testClassifierNoLabelEncoder(self, n_class=2):
    method testClassifierMulticlassNoLabelEncoder (line 73) | def testClassifierMulticlassNoLabelEncoder(self):
    method testRegressor (line 76) | def testRegressor(self):

FILE: xgboost_ray/tests/test_tune.py
  class XGBoostRayTuneTest (line 19) | class XGBoostRayTuneTest(unittest.TestCase):
    method setUp (line 20) | def setUp(self):
    method tearDown (line 71) | def tearDown(self):
    method testNumIters (line 77) | def testNumIters(self):
    method testNumItersClient (line 104) | def testNumItersClient(self):
    method testPlacementOptions (line 116) | def testPlacementOptions(self):
    method testElasticFails (line 127) | def testElasticFails(self):
    method testReplaceTuneCheckpoints (line 138) | def testReplaceTuneCheckpoints(self):
    method testEndToEndCheckpointing (line 153) | def testEndToEndCheckpointing(self):
    method testEndToEndCheckpointingOrigTune (line 170) | def testEndToEndCheckpointingOrigTune(self):

FILE: xgboost_ray/tests/test_xgboost_api.py
  function gradient (line 16) | def gradient(predt: np.ndarray, dtrain: xgb.DMatrix) -> np.ndarray:
  function hessian (line 21) | def hessian(predt: np.ndarray, dtrain: xgb.DMatrix) -> np.ndarray:
  function squared_log (line 26) | def squared_log(
  function rmsle (line 35) | def rmsle(predt: np.ndarray, dtrain: xgb.DMatrix) -> Tuple[str, float]:
  class XGBoostAPITest (line 42) | class XGBoostAPITest(unittest.TestCase):
    method setUp (line 45) | def setUp(self):
    method tearDown (line 69) | def tearDown(self) -> None:
    method _init_ray (line 73) | def _init_ray(self):
    method testCustomObjectiveFunction (line 77) | def testCustomObjectiveFunction(self):
    method testCustomMetricFunction (line 104) | def testCustomMetricFunction(self):
    method testCallbacks (line 154) | def testCallbacks(self):

FILE: xgboost_ray/tests/utils.py
  function get_num_trees (line 15) | def get_num_trees(bst: xgb.Booster):
  function create_data (line 22) | def create_data(num_rows: int, num_cols: int, dtype: np.dtype = np.float...
  function create_labels (line 31) | def create_labels(
  function create_parquet (line 47) | def create_parquet(
  function create_parquet_in_tempdir (line 74) | def create_parquet_in_tempdir(
  function flatten_obj (line 93) | def flatten_obj(obj: Union[List, Dict], keys=None, base=None):
  function tree_obj (line 107) | def tree_obj(bst: xgb.Booster):
  function _kill_callback (line 111) | def _kill_callback(die_lock_file: str, actor_rank: int = 0, fail_iterati...
  function _fail_callback (line 145) | def _fail_callback(die_lock_file: str, actor_rank: int = 0, fail_iterati...
  function _checkpoint_callback (line 179) | def _checkpoint_callback(frequency: int = 1, before_iteration_=False):
  function _sleep_callback (line 205) | def _sleep_callback(sleep_iteration: int = 6, sleep_seconds: int = 5):

FILE: xgboost_ray/tune.py
  class TuneReportCheckpointCallback (line 26) | class TuneReportCheckpointCallback(OrigTuneReportCheckpointCallback):
    method after_iteration (line 27) | def after_iteration(self, model, epoch: int, evals_log: Dict):
    method after_training (line 41) | def after_training(self, model):
  class TuneReportCallback (line 51) | class TuneReportCallback(OrigTuneReportCallback):
    method __new__ (line 52) | def __new__(cls: type, *args, **kwargs):
  function _try_add_tune_callback (line 60) | def _try_add_tune_callback(kwargs: Dict):
  function _get_tune_resources (line 107) | def _get_tune_resources(
  function load_model (line 130) | def load_model(model_path):

FILE: xgboost_ray/util.py
  class Unavailable (line 9) | class Unavailable:
    method __init__ (line 12) | def __init__(self):
  class _EventActor (line 16) | class _EventActor:
    method __init__ (line 17) | def __init__(self):
    method set (line 20) | def set(self):
    method clear (line 23) | def clear(self):
    method is_set (line 26) | def is_set(self):
  class Event (line 31) | class Event:
    method __init__ (line 32) | def __init__(self, actor_options: Optional[Dict] = None):
    method set (line 36) | def set(self):
    method clear (line 39) | def clear(self):
    method is_set (line 42) | def is_set(self):
    method shutdown (line 45) | def shutdown(self):
  class MultiActorTask (line 52) | class MultiActorTask:
    method __init__ (line 62) | def __init__(self, pending_futures: Optional[List[ray.ObjectRef]] = No...
    method is_ready (line 66) | def is_ready(self):
  function get_current_node_resource_key (line 82) | def get_current_node_resource_key() -> str:
  function force_on_current_node (line 100) | def force_on_current_node(task_or_actor):

Condensed preview — 85 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (544K chars).

[
  {
    "path": ".flake8",
    "chars": 519,
    "preview": "[flake8]\nmax-line-length = 88\ninline-quotes = \"\nignore =\n  C408\n  C417\n  E121\n  E123\n  E126\n  E203\n  E226\n  E24\n  E704\n "
  },
  {
    "path": ".github/workflows/gpu.yaml",
    "chars": 916,
    "preview": "name: GPU on manual trigger\n\non:\n  workflow_dispatch\n\njobs:\n  test_gpu:\n    runs-on: ubuntu-latest\n    timeout-minutes: "
  },
  {
    "path": ".github/workflows/test.yaml",
    "chars": 8546,
    "preview": "name: pytest on push\n\non:\n  push:\n  pull_request:\n  schedule:\n    - cron: \"0 5 * * *\"\n\njobs:\n  test_lint:\n    runs-on: u"
  },
  {
    "path": ".gitignore",
    "chars": 1005,
    "preview": "# Python byte code files\n*.pyc\npython/.eggs\n\n# Backup files\n*.bak\n\n# Emacs temporary files\n*~\n*#\n\n# Debug symbols\n*.pdb\n"
  },
  {
    "path": "LICENSE",
    "chars": 15673,
    "preview": "                                 Apache License\n                           Version 2.0, January 2004\n                   "
  },
  {
    "path": "README.md",
    "chars": 21798,
    "preview": "<!--$UNCOMMENT(xgboost-ray)=-->\n\n# Distributed XGBoost on Ray\n<!--$REMOVE-->\n![Build Status](https://github.com/ray-proj"
  },
  {
    "path": "format.sh",
    "chars": 11432,
    "preview": "#!/usr/bin/env bash\n# Black + Clang formatter (if installed). This script formats all changed files from the last mergeb"
  },
  {
    "path": "requirements/lint-requirements.txt",
    "chars": 144,
    "preview": "flake8==3.9.1\nflake8-comprehensions==3.10.1\nflake8-quotes==2.0.0\nflake8-bugbear==21.9.2\nblack==22.10.0\nisort==5.10.1\nimp"
  },
  {
    "path": "requirements/test-requirements.txt",
    "chars": 334,
    "preview": "packaging\npetastorm\npytest\npyarrow<15.0.0\nray[tune, data, default]\nscikit-learn\n# modin==0.23.1.post0 is not compatible "
  },
  {
    "path": "run_ci_examples.sh",
    "chars": 1265,
    "preview": "#!/bin/bash\nset -e\n\nTUNE=1\n\nfor i in \"$@\"\ndo\necho \"$i\"\ncase \"$i\" in\n    --no-tune)\n    TUNE=0\n    ;;\n    *)\n    echo \"un"
  },
  {
    "path": "run_ci_tests.sh",
    "chars": 1408,
    "preview": "#!/bin/bash\nTUNE=1\n\nfor i in \"$@\"\ndo\necho \"$i\"\ncase \"$i\" in\n    --no-tune)\n    TUNE=0\n    ;;\n    *)\n    echo \"unknown ar"
  },
  {
    "path": "setup.py",
    "chars": 603,
    "preview": "from setuptools import find_packages, setup\n\nsetup(\n    name=\"xgboost_ray\",\n    packages=find_packages(where=\".\", includ"
  },
  {
    "path": "xgboost_ray/__init__.py",
    "chars": 793,
    "preview": "from xgboost_ray.main import RayParams, predict, train\nfrom xgboost_ray.matrix import (\n    Data,\n    RayDeviceQuantileD"
  },
  {
    "path": "xgboost_ray/callback.py",
    "chars": 3581,
    "preview": "import os\nfrom abc import ABC\nfrom typing import TYPE_CHECKING, Any, Dict, Sequence, Union\n\nimport pandas as pd\nfrom ray"
  },
  {
    "path": "xgboost_ray/compat/__init__.py",
    "chars": 1613,
    "preview": "from typing import TYPE_CHECKING\n\nif TYPE_CHECKING:\n    from xgboost_ray.xgb import xgboost as xgb\n\ntry:\n    from xgboos"
  },
  {
    "path": "xgboost_ray/compat/tracker.py",
    "chars": 12320,
    "preview": "# flake8: noqa\n\n# Copyright 2021 by XGBoost Contributors\n#\n# Licensed under the Apache License, Version 2.0 (the \"Licens"
  },
  {
    "path": "xgboost_ray/data_sources/__init__.py",
    "chars": 961,
    "preview": "from xgboost_ray.data_sources.csv import CSV\nfrom xgboost_ray.data_sources.dask import Dask\nfrom xgboost_ray.data_source"
  },
  {
    "path": "xgboost_ray/data_sources/_distributed.py",
    "chars": 4733,
    "preview": "import itertools\nimport math\nfrom collections import defaultdict\nfrom typing import Any, Dict, Sequence\n\nimport ray\nfrom"
  },
  {
    "path": "xgboost_ray/data_sources/csv.py",
    "chars": 1479,
    "preview": "from typing import Any, Iterable, Optional, Sequence, Union\n\nimport pandas as pd\n\nfrom xgboost_ray.data_sources.data_sou"
  },
  {
    "path": "xgboost_ray/data_sources/dask.py",
    "chars": 5518,
    "preview": "from collections import defaultdict\nfrom typing import Any, Dict, List, Optional, Sequence, Tuple, Union\n\nimport pandas "
  },
  {
    "path": "xgboost_ray/data_sources/data_source.py",
    "chars": 4741,
    "preview": "from enum import Enum\nfrom typing import TYPE_CHECKING, Any, Dict, List, Optional, Sequence, Tuple, Union\n\nimport pandas"
  },
  {
    "path": "xgboost_ray/data_sources/modin.py",
    "chars": 4936,
    "preview": "from collections import defaultdict\nfrom typing import Any, Dict, Optional, Sequence, Tuple, Union\n\nimport pandas as pd\n"
  },
  {
    "path": "xgboost_ray/data_sources/numpy.py",
    "chars": 1058,
    "preview": "from typing import TYPE_CHECKING, Any, List, Optional, Sequence\n\nimport numpy as np\nimport pandas as pd\n\nfrom xgboost_ra"
  },
  {
    "path": "xgboost_ray/data_sources/object_store.py",
    "chars": 1237,
    "preview": "from typing import Any, Optional, Sequence\n\nimport pandas as pd\nimport ray\nfrom ray import ObjectRef\n\nfrom xgboost_ray.d"
  },
  {
    "path": "xgboost_ray/data_sources/pandas.py",
    "chars": 770,
    "preview": "from typing import Any, Optional, Sequence\n\nimport pandas as pd\n\nfrom xgboost_ray.data_sources.data_source import DataSo"
  },
  {
    "path": "xgboost_ray/data_sources/parquet.py",
    "chars": 1497,
    "preview": "from typing import Any, Iterable, Optional, Sequence, Union\n\nimport pandas as pd\n\nfrom xgboost_ray.data_sources.data_sou"
  },
  {
    "path": "xgboost_ray/data_sources/partitioned.py",
    "chars": 3664,
    "preview": "from collections import defaultdict\nfrom typing import Any, Dict, Optional, Sequence, Tuple\n\nimport numpy as np\nimport p"
  },
  {
    "path": "xgboost_ray/data_sources/petastorm.py",
    "chars": 2641,
    "preview": "from typing import Any, List, Optional, Sequence, Union\n\nimport pandas as pd\n\nfrom xgboost_ray.data_sources.data_source "
  },
  {
    "path": "xgboost_ray/data_sources/ray_dataset.py",
    "chars": 3549,
    "preview": "from typing import Any, Dict, Optional, Sequence, Tuple, Union\n\nimport pandas as pd\nimport ray\nfrom ray.actor import Act"
  },
  {
    "path": "xgboost_ray/elastic.py",
    "chars": 5915,
    "preview": "import time\nfrom typing import Callable, Dict, List, Optional, Tuple\n\nimport ray\n\nfrom xgboost_ray.main import (\n    ENV"
  },
  {
    "path": "xgboost_ray/examples/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "xgboost_ray/examples/create_test_data.py",
    "chars": 261,
    "preview": "from xgboost_ray.tests.utils import create_parquet\n\n\ndef main():\n    create_parquet(\n        \"example.parquet\",\n        "
  },
  {
    "path": "xgboost_ray/examples/higgs.py",
    "chars": 2120,
    "preview": "import os\nimport time\n\nfrom xgboost_ray import RayDMatrix, RayParams, train\n\nFILENAME_CSV = \"HIGGS.csv.gz\"\n\n\ndef downloa"
  },
  {
    "path": "xgboost_ray/examples/higgs_parquet.py",
    "chars": 3672,
    "preview": "import os\nimport time\n\nimport pandas as pd\nimport pyarrow as pa\nimport pyarrow.parquet as pq\nfrom higgs import download_"
  },
  {
    "path": "xgboost_ray/examples/readme.py",
    "chars": 2801,
    "preview": "# flake8: noqa E501\n\n\ndef readme_simple():\n    from sklearn.datasets import load_breast_cancer\n\n    from xgboost_ray imp"
  },
  {
    "path": "xgboost_ray/examples/readme_sklearn_api.py",
    "chars": 1281,
    "preview": "def readme_sklearn_api():\n    from sklearn.datasets import load_breast_cancer\n    from sklearn.model_selection import tr"
  },
  {
    "path": "xgboost_ray/examples/simple.py",
    "chars": 2285,
    "preview": "import argparse\n\nimport ray\nfrom sklearn import datasets\nfrom sklearn.model_selection import train_test_split\n\nfrom xgbo"
  },
  {
    "path": "xgboost_ray/examples/simple_dask.py",
    "chars": 2803,
    "preview": "import argparse\n\nimport numpy as np\nimport pandas as pd\nimport ray\n\nfrom xgboost_ray import RayDMatrix, RayParams, train"
  },
  {
    "path": "xgboost_ray/examples/simple_modin.py",
    "chars": 2898,
    "preview": "import argparse\n\nimport numpy as np\nimport pandas as pd\nimport ray\n\nfrom xgboost_ray import RayDMatrix, RayParams, train"
  },
  {
    "path": "xgboost_ray/examples/simple_objectstore.py",
    "chars": 2454,
    "preview": "import argparse\n\nimport numpy as np\nimport pandas as pd\nimport ray\n\nfrom xgboost_ray import RayDMatrix, RayParams, train"
  },
  {
    "path": "xgboost_ray/examples/simple_partitioned.py",
    "chars": 3919,
    "preview": "import argparse\n\nimport numpy as np\nimport ray\nfrom sklearn import datasets\nfrom sklearn.model_selection import train_te"
  },
  {
    "path": "xgboost_ray/examples/simple_predict.py",
    "chars": 815,
    "preview": "import os\n\nimport numpy as np\nimport xgboost as xgb\nfrom sklearn import datasets\n\nfrom xgboost_ray import RayDMatrix, Ra"
  },
  {
    "path": "xgboost_ray/examples/simple_ray_dataset.py",
    "chars": 2764,
    "preview": "import argparse\n\nimport numpy as np\nimport pandas as pd\nimport ray\nfrom xgboost import DMatrix\n\nfrom xgboost_ray import "
  },
  {
    "path": "xgboost_ray/examples/simple_tune.py",
    "chars": 3353,
    "preview": "import argparse\nimport os\n\nimport ray\nfrom ray import tune\nfrom sklearn import datasets\nfrom sklearn.model_selection imp"
  },
  {
    "path": "xgboost_ray/examples/train_on_test_data.py",
    "chars": 1842,
    "preview": "import argparse\nimport os\nimport shutil\nimport time\n\nfrom xgboost_ray import RayDMatrix, RayParams, train\nfrom xgboost_r"
  },
  {
    "path": "xgboost_ray/examples/train_with_ml_dataset.py",
    "chars": 1931,
    "preview": "import argparse\nimport os\nimport shutil\nimport time\n\nfrom ray.util.data import read_parquet\n\nfrom xgboost_ray import Ray"
  },
  {
    "path": "xgboost_ray/main.py",
    "chars": 66933,
    "preview": "import functools\nimport inspect\nimport multiprocessing\nimport os\nimport pickle\nimport platform\nimport threading\nimport t"
  },
  {
    "path": "xgboost_ray/matrix.py",
    "chars": 42222,
    "preview": "import glob\nimport uuid\nfrom enum import Enum\nfrom typing import (\n    TYPE_CHECKING,\n    Callable,\n    Dict,\n    Iterab"
  },
  {
    "path": "xgboost_ray/session.py",
    "chars": 2109,
    "preview": "from typing import Optional\n\nfrom ray.util.annotations import DeveloperAPI, PublicAPI\nfrom ray.util.queue import Queue\n\n"
  },
  {
    "path": "xgboost_ray/sklearn.py",
    "chars": 35480,
    "preview": "\"\"\"scikit-learn wrapper for xgboost-ray. Based on xgboost 1.4.0\nsklearn wrapper, with some provisions made for 1.5.0 and"
  },
  {
    "path": "xgboost_ray/tests/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "xgboost_ray/tests/conftest.py",
    "chars": 2204,
    "preview": "from contextlib import contextmanager\r\nfrom functools import partial\r\n\r\nimport pytest\r\nimport ray\r\n\r\ntry:\r\n    # Ray 1.3"
  },
  {
    "path": "xgboost_ray/tests/env_info.sh",
    "chars": 414,
    "preview": "#!/bin/bash\n# shellcheck disable=SC2005\n\necho \"Test environment information\"\necho \"----------------------------\"\necho \"P"
  },
  {
    "path": "xgboost_ray/tests/fault_tolerance.py",
    "chars": 4143,
    "preview": "import os\nimport time\nfrom collections import defaultdict\nfrom typing import Dict, Set, Tuple\n\nimport ray\nfrom ray.actor"
  },
  {
    "path": "xgboost_ray/tests/release/benchmark_cpu_gpu.py",
    "chars": 5416,
    "preview": "import argparse\nimport glob\nimport os\nimport shutil\nimport time\n\nimport ray\n\nfrom xgboost_ray import (\n    RayDeviceQuan"
  },
  {
    "path": "xgboost_ray/tests/release/benchmark_ft.py",
    "chars": 14740,
    "preview": "import argparse\nimport glob\nimport os\nfrom typing import Dict, List\n\nimport numpy as np\nimport ray\nfrom ray import tune\n"
  },
  {
    "path": "xgboost_ray/tests/release/cluster_cpu.yaml",
    "chars": 995,
    "preview": "cluster_name: xgboost_ray_release_tests_cpu_{{env[\"NUM_WORKERS\"] | default(0)}}\n\nmax_workers: {{env[\"NUM_WORKERS\"] | def"
  },
  {
    "path": "xgboost_ray/tests/release/cluster_ft.yaml",
    "chars": 963,
    "preview": "cluster_name: xgboost_ray_release_tests_ft_cluster\n\nmax_workers: 9\n\nupscaling_speed: 32\n\nidle_timeout_minutes: 15\n\ndocke"
  },
  {
    "path": "xgboost_ray/tests/release/cluster_gpu.yaml",
    "chars": 1126,
    "preview": "cluster_name: xgboost_ray_release_tests_gpu_{{env[\"NUM_WORKERS\"] | default(0)}}\n\nmax_workers: {{env[\"NUM_WORKERS\"] | def"
  },
  {
    "path": "xgboost_ray/tests/release/create_learnable_data.py",
    "chars": 3079,
    "preview": "import argparse\nimport os\n\nimport numpy as np\nimport pandas as pd\nfrom sklearn.datasets import make_classification, make"
  },
  {
    "path": "xgboost_ray/tests/release/create_test_data.py",
    "chars": 1351,
    "preview": "import argparse\nimport os\n\nimport numpy as np\n\nfrom xgboost_ray.tests.utils import create_parquet\n\nif __name__ == \"__mai"
  },
  {
    "path": "xgboost_ray/tests/release/custom_objective_metric.py",
    "chars": 364,
    "preview": "import ray\n\nfrom xgboost_ray.tests.test_xgboost_api import XGBoostAPITest\n\n\nclass XGBoostDistributedAPITest(XGBoostAPITe"
  },
  {
    "path": "xgboost_ray/tests/release/run_e2e_gpu.sh",
    "chars": 491,
    "preview": "#!/bin/bash\n\nif [ ! -f \"./.anyscale.yaml\" ]; then\n  echo \"Anyscale project not initialized. Please run 'anyscale init'\"\n"
  },
  {
    "path": "xgboost_ray/tests/release/setup_xgboost.sh",
    "chars": 559,
    "preview": "#!/bin/bash\n\npip install pytest\n# Uninstall any existing xgboost_ray repositories\npip uninstall -y xgboost_ray || true\n\n"
  },
  {
    "path": "xgboost_ray/tests/release/start_cpu_cluster.sh",
    "chars": 604,
    "preview": "#!/bin/bash\n\nif [ ! -f \"./.anyscale.yaml\" ]; then\n  echo \"Anyscale project not initialized. Please run 'anyscale init'\"\n"
  },
  {
    "path": "xgboost_ray/tests/release/start_ft_cluster.sh",
    "chars": 508,
    "preview": "#!/bin/bash\n\nif [ ! -f \"./.anyscale.yaml\" ]; then\n  echo \"Anyscale project not initialized. Please run 'anyscale init'\"\n"
  },
  {
    "path": "xgboost_ray/tests/release/start_gpu_cluster.sh",
    "chars": 604,
    "preview": "#!/bin/bash\n\nif [ ! -f \"./.anyscale.yaml\" ]; then\n  echo \"Anyscale project not initialized. Please run 'anyscale init'\"\n"
  },
  {
    "path": "xgboost_ray/tests/release/submit_cpu_gpu_benchmark.sh",
    "chars": 449,
    "preview": "#!/bin/bash\n\nif [ ! -f \"./.anyscale.yaml\" ]; then\n  echo \"Anyscale project not initialized. Please run 'anyscale init'\"\n"
  },
  {
    "path": "xgboost_ray/tests/release/submit_ft_benchmark.sh",
    "chars": 444,
    "preview": "#!/bin/bash\n\nif [ ! -f \"./.anyscale.yaml\" ]; then\n  echo \"Anyscale project not initialized. Please run 'anyscale init'\"\n"
  },
  {
    "path": "xgboost_ray/tests/release/tune_cluster.yaml",
    "chars": 1710,
    "preview": "cluster_name: xgboost_ray_release_tests_tune\nmin_workers: 4\nmax_workers: 4\ninitial_workers: 4\nautoscaling_mode: default\n"
  },
  {
    "path": "xgboost_ray/tests/release/tune_placement.py",
    "chars": 7410,
    "preview": "\"\"\"\nNOTE: This example is currently broken (very outdated) and not run in CI.\n\nTest Ray Tune trial placement across clus"
  },
  {
    "path": "xgboost_ray/tests/test_client.py",
    "chars": 2802,
    "preview": "import os\n\nimport pytest\nimport ray\nfrom ray.util.client.ray_client_helpers import ray_start_client_server\n\nfrom xgboost"
  },
  {
    "path": "xgboost_ray/tests/test_colocation.py",
    "chars": 7735,
    "preview": "import os\nimport shutil\nimport tempfile\nimport unittest\nfrom unittest.mock import patch\n\nimport numpy as np\nimport pytes"
  },
  {
    "path": "xgboost_ray/tests/test_data_source.py",
    "chars": 20641,
    "preview": "import unittest\nfrom typing import List, Sequence\nfrom unittest.mock import patch\n\nimport numpy as np\nimport pandas as p"
  },
  {
    "path": "xgboost_ray/tests/test_end_to_end.py",
    "chars": 16292,
    "preview": "import os\nimport shutil\nimport tempfile\nimport unittest\n\nimport numpy as np\nimport ray\nimport xgboost as xgb\nfrom ray.ex"
  },
  {
    "path": "xgboost_ray/tests/test_fault_tolerance.py",
    "chars": 22157,
    "preview": "import logging\nimport os\nimport shutil\nimport tempfile\nimport time\nimport unittest\nfrom unittest.mock import DEFAULT, Ma"
  },
  {
    "path": "xgboost_ray/tests/test_matrix.py",
    "chars": 17247,
    "preview": "import inspect\nimport os\nimport tempfile\nimport unittest\n\nimport numpy as np\nimport pandas as pd\nimport ray\nimport xgboo"
  },
  {
    "path": "xgboost_ray/tests/test_sklearn.py",
    "chars": 45475,
    "preview": "\"\"\"Copied almost verbatim from https://github.com/dmlc/xgboost/blob/a5c852660b1056204aa2e0cbfcd5b4ecfbf31adf/tests/pytho"
  },
  {
    "path": "xgboost_ray/tests/test_sklearn_matrix.py",
    "chars": 3406,
    "preview": "import unittest\n\nimport numpy as np\nimport ray\nimport xgboost as xgb\nfrom packaging.version import Version\nfrom sklearn."
  },
  {
    "path": "xgboost_ray/tests/test_tune.py",
    "chars": 6552,
    "preview": "import os\nimport shutil\nimport tempfile\nimport unittest\nfrom unittest.mock import MagicMock, patch\n\nimport numpy as np\ni"
  },
  {
    "path": "xgboost_ray/tests/test_xgboost_api.py",
    "chars": 5578,
    "preview": "import unittest\nfrom typing import Tuple\n\nimport numpy as np\nimport ray\nimport xgboost as xgb\n\nfrom xgboost_ray import R"
  },
  {
    "path": "xgboost_ray/tests/utils.py",
    "chars": 6746,
    "preview": "import json\nimport os\nimport tempfile\nimport time\nfrom typing import Dict, List, Optional, Tuple, Union\n\nimport numpy as"
  },
  {
    "path": "xgboost_ray/tune.py",
    "chars": 5445,
    "preview": "import logging\nfrom typing import Dict, Optional\n\nimport ray\nfrom ray.util.annotations import PublicAPI\n\nfrom xgboost_ra"
  },
  {
    "path": "xgboost_ray/util.py",
    "chars": 3037,
    "preview": "import asyncio\nfrom typing import Dict, List, Optional\n\nimport ray\nfrom ray.util.annotations import DeveloperAPI\n\n\n@Deve"
  },
  {
    "path": "xgboost_ray/xgb.py",
    "chars": 179,
    "preview": "from typing import TYPE_CHECKING\n\nif TYPE_CHECKING:\n    import xgboost\nelse:\n    try:\n        import xgboost\n    except "
  }
]

About this extraction

This page contains the full source code of the ray-project/xgboost_ray GitHub repository, extracted and formatted as plain text. The extraction includes 85 files (505.3 KB), approximately 124.4k tokens, and a symbol index with 564 extracted functions, classes, methods, constants, and types.

