Full Code of NVIDIA/spark-rapids-examples for AI

Repository: NVIDIA/spark-rapids-examples
Branch: main
Commit: 162959461bf8
Files: 277
Total size: 3.5 MB

Directory structure:
spark-rapids-examples/

├── .github/
│   ├── ISSUE_TEMPLATE/
│   │   └── bug_report.md
│   └── workflows/
│       ├── add-to-project.yml
│       ├── license-header-check.yml
│       ├── markdown-links-check/
│       │   └── markdown-links-check-config.json
│       ├── markdown-links-check.yml
│       ├── shell-check.yml
│       └── signoff-check.yml
├── .gitignore
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── dockerfile/
│   ├── Dockerfile
│   └── gpu_executor_template.yaml
├── docs/
│   ├── get-started/
│   │   └── xgboost-examples/
│   │       ├── building-sample-apps/
│   │       │   ├── python.md
│   │       │   └── scala.md
│   │       ├── csp/
│   │       │   ├── aws/
│   │       │   │   └── ec2.md
│   │       │   ├── databricks/
│   │       │   │   ├── databricks.md
│   │       │   │   └── init.sh
│   │       │   └── dataproc/
│   │       │       └── gcp.md
│   │       ├── dataset/
│   │       │   └── mortgage.md
│   │       ├── notebook/
│   │       │   ├── python-notebook.md
│   │       │   ├── spylon.md
│   │       │   └── toree.md
│   │       ├── on-prem-cluster/
│   │       │   ├── kubernetes-scala.md
│   │       │   ├── standalone-python.md
│   │       │   ├── standalone-scala.md
│   │       │   ├── yarn-python.md
│   │       │   └── yarn-scala.md
│   │       └── prepare-package-data/
│   │           ├── preparation-python.md
│   │           └── preparation-scala.md
│   └── trouble-shooting/
│       └── xgboost-examples-trouble-shooting.md
├── examples/
│   ├── MIG-Support/
│   │   ├── README.md
│   │   ├── device-plugins/
│   │   │   └── gpu-mig/
│   │   │       ├── README.md
│   │   │       ├── pom.xml
│   │   │       ├── scripts/
│   │   │       │   └── getMIGGPUs
│   │   │       └── src/
│   │   │           ├── main/
│   │   │           │   └── java/
│   │   │           │       └── com/
│   │   │           │           └── nvidia/
│   │   │           │               └── spark/
│   │   │           │                   └── NvidiaGPUMigPluginForRuntimeV2.java
│   │   │           └── test/
│   │   │               └── java/
│   │   │                   └── com/
│   │   │                       └── nvidia/
│   │   │                           └── spark/
│   │   │                               └── TestNvidiaGPUMigPluginForRuntimeV2.java
│   │   ├── resource-types/
│   │   │   └── gpu-mig/
│   │   │       ├── README.md
│   │   │       ├── yarn312MIG.patch
│   │   │       ├── yarn313to315MIG.patch
│   │   │       └── yarn321to323MIG.patch
│   │   └── yarn-unpatched/
│   │       ├── README.md
│   │       └── scripts/
│   │           ├── mig2gpu.sh
│   │           ├── nvidia-container-cli-wrapper.sh
│   │           └── nvidia-smi
│   ├── ML+DL-Examples/
│   │   ├── Optuna-Spark/
│   │   │   ├── README.md
│   │   │   └── optuna-examples/
│   │   │       ├── databricks/
│   │   │       │   ├── init_optuna.sh
│   │   │       │   └── start_cluster.sh
│   │   │       ├── optuna-dataframe.ipynb
│   │   │       └── optuna-joblibspark.ipynb
│   │   ├── Spark-DL/
│   │   │   └── dl_inference/
│   │   │       ├── README.md
│   │   │       ├── databricks/
│   │   │       │   ├── README.md
│   │   │       │   └── setup/
│   │   │       │       ├── init_spark_dl.sh
│   │   │       │       └── start_cluster.sh
│   │   │       ├── dataproc/
│   │   │       │   ├── README.md
│   │   │       │   └── setup/
│   │   │       │       ├── init_spark_dl.sh
│   │   │       │       └── start_cluster.sh
│   │   │       ├── huggingface/
│   │   │       │   ├── conditional_generation_tf.ipynb
│   │   │       │   ├── conditional_generation_torch.ipynb
│   │   │       │   ├── deepseek-r1_torch.ipynb
│   │   │       │   ├── gemma-7b_torch.ipynb
│   │   │       │   ├── pipelines_tf.ipynb
│   │   │       │   ├── pipelines_torch.ipynb
│   │   │       │   ├── qwen-2.5-7b_torch.ipynb
│   │   │       │   └── sentence_transformers_torch.ipynb
│   │   │       ├── pytorch/
│   │   │       │   ├── housing_regression_torch.ipynb
│   │   │       │   └── image_classification_torch.ipynb
│   │   │       ├── requirements.txt
│   │   │       ├── server_utils.py
│   │   │       ├── tensorflow/
│   │   │       │   ├── image_classification_tf.ipynb
│   │   │       │   ├── keras_preprocessing_tf.ipynb
│   │   │       │   ├── keras_resnet50_tf.ipynb
│   │   │       │   └── text_classification_tf.ipynb
│   │   │       ├── tf_requirements.txt
│   │   │       ├── torch_requirements.txt
│   │   │       ├── vllm/
│   │   │       │   ├── qwen-2.5-14b-tensor-parallel_vllm.ipynb
│   │   │       │   └── qwen-2.5-7b_vllm.ipynb
│   │   │       └── vllm_requirements.txt
│   │   └── Spark-Rapids-ML/
│   │       └── pca/
│   │           ├── README.md
│   │           ├── notebooks/
│   │           │   └── pca.ipynb
│   │           └── start-spark-rapids.sh
│   ├── SQL+DF-Examples/
│   │   ├── customer-churn/
│   │   │   ├── README.md
│   │   │   └── notebooks/
│   │   │       └── python/
│   │   │           ├── README.md
│   │   │           ├── augment.ipynb
│   │   │           ├── churn/
│   │   │           │   ├── augment.py
│   │   │           │   ├── eda.py
│   │   │           │   └── etl.py
│   │   │           └── etl.ipynb
│   │   ├── demo/
│   │   │   ├── Spark_get_json_object.ipynb
│   │   │   └── Spark_parquet_microkernels.ipynb
│   │   ├── micro-benchmarks/
│   │   │   ├── README.md
│   │   │   └── notebooks/
│   │   │       ├── micro-benchmarks-cpu.ipynb
│   │   │       └── micro-benchmarks-gpu.ipynb
│   │   ├── retail-analytics/
│   │   │   ├── README.md
│   │   │   └── notebooks/
│   │   │       └── python/
│   │   │           ├── retail-analytic.ipynb
│   │   │           └── retail-datagen.ipynb
│   │   └── tpcds/
│   │       ├── README.md
│   │       └── notebooks/
│   │           └── TPCDS-SF10.ipynb
│   ├── UDF-Examples/
│   │   └── RAPIDS-accelerated-UDFs/
│   │       ├── Dockerfile
│   │       ├── README.md
│   │       ├── clone-cudf-repo.sh
│   │       ├── conftest.py
│   │       ├── extract-cudf-libs.sh
│   │       ├── pom.xml
│   │       ├── pytest.ini
│   │       ├── run_pyspark_from_build.sh
│   │       ├── runtests.py
│   │       └── src/
│   │           └── main/
│   │               ├── cpp/
│   │               │   ├── CMakeLists.txt
│   │               │   ├── benchmarks/
│   │               │   │   ├── CMakeLists.txt
│   │               │   │   ├── cosine_similarity/
│   │               │   │   │   └── cosine_similarity_benchmark.cpp
│   │               │   │   ├── fixture/
│   │               │   │   │   └── benchmark_fixture.hpp
│   │               │   │   └── synchronization/
│   │               │   │       ├── synchronization.cpp
│   │               │   │       └── synchronization.hpp
│   │               │   └── src/
│   │               │       ├── CosineSimilarityJni.cpp
│   │               │       ├── StringWordCountJni.cpp
│   │               │       ├── cosine_similarity.cu
│   │               │       ├── cosine_similarity.hpp
│   │               │       ├── string_word_count.cu
│   │               │       └── string_word_count.hpp
│   │               ├── java/
│   │               │   └── com/
│   │               │       └── nvidia/
│   │               │           └── spark/
│   │               │               └── rapids/
│   │               │                   └── udf/
│   │               │                       ├── hive/
│   │               │                       │   ├── DecimalFraction.java
│   │               │                       │   ├── StringWordCount.java
│   │               │                       │   ├── URLDecode.java
│   │               │                       │   └── URLEncode.java
│   │               │                       └── java/
│   │               │                           ├── CosineSimilarity.java
│   │               │                           ├── DecimalFraction.java
│   │               │                           ├── NativeUDFExamplesLoader.java
│   │               │                           ├── URLDecode.java
│   │               │                           └── URLEncode.java
│   │               ├── python/
│   │               │   ├── asserts.py
│   │               │   ├── conftest.py
│   │               │   ├── data_gen.py
│   │               │   ├── rapids_udf_test.py
│   │               │   ├── spark_init_internal.py
│   │               │   └── spark_session.py
│   │               └── scala/
│   │                   └── com/
│   │                       └── nvidia/
│   │                           └── spark/
│   │                               └── rapids/
│   │                                   └── udf/
│   │                                       └── scala/
│   │                                           ├── URLDecode.scala
│   │                                           └── URLEncode.scala
│   ├── XGBoost-Examples/
│   │   ├── .gitignore
│   │   ├── README.md
│   │   ├── agaricus/
│   │   │   ├── .gitignore
│   │   │   ├── notebooks/
│   │   │   │   ├── python/
│   │   │   │   │   └── agaricus-gpu.ipynb
│   │   │   │   └── scala/
│   │   │   │       └── agaricus-gpu.ipynb
│   │   │   ├── pom.xml
│   │   │   ├── python/
│   │   │   │   └── com/
│   │   │   │       ├── __init__.py
│   │   │   │       └── nvidia/
│   │   │   │           ├── __init__.py
│   │   │   │           └── spark/
│   │   │   │               ├── __init__.py
│   │   │   │               └── examples/
│   │   │   │                   ├── __init__.py
│   │   │   │                   └── agaricus/
│   │   │   │                       ├── __init__.py
│   │   │   │                       └── main.py
│   │   │   └── scala/
│   │   │       └── src/
│   │   │           └── com/
│   │   │               └── nvidia/
│   │   │                   └── spark/
│   │   │                       └── examples/
│   │   │                           └── agaricus/
│   │   │                               └── Main.scala
│   │   ├── aggregator/
│   │   │   └── .gitignore
│   │   ├── app-parameters/
│   │   │   ├── supported_xgboost_parameters_python.md
│   │   │   └── supported_xgboost_parameters_scala.md
│   │   ├── assembly/
│   │   │   └── assembly-no-scala.xml
│   │   ├── main.py
│   │   ├── mortgage/
│   │   │   ├── .gitignore
│   │   │   ├── notebooks/
│   │   │   │   ├── python/
│   │   │   │   │   ├── MortgageETL+XGBoost.ipynb
│   │   │   │   │   ├── MortgageETL.ipynb
│   │   │   │   │   ├── cv-mortgage-gpu.ipynb
│   │   │   │   │   └── mortgage-gpu.ipynb
│   │   │   │   └── scala/
│   │   │   │       ├── mortgage-ETL.ipynb
│   │   │   │       ├── mortgage-gpu.ipynb
│   │   │   │       └── mortgage_gpu_crossvalidation.ipynb
│   │   │   ├── pom.xml
│   │   │   ├── python/
│   │   │   │   └── com/
│   │   │   │       ├── __init__.py
│   │   │   │       └── nvidia/
│   │   │   │           ├── __init__.py
│   │   │   │           └── spark/
│   │   │   │               ├── __init__.py
│   │   │   │               └── examples/
│   │   │   │                   ├── __init__.py
│   │   │   │                   └── mortgage/
│   │   │   │                       ├── __init__.py
│   │   │   │                       ├── consts.py
│   │   │   │                       ├── cross_validator_main.py
│   │   │   │                       ├── etl.py
│   │   │   │                       ├── etl_main.py
│   │   │   │                       └── main.py
│   │   │   └── scala/
│   │   │       └── src/
│   │   │           └── com/
│   │   │               └── nvidia/
│   │   │                   └── spark/
│   │   │                       └── examples/
│   │   │                           └── mortgage/
│   │   │                               ├── CrossValidationMain.scala
│   │   │                               ├── ETLMain.scala
│   │   │                               ├── Main.scala
│   │   │                               ├── Mortgage.scala
│   │   │                               └── XGBoostETL.scala
│   │   ├── pack_pyspark_example.sh
│   │   ├── pom.xml
│   │   ├── taxi/
│   │   │   ├── .gitignore
│   │   │   ├── notebooks/
│   │   │   │   ├── python/
│   │   │   │   │   ├── cv-taxi-gpu.ipynb
│   │   │   │   │   ├── taxi-ETL.ipynb
│   │   │   │   │   └── taxi-gpu.ipynb
│   │   │   │   └── scala/
│   │   │   │       ├── taxi-ETL.ipynb
│   │   │   │       ├── taxi-gpu.ipynb
│   │   │   │       └── taxi_gpu_crossvalidation.ipynb
│   │   │   ├── pom.xml
│   │   │   ├── python/
│   │   │   │   └── com/
│   │   │   │       ├── __init__.py
│   │   │   │       └── nvidia/
│   │   │   │           ├── __init__.py
│   │   │   │           └── spark/
│   │   │   │               ├── __init__.py
│   │   │   │               └── examples/
│   │   │   │                   ├── __init__.py
│   │   │   │                   └── taxi/
│   │   │   │                       ├── __init__.py
│   │   │   │                       ├── consts.py
│   │   │   │                       ├── cross_validator_main.py
│   │   │   │                       ├── etl_main.py
│   │   │   │                       ├── main.py
│   │   │   │                       └── pre_process.py
│   │   │   └── scala/
│   │   │       └── src/
│   │   │           └── com/
│   │   │               └── nvidia/
│   │   │                   └── spark/
│   │   │                       └── examples/
│   │   │                           └── taxi/
│   │   │                               ├── CrossValidationMain.scala
│   │   │                               ├── ETLMain.scala
│   │   │                               ├── Main.scala
│   │   │                               └── Taxi.scala
│   │   └── utility/
│   │       ├── .gitignore
│   │       ├── pom.xml
│   │       ├── python/
│   │       │   └── com/
│   │       │       ├── __init__.py
│   │       │       └── nvidia/
│   │       │           ├── __init__.py
│   │       │           └── spark/
│   │       │               ├── __init__.py
│   │       │               └── examples/
│   │       │                   ├── __init__.py
│   │       │                   ├── main.py
│   │       │                   └── utility/
│   │       │                       ├── __init__.py
│   │       │                       ├── args.py
│   │       │                       └── utils.py
│   │       └── scala/
│   │           └── src/
│   │               └── com/
│   │                   └── nvidia/
│   │                       └── spark/
│   │                           └── examples/
│   │                               └── utility/
│   │                                   ├── Benchmark.scala
│   │                                   ├── SparkSetup.scala
│   │                                   ├── Vectorize.scala
│   │                                   └── XGBoostArgs.scala
│   └── spark-connect-gpu/
│       ├── client/
│       │   ├── Dockerfile
│       │   ├── README.md
│       │   ├── docker-compose.yaml
│       │   ├── nds/
│       │   │   ├── nds.ipynb
│       │   │   └── query_0.sql
│       │   ├── notebook/
│       │   │   ├── README.md
│       │   │   ├── spark-connect-gpu-etl-ml.ipynb
│       │   │   └── work/
│       │   │       ├── csv_raw_schema.ddl
│       │   │       └── name_mapping.csv
│       │   ├── python/
│       │   │   ├── batch-job.ipynb
│       │   │   └── batch-job.py
│       │   ├── requirements.txt
│       │   └── scala/
│       │       ├── .gitignore
│       │       ├── pom.xml
│       │       ├── run.sh
│       │       ├── scala-run.ipynb
│       │       └── src/
│       │           └── main/
│       │               └── scala/
│       │                   └── connect.scala
│       └── server/
│           ├── README.md
│           ├── docker-compose.yaml
│           ├── proxy-service/
│           │   ├── Dockerfile
│           │   └── nginx.conf
│           ├── spark-connect-server/
│           │   ├── Dockerfile
│           │   ├── requirements.txt
│           │   ├── spark-defaults.conf
│           │   └── spark-env.sh
│           ├── spark-master/
│           │   ├── Dockerfile
│           │   └── spark-env.sh
│           └── spark-worker/
│               ├── Dockerfile
│               ├── requirements.txt
│               └── spark-env.sh
├── scripts/
│   ├── README.md
│   ├── building/
│   │   └── python_build.sh
│   ├── csp-startup-scripts/
│   │   ├── README.md
│   │   └── emr/
│   │       ├── cgroup-bootstrap-action-emr6.sh
│   │       ├── cgroup-bootstrap-action-emr7.sh
│   │       ├── config-emr6.json
│   │       ├── config-emr7.json
│   │       └── emr-spark-plugin-startup.py
│   ├── encoding/
│   │   └── python/
│   │       ├── .gitignore
│   │       ├── com/
│   │       │   ├── __init__.py
│   │       │   └── nvidia/
│   │       │       ├── __init__.py
│   │       │       └── spark/
│   │       │           ├── __init__.py
│   │       │           └── encoding/
│   │       │               ├── __init__.py
│   │       │               ├── criteo/
│   │       │               │   ├── __init__.py
│   │       │               │   ├── common.py
│   │       │               │   ├── one_hot_cpu_main.py
│   │       │               │   └── target_cpu_main.py
│   │       │               ├── main.py
│   │       │               └── utility/
│   │       │                   ├── __init__.py
│   │       │                   ├── args.py
│   │       │                   └── utils.py
│   │       └── main.py
│   └── encoding-sample/
│       ├── repartition.py
│       ├── run.sh
│       └── truncate-model.py
└── tools/
    ├── databricks/
    │   ├── README.md
    │   ├── [RAPIDS Accelerator for Apache Spark] Profiling Tool Notebook Template.ipynb
    │   └── [RAPIDS Accelerator for Apache Spark] Qualification Tool Notebook Template.ipynb
    └── emr/
        ├── README.md
        ├── [RAPIDS Accelerator for Apache Spark] Profiling Tool Notebook Template.ipynb
        └── [RAPIDS Accelerator for Apache Spark] Qualification Tool Notebook Template.ipynb

================================================
FILE CONTENTS
================================================

================================================
FILE: .github/ISSUE_TEMPLATE/bug_report.md
================================================
---
name: Bug report
about: Create a report to help us improve
title: ''
labels: ''
assignees: GaryShen2008

---

**Describe the bug**
A clear and concise description of what the bug is.

**Steps/Code to reproduce bug**
Please provide a list of steps or a code sample to reproduce the issue.
Avoid posting private or sensitive data.

**Expected behavior**
A clear and concise description of what you expected to happen.

**Environment details (please complete the following information)**
 - Environment location: [Standalone, YARN, Kubernetes, Cloud(specify cloud provider)]
 - Spark configuration settings related to the issue

================================================
FILE: .github/workflows/add-to-project.yml
================================================
# Copyright (c) 2024-2025, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

name: Add new issues and pull requests to project

on:
  issues:
    types:
      - opened
  pull_request_target:
    types:
      - opened

jobs:
  Add-to-project:
    if: github.repository_owner == 'NVIDIA' # avoid adding issues from forks
    runs-on: ubuntu-latest
    steps:
      - name: add-to-project
        uses: NVIDIA/spark-rapids-common/add-to-project@main
        with:
          token: ${{ secrets.PROJECT_TOKEN }}


================================================
FILE: .github/workflows/license-header-check.yml
================================================
# Copyright (c) 2024-2025, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# A workflow to check copyright/license header
name: license header check

on:
  pull_request:
    types: [opened, synchronize, reopened]

jobs:
  license-header-check:
    runs-on: ubuntu-latest
    if: "!contains(github.event.pull_request.title, '[bot]')"
    steps:
      - name: Get checkout depth
        run: |
          echo "PR_FETCH_DEPTH=$(( ${{ github.event.pull_request.commits }} + 10 ))" >> $GITHUB_ENV

      - name: Checkout code
        uses: NVIDIA/spark-rapids-common/checkout@main
        with:
          fetch-depth: ${{ env.PR_FETCH_DEPTH }}

      - name: license-header-check
        uses: NVIDIA/spark-rapids-common/license-header-check@main
        with:
          included_file_patterns: |
            *.sh,
            *.java,
            *.py,
            *.pbtxt,
            *Dockerfile*,
            *Jenkinsfile*,
            *.yml,
            *.yaml,
            *.cpp,
            *.hpp,
            *.txt,
            *.cu,
            *.scala,
            *.ini,
            *.xml


================================================
FILE: .github/workflows/markdown-links-check/markdown-links-check-config.json
================================================
{
  "ignorePatterns": [
    {
      "pattern": "/docs"
    },
    {
      "pattern": "/datasets"
    },
    {
      "pattern": "/dockerfile"
    },
    {
      "pattern": "/examples"
    },
    {
      "pattern": "^http://localhost"
    },
    {
      "pattern": "^http://spark-master"
    },
    {
      "pattern": "^http://spark-worker"
    },
    {
      "pattern": "^http://spark-connect-server"
    }
  ],
  "timeout": "15s",
  "retryOn429": true,
  "retryCount": 30,
  "aliveStatusCodes": [200, 403]
}


================================================
FILE: .github/workflows/markdown-links-check.yml
================================================
# Copyright (c) 2022-2025, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# A workflow to check if PR got broken hyperlinks
name: Check Markdown links

on:
  pull_request:
    types: [opened, synchronize, reopened]

jobs:
  markdown-link-check:
    runs-on: ubuntu-latest
    steps:
    - name: work around permission issue
      run: git config --global --add safe.directory /github/workspace
    - name: checkout code
      uses: NVIDIA/spark-rapids-common/checkout@main
    - name: markdown link check
      uses: NVIDIA/spark-rapids-common/markdown-link-check@main
      with:
        max-depth: -1
        use-verbose-mode: 'yes'
        config-file: '.github/workflows/markdown-links-check/markdown-links-check-config.json'
        base-branch: 'main'

================================================
FILE: .github/workflows/shell-check.yml
================================================
# Copyright (c) 2025, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# A workflow to check shell script syntax
name: shell check

on:
  pull_request:
    types: [opened, synchronize, reopened]

jobs:
  shell-check:
    runs-on: ubuntu-latest
    if: "!contains(github.event.pull_request.title, '[bot]')"
    steps:
      - name: Checkout code
        uses: NVIDIA/spark-rapids-common/checkout@main

      - name: Run ShellCheck
        uses: NVIDIA/spark-rapids-common/shell-check@main
        with:
          excluded_codes:
            SC2164,
            SC2076,
            SC2054

          # codes explanation:
          # SC2164: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.
          # SC2076: Remove quotes from right-hand side of =~ to match as a regex rather than literally.
          # SC2054: Use spaces, not commas, to separate array elements.


================================================
FILE: .github/workflows/signoff-check.yml
================================================
# Copyright (c) 2021-2024, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# A workflow to check if PR got sign-off
name: signoff check

on:
  pull_request_target:
    types: [opened, synchronize, reopened]

jobs:
  signoff-check:
    runs-on: ubuntu-latest
    steps:
      - name: signoff
        uses: NVIDIA/spark-rapids-common/signoff-check@main
        with:
          owner: ${{ github.repository_owner }}
          repo: spark-rapids-examples
          pull_number: ${{ github.event.number }}
          token: ${{ secrets.GITHUB_TOKEN }}


================================================
FILE: .gitignore
================================================
*#*#
*.#*
*.iml
*.ipr
*.iws
*.pyc
*.pyo
*.swp
*~
.DS_Store
.cache
.classpath
.ensime
.ensime_cache/
.ensime_lucene
.generated-mima*
.idea/
.idea_modules/
.project
.pydevproject
.scala_dependencies
.settings
hs_err*.log
target


================================================
FILE: CONTRIBUTING.md
================================================
# Contributing to Spark Examples

### Sign your work

We require that all contributors sign off on their commits.

This certifies that the contribution is your original work, or that you have the right to submit it under the same or a compatible license.

Any contribution containing commits that are not signed off will not be accepted.

To sign off on a commit use the `--signoff` (or `-s`) option when committing your changes:

```shell
git commit -s -m "Add cool feature."
```

This will append the following to your commit message:

```
Signed-off-by: Your Name <your@email.com>
```

The sign-off is a simple line at the end of the explanation for the patch.

Your signature certifies that you wrote the patch or otherwise have the right to pass it on as an open-source patch.

Use your real name; no pseudonyms or anonymous contributions.

If you set your `user.name` and `user.email` git configs, you can sign your commit automatically with `git commit -s`.


The signoff means you certify the below (from [developercertificate.org](https://developercertificate.org)):

```
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.
```

Note: This section `Sign your work` is derived from [https://github.com/NVIDIA/spark-rapids](https://github.com/NVIDIA/spark-rapids)
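A practical note (an editorial addition, not part of the upstream CONTRIBUTING.md): if you have already committed without `--signoff`, git can add the `Signed-off-by:` trailer after the fact. The sketch below demonstrates this in a throwaway repository so it is safe to run anywhere; the user name and email are placeholders.

```shell
# Demonstration in a throwaway repository (safe to run anywhere).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# Make a commit and forget to sign it off.
echo "demo" > file.txt
git add file.txt
git commit -q -m "Add cool feature."

# Amend the most recent commit in place, appending the sign-off trailer.
git commit --amend -q --signoff --no-edit

# The commit message now ends with:
#   Signed-off-by: Example User <user@example.com>
git log -1 --format=%B
```

For a branch with several unsigned commits, `git rebase --signoff <base>` rewrites each commit on the branch with your sign-off in one pass (note that both amending and rebasing rewrite history, so an already-pushed branch needs a force push).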


================================================
FILE: LICENSE
================================================
                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   APPENDIX: How to apply the Apache License to your work.

      To apply the Apache License to your work, attach the following
      boilerplate notice, with the fields enclosed by brackets "{}"
      replaced with your own identifying information. (Don't include
      the brackets!)  The text should be enclosed in the appropriate
      comment syntax for the file format. We also recommend that a
      file or class name and description of purpose be included on the
      same "printed page" as the copyright notice for easier
      identification within third-party archives.

   Copyright 2018 NVIDIA Corporation

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.


================================================
FILE: README.md
================================================
# spark-rapids-examples

This is the [RAPIDS Accelerator for Apache Spark](https://nvidia.github.io/spark-rapids/) examples repo.
RAPIDS Accelerator for Apache Spark accelerates Spark applications with no code changes.
You can download the latest version of RAPIDS Accelerator [here](https://nvidia.github.io/spark-rapids/docs/download.html).
This repo contains examples and applications that showcase the performance and benefits of using
RAPIDS Accelerator in data processing and machine learning pipelines.
There are broadly five categories of examples in this repo:
1. [SQL/Dataframe](./examples/SQL+DF-Examples) 
2. [Spark XGBoost](./examples/XGBoost-Examples) 
3. [Machine Learning/Deep Learning](./examples/ML+DL-Examples) 
4. [RAPIDS UDF](./examples/UDF-Examples)
5. [Databricks Tools demo notebooks](./tools/databricks)

For more information on each example, please look into its respective category.

Here is the list of notebooks in this repo:

|   | Category  | Notebook Name | Description
| ------------- | ------------- | ------------- | -------------
| 1 | SQL/DF | Microbenchmark | Spark SQL operations such as expand, hash aggregate, windowing, and cross joins with up to 20x performance benefits
| 2 | SQL/DF | Customer Churn | Data federation for modeling customer Churn with a sample telco customer data
| 3 | XGBoost | Agaricus (Scala) | Uses XGBoost classifier function to create model that can accurately differentiate between edible and poisonous mushrooms with the [agaricus dataset](https://archive.ics.uci.edu/ml/datasets/mushroom)
| 4 | XGBoost | Mortgage (Scala) | End-to-end ETL + XGBoost example to predict mortgage default with [Fannie Mae Single-Family Loan Performance Data](https://capitalmarkets.fanniemae.com/credit-risk-transfer/single-family-credit-risk-transfer/fannie-mae-single-family-loan-performance-data)
| 5 | XGBoost | Taxi (Scala) | End-to-end ETL + XGBoost example to predict taxi trip fare amount with [NYC taxi trips data set](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
| 6 | ML/DL | PCA | [Spark-Rapids-ML](https://github.com/NVIDIA/spark-rapids-ml) based PCA example to train and transform with a synthetic dataset
| 7 | ML/DL | DL Inference | Several notebooks demonstrating distributed model inference on Spark using the `predict_batch_udf` across various frameworks: PyTorch, HuggingFace, vLLM, and TensorFlow
| 8 | SQL/DF + MLlib | GPU-Accelerated Spark Connect | End-to-end SQL/DF + MLlib acceleration to predict mortgage default with [Fannie Mae Single-Family Loan Performance Data](https://capitalmarkets.fanniemae.com/credit-risk-transfer/single-family-credit-risk-transfer/fannie-mae-single-family-loan-performance-data) using the lightweight Spark Connect integration for Apache Spark 4.0+
| 9 | SQL/DF | [TPC-DS](https://www.tpc.org/tpcds/) Scale Factor 10 | Comparison of Spark SQL CPU vs GPU. Easy to run locally and on Google Colab

Here is the list of Apache Spark applications (Scala and PySpark) that 
can be built for running on GPU with RAPIDS Accelerator in this repo:

|   | Category  | Application Name | Description
| ------------- | ------------- | ------------- | -------------
| 1 | XGBoost | Agaricus (Scala) | Uses XGBoost classifier function to create model that can accurately differentiate between edible and poisonous mushrooms with the [agaricus dataset](https://archive.ics.uci.edu/ml/datasets/mushroom)
| 2 | XGBoost | Mortgage (Scala) | End-to-end ETL + XGBoost example to predict mortgage default with [Fannie Mae Single-Family Loan Performance Data](https://capitalmarkets.fanniemae.com/credit-risk-transfer/single-family-credit-risk-transfer/fannie-mae-single-family-loan-performance-data)
| 3 | XGBoost | Taxi (Scala) | End-to-end ETL + XGBoost example to predict taxi trip fare amount with [NYC taxi trips data set](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
| 4 | ML/DL | PCA | [Spark-Rapids-ML](https://github.com/NVIDIA/spark-rapids-ml) based PCA example to train and transform with a synthetic dataset
| 5 | UDF | URL Decode | Decodes URL-encoded strings using the [Java APIs of RAPIDS cudf](https://docs.rapids.ai/api/cudf-java/legacy/)
| 6 | UDF | URL Encode | URL-encodes strings using the [Java APIs of RAPIDS cudf](https://docs.rapids.ai/api/cudf-java/legacy/)
| 7 | UDF | [CosineSimilarity](./examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/java/com/nvidia/spark/rapids/udf/java/CosineSimilarity.java) | Computes the cosine similarity between two float vectors using [native code](./examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/cpp/src)
| 8 | UDF | [StringWordCount](./examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/java/com/nvidia/spark/rapids/udf/hive/StringWordCount.java)  | Implements a Hive simple UDF using [native code](./examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/cpp/src) to count words in strings


================================================
FILE: dockerfile/Dockerfile
================================================
# Copyright (c) 2019-2023, NVIDIA CORPORATION. All rights reserved.
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

FROM nvidia/cuda:11.8.0-devel-ubuntu18.04
ARG spark_uid=185

# Install java dependencies 
RUN apt-get update && apt-get install -y --no-install-recommends openjdk-8-jdk openjdk-8-jre
ENV JAVA_HOME /usr/lib/jvm/java-1.8.0-openjdk-amd64
ENV PATH $PATH:/usr/lib/jvm/java-1.8.0-openjdk-amd64/jre/bin:/usr/lib/jvm/java-1.8.0-openjdk-amd64/bin

# Before building the docker image, first build and make a Spark distribution following
# the instructions in http://spark.apache.org/docs/latest/building-spark.html.
# If this docker file is being used in the context of building your images from a Spark
# distribution, the docker build command should be invoked from the top level directory
# of the Spark distribution. E.g.:
# docker build -t spark:latest -f kubernetes/dockerfiles/spark/Dockerfile .

RUN set -ex && \
    ln -s /lib /lib64 && \
    mkdir -p /opt/spark && \
    mkdir -p /opt/spark/examples && \
    mkdir -p /opt/spark/work-dir && \
    touch /opt/spark/RELEASE && \
    rm /bin/sh && \
    ln -sv /bin/bash /bin/sh && \
    echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && \
    chgrp root /etc/passwd && chmod ug+rw /etc/passwd

ENV DEBIAN_FRONTEND noninteractive
RUN apt-get update && apt-get install -y --no-install-recommends apt-utils \
 && apt-get install -y --no-install-recommends python libgomp1 \
 && rm -rf /var/lib/apt/lists/*

COPY jars /opt/spark/jars
COPY bin /opt/spark/bin
COPY sbin /opt/spark/sbin
COPY kubernetes/dockerfiles/spark/entrypoint.sh /opt/
COPY examples /opt/spark/examples
COPY kubernetes/tests /opt/spark/tests
COPY data /opt/spark/data

ENV SPARK_HOME /opt/spark

WORKDIR /opt/spark/work-dir
RUN chmod g+w /opt/spark/work-dir

ENV TINI_VERSION v0.18.0
ADD https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini /sbin/tini
RUN chmod +rx /sbin/tini

ENTRYPOINT [ "/opt/entrypoint.sh" ]

# Specify the User that the actual main process will run as
USER ${spark_uid}



================================================
FILE: dockerfile/gpu_executor_template.yaml
================================================
# Copyright (c) 2024, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: v1
kind: Pod
spec:
  containers:
    - name: executor
      resources:
        limits:
          nvidia.com/gpu: 1



================================================
FILE: docs/get-started/xgboost-examples/building-sample-apps/python.md
================================================
# Build XGBoost Python Examples

## Build

Follow these steps to package the Python zip file:

``` bash
git clone https://github.com/NVIDIA/spark-rapids-examples.git
cd spark-rapids-examples/scripts/building
sh python_build.sh
```


## Files Required by PySpark

Two files are required by PySpark:

+ *samples.zip*
  
  the package containing all of the example code.
  Executing the build commands above generates the samples.zip file in the 'spark-rapids-examples/examples/XGBoost-Examples' folder

+ *main.py*
  
  the entry point for PySpark; you can find it in the 'spark-rapids-examples/examples/XGBoost-Examples' folder
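
With the two files in place, a minimal spark-submit invocation might look like the sketch below. This is a non-runnable command fragment: the master URL and paths are placeholders for your own setup, and any application-specific arguments expected by main.py would follow it.

``` shell
# Sketch: how samples.zip and main.py are passed to spark-submit.
# EXAMPLES path and master URL are placeholders for your environment.
EXAMPLES=/path/to/spark-rapids-examples/examples/XGBoost-Examples

spark-submit \
    --master spark://your-master-host:7077 \
    --py-files ${EXAMPLES}/samples.zip \
    ${EXAMPLES}/main.py
    # arguments for the chosen example would follow here; see main.py
```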


================================================
FILE: docs/get-started/xgboost-examples/building-sample-apps/scala.md
================================================
# Build XGBoost Scala Examples

The examples rely on [XGBoost](https://github.com/dmlc/xgboost).

## Build

Follow these steps to build the Scala jars:

``` bash
git clone https://github.com/NVIDIA/spark-rapids-examples.git
cd spark-rapids-examples/examples/XGBoost-Examples
mvn package
```

## The generated Jars

Let's assume LATEST_VERSION is **0.2.3**. The build process generates two jars, as follows:

+ *aggregator/target/sample_xgboost_apps-${LATEST_VERSION}.jar*
  
  contains only the example classes, so it must be submitted to Spark together with the other dependent jars

+ *aggregator/target/sample_xgboost_apps-${LATEST_VERSION}-jar-with-dependencies.jar*
  
  contains both the example classes and the classes from the dependent jars, except for cudf and rapids
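
As a rough sketch of how the jar-with-dependencies artifact is typically submitted (the master URL and `RAPIDS_JAR` path are placeholders; `GPUMain` is the mortgage example's entry class used elsewhere in these docs):

``` shell
# Sketch: submitting the fat jar. The RAPIDS Accelerator jar is supplied
# separately because it is excluded from the jar-with-dependencies artifact.
LATEST_VERSION=0.2.3

spark-submit \
    --master spark://your-master-host:7077 \
    --jars ${RAPIDS_JAR} \
    --class com.nvidia.spark.examples.mortgage.GPUMain \
    aggregator/target/sample_xgboost_apps-${LATEST_VERSION}-jar-with-dependencies.jar
    # example-specific arguments (-dataPath, -num_workers, ...) would follow here
```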



================================================
FILE: docs/get-started/xgboost-examples/csp/aws/ec2.md
================================================
# Get Started with XGBoost4J-Spark on AWS EC2

This is a getting started guide for Spark 3.2+ on AWS EC2. By the end of this guide, the reader will be able to run a sample Apache Spark application on NVIDIA GPUs on AWS EC2.

For more details on AWS EC2 and getting started, please check the [AWS documentation](https://aws.amazon.com/ec2/getting-started/).

## Configure and Launch AWS EC2

Go to the AWS Management Console, select a region (e.g. Oregon), and click the EC2 service.

### Step 1:  Launch New Instance

Click "Launch instance" at the EC2 Management Console, and select "Launch instance".

![Step 1:  Launch New Instance](pics/ec2_step1.png)

### Step 2:  Configure Instance

#### Step 2.1: Choose an Amazon Machine Image (AMI)

Search for "deep learning base ami" and choose "Deep Learning Base AMI (Ubuntu 18.04)". Click "Select".

![Step 2.1: Choose an Amazon Machine Image (AMI)](pics/ec2_step2-1.png)

#### Step 2.2: Choose an Instance Type

Choose the "p3.2xlarge" instance type. Click "Next: Configure Instance Details" at the bottom right.

![Step 2.2: Choose an Instance Type](pics/ec2_step2-2.png)

#### Step 2.3: Configure Instance Details

Nothing needs to be changed here; just make sure "Number of instances" is 1. Click "Next: Add Storage" at the bottom right.

![Step 2.3: Configure Instance Details](pics/ec2_step2-3.png)

#### Step 2.4: Add Storage

Change the root disk size based on your needs; you can also add an EBS volume by clicking "Add New Volume". In this sample, we use the default 50 GB. Click "Next: Add Tags" at the bottom right.

For more details on AWS EBS, please check the [AWS documentation](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AmazonEBS.html).

![Step 2.4: Add Storage](pics/ec2_step2-4.png)

#### Step 2.5: Add Tags

You can add tags here or skip this step. In this sample, we skip it. Click "Next: Configure Security Group" at the bottom right.

#### Step 2.6: Configure Security Group

For convenience, in this sample, we open all ports. You can add your own rules instead.

Create a new security group and select "All traffic" as the type. Click "Review and Launch" at the bottom right.

For more details on security groups, please check the [AWS documentation](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-security-groups.html).

![Step 2.6: Configure Security Group](pics/ec2_step2-6.png)

#### Step 2.7: Review Instance Launch

Review your configuration and click "Launch" at the bottom right. Choose the key pair you have and launch the instance.

Return to "Instances | EC2 Management Console", where you will find your instance running. (It may take a few minutes to initialize.)

![Step 2.7: Review Instance Launch](pics/ec2_step2-7.png)

## Launch EC2 and Configure Spark 3.2+

### Step 1:  Launch EC2

Copy the "Public DNS (IPv4)" of your instance, then use ssh with your private key to log in to the EC2 machine as user "ubuntu":

``` bash
ssh -i "key.pem" ubuntu@xxxx.region.compute.amazonaws.com
```

### Step 2: Download Spark package

Download the Spark package and set the environment variable:

``` bash
# download the spark
wget https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
tar zxf spark-3.2.1-bin-hadoop3.2.tgz
export SPARK_HOME=/your/spark/spark-3.2.1-bin-hadoop3.2
```

### Step 3: Download jars for S3A (optional)

If your dataset is on S3, you should download the jar files below to enable access to S3; in this sample, we use data on S3.
The jars should go under $SPARK_HOME/jars:

``` bash
cd $SPARK_HOME/jars
wget https://github.com/JodaOrg/joda-time/releases/download/v2.10.5/joda-time-2.10.5.jar
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.0/hadoop-aws-3.2.0.jar
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.11.687/aws-java-sdk-1.11.687.jar
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-core/1.11.687/aws-java-sdk-core-1.11.687.jar
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-dynamodb/1.11.687/aws-java-sdk-dynamodb-1.11.687.jar
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-s3/1.11.687/aws-java-sdk-s3-1.11.687.jar
```

### Step 4: Start Spark Standalone

#### Step 4.1: Edit spark-default.conf

cd into $SPARK_HOME/conf and edit spark-defaults.conf.

By default, there is only spark-defaults.conf.template in $SPARK_HOME/conf; you can edit it and rename it to spark-defaults.conf.
You can find getGpusResources.sh at $SPARK_HOME/examples/src/main/scripts/getGpusResources.sh.

``` bash
spark.worker.resource.gpu.amount 1
spark.worker.resource.gpu.discoveryScript /path/to/getGpusResources.sh
```

The gpu.amount value should be less than or equal to the number of GPUs the worker has.

#### Step 4.2: Start Spark Standalone

Start Spark. The default master-spark-URL is spark://$HOSTNAME:7077.

``` bash
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-slave.sh <master-spark-URL>
```

## Launch XGBoost-Spark examples on Spark 3.2+

### Step 1: Download Jars

Make sure you have prepared the necessary packages and dataset by following this [guide](/docs/get-started/xgboost-examples/prepare-package-data/preparation-scala.md)

Copy rapids jars to `$SPARK_HOME/jars`

``` bash
cp $RAPIDS_JAR $SPARK_HOME/jars/
```

### Step 2: Create sample running script

Create a run.sh script with the content below, making sure to change the paths in it to your own, as well as your AWS key/secret.

``` bash
#!/bin/bash
export SPARK_HOME=/your/path/to/spark-3.2.1-bin-hadoop3.2

export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH

export TOTAL_CORES=8
export NUM_EXECUTORS=1
export NUM_EXECUTOR_CORES=$((${TOTAL_CORES}/${NUM_EXECUTORS}))

export S3A_CREDS_USR=your_aws_key

export S3A_CREDS_PSW=your_aws_secret

spark-submit --master spark://$HOSTNAME:7077 \
        --deploy-mode client \
        --driver-memory 10G \
        --executor-memory 22G \
        --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
        --conf spark.hadoop.fs.s3a.access.key=$S3A_CREDS_USR \
        --conf spark.hadoop.fs.s3a.secret.key=$S3A_CREDS_PSW \
        --conf spark.executor.memoryOverhead=28G \
        --conf spark.cores.max=$TOTAL_CORES \
        --conf spark.executor.cores=$NUM_EXECUTOR_CORES \
        --conf spark.task.cpus=$NUM_EXECUTOR_CORES \
        --conf spark.sql.files.maxPartitionBytes=4294967296 \
        --conf spark.yarn.maxAppAttempts=1 \
        --conf spark.plugins=com.nvidia.spark.SQLPlugin \
        --conf spark.rapids.memory.gpu.pooling.enabled=false \
        --conf spark.executor.resource.gpu.amount=1 \
        --conf spark.task.resource.gpu.amount=1 \
        --class com.nvidia.spark.examples.mortgage.GPUMain \
        ${SAMPLE_JAR} \
        -num_workers=${NUM_EXECUTORS} \
        -format=csv \
        -dataPath="train::your-train-data-path" \
        -dataPath="trans::your-eval-data-path" \
        -numRound=100 -max_depth=8 -nthread=$NUM_EXECUTOR_CORES -showFeatures=0 \
        -tree_method=gpu_hist
```

### Step 3: Submit Sample job

Run run.sh

``` bash
./run.sh
```

After running successfully, the job will print an accuracy benchmark for model prediction.  


================================================
FILE: docs/get-started/xgboost-examples/csp/databricks/databricks.md
================================================
Get Started with XGBoost4J-Spark on Databricks
======================================================

This is a getting started guide to XGBoost4J-Spark on Databricks. At the end of this guide, the reader will be able to run a sample Apache Spark application that runs on NVIDIA GPUs on Databricks.

Prerequisites
-------------

* Apache Spark 3.x running in Databricks Runtime 10.4 ML or 11.3 ML with GPU
  * AWS: 10.4 LTS ML (GPU, Scala 2.12, Spark 3.2.1) or 11.3 LTS ML (GPU, Scala 2.12, Spark 3.3.0)
  * Azure: 10.4 LTS ML (GPU, Scala 2.12, Spark 3.2.1) or 11.3 LTS ML (GPU, Scala 2.12, Spark 3.3.0)

The number of GPUs per node dictates the number of Spark executors that can run in that node. Each executor should only be allowed to run 1 task at any given time.
   
Start A Databricks Cluster
--------------------------
Before creating the cluster, we need to create an [initialization script](https://docs.databricks.com/clusters/init-scripts.html) for the
cluster to install the RAPIDS jars. Databricks recommends storing all cluster-scoped init scripts using workspace files.
Each user has a Home directory configured under the /Users directory in the workspace.
Navigate to your home directory in the UI, select **Create** > **File** from the menu,
and create an `init.sh` script with the following contents:
   ```bash
   #!/bin/bash
   sudo wget -O /databricks/jars/rapids-4-spark_2.12-26.02.0.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/26.02.0/rapids-4-spark_2.12-26.02.0.jar
   ```
1. Select the Databricks Runtime Version from one of the supported runtimes specified in the
   Prerequisites section.
2. Choose the number of workers that matches the number of GPUs you want to use.
3. Select a worker type. On AWS, use nodes with 1 GPU each such as `p3.2xlarge` or `g4dn.xlarge`.
   For Azure, choose GPU nodes such as Standard_NC6s_v3. For GCP, choose N1 or A2 instance types with GPUs. 
4. Select the driver type. Generally this can be set to be the same as the worker.
5. Click the “Edit” button, then navigate down to the “Advanced Options” section. Select the “Init Scripts” tab in 
   the advanced options section, and paste the workspace path to the initialization script:`/Users/user@domain/init.sh`, then click “Add”.
   ![Init Script](../../../../img/databricks/initscript.png)
6. Now select the “Spark” tab, and paste the following config options into the Spark Config section.
   Change the config values based on the workers you choose. See Apache Spark
   [configuration](https://spark.apache.org/docs/latest/configuration.html) and RAPIDS Accelerator
   for Apache Spark [descriptions](https://nvidia.github.io/spark-rapids/docs/configs.html) for each config.

    The
    [`spark.task.resource.gpu.amount`](https://spark.apache.org/docs/latest/configuration.html#scheduling)
    configuration is defaulted to 1 by Databricks. That means only 1 task can run on an
    executor with 1 GPU, which is limiting, especially for reads and writes from Parquet. Set
    this to 1/(number of cores per executor), which allows multiple tasks to run in parallel
    just as on the CPU side. A smaller value is fine as well.
    Note: Please remove the `spark.task.resource.gpu.amount` config for a single-node Databricks 
    cluster because Spark local mode does not support GPU scheduling.
   
    ```bash
    spark.plugins com.nvidia.spark.SQLPlugin
    spark.task.resource.gpu.amount 0.1
    spark.rapids.memory.pinnedPool.size 2G
    spark.rapids.sql.concurrentGpuTasks 2
    ```

    ![Spark Config](../../../../img/databricks/sparkconfig.png)

    If running Pandas UDFs with GPU support from the plugin, at least the three additional options
    below are required. The `spark.python.daemon.module` option chooses the right Python daemon module
    for Databricks. On Databricks, the Python runtime requires different parameters than the
    Spark one, so a dedicated Python daemon module `rapids.daemon_databricks` is created and should
    be specified here. Set the config
    [`spark.rapids.sql.python.gpu.enabled`](https://nvidia.github.io/spark-rapids/docs/configs.html#sql.python.gpu.enabled) to `true` to
    enable GPU support for python. Add the path of the plugin jar (supposing it is placed under
    `/databricks/jars/`) to the `spark.executorEnv.PYTHONPATH` option. For more details please go to
    [GPU Scheduling For Pandas UDF](https://nvidia.github.io/spark-rapids/docs/additional-functionality/rapids-udfs.html#gpu-support-for-pandas-udf)

    ```bash
    spark.rapids.sql.python.gpu.enabled true
    spark.python.daemon.module rapids.daemon_databricks
    spark.executorEnv.PYTHONPATH /databricks/jars/rapids-4-spark_2.12-26.02.0.jar:/databricks/spark/python
    ```
   Note that the Python memory pool requires the cudf library, so you need to install the cudf library on
   each worker node (`pip install cudf-cu11 --extra-index-url=https://pypi.nvidia.com`) or disable the Python
   memory pool with `spark.rapids.python.memory.gpu.pooling.enabled=false`.
   
7. Click `Create Cluster`; the cluster is now enabled for GPU-accelerated Spark.

Install the xgboost4j_spark jar in the cluster
---------------------------

1. See [Libraries](https://docs.databricks.com/user-guide/libraries.html) for how to install jars from DBFS
2. Go to the "Libraries" tab under your cluster and install `dbfs:/FileStore/jars/${XGBOOST4J_SPARK_JAR}` by selecting the "DBFS" option for installing jars

These steps will ensure you are able to import xgboost libraries in python notebooks.

Import the GPU Mortgage Example Notebook
---------------------------

1. See [Managing Notebooks](https://docs.databricks.com/user-guide/notebooks/notebook-manage.html) on how to import a notebook.
2. Import the example notebook: [XGBoost4j-Spark mortgage notebook](../../../../../examples/XGBoost-Examples/mortgage/notebooks/scala/mortgage-gpu.ipynb)
3. Inside the mortgage example notebook, update the data paths:
   * from "/data/datasets/mortgage-small/train" to "dbfs:/FileStore/tables/mortgage/csv/train/mortgage_train_merged.csv"
   * from "/data/datasets/mortgage-small/eval" to "dbfs:/FileStore/tables/mortgage/csv/test/mortgage_eval_merged.csv"

The example notebook comes with the following configuration, which you can adjust according to your setup.
See supported configuration options here: [xgboost parameters](../../../../../examples/XGBoost-Examples/app-parameters/supported_xgboost_parameters_python.md)

``` python
params = { 
    'eta': 0.1,
    'gamma': 0.1,
    'missing': 0.0,
    'treeMethod': 'gpu_hist',
    'maxDepth': 10, 
    'maxLeaves': 256,
    'growPolicy': 'depthwise',
    'minChildWeight': 30.0,
    'lambda_': 1.0,
    'scalePosWeight': 2.0,
    'subsample': 1.0,
    'nthread': 1,
    'numRound': 100,
    'numWorkers': 1,
}
```

4. Run all the cells in the notebook.

5. View the results
In cells 5 (Training), 7 (Transforming) and 8 (Accuracy of Evaluation) you will see output like the following.

```
--------------
==> Benchmark: 
Training takes 6.48 seconds
--------------

--------------
==> Benchmark: Transformation takes 3.2 seconds

--------------

------Accuracy of Evaluation------
Accuracy is 0.9980699597729774

```

Limitations
-------------

1. When selecting GPU nodes, the Databricks UI requires the driver node to be a GPU node. However, you
   can use the Databricks API to create a cluster with a CPU driver node.
   Outside of Databricks, the plugin can operate with a CPU driver node and GPU worker nodes.
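
   A minimal sketch of such an API request body (field values are illustrative; see the Databricks Clusters API documentation for the full schema):

   ```
   {
     "cluster_name": "gpu-workers-cpu-driver",
     "spark_version": "<runtime version>",
     "driver_node_type_id": "<CPU node type>",
     "node_type_id": "<GPU node type>",
     "num_workers": 2
   }
   ```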

2. Cannot spin off multiple executors on a multi-GPU node. 

   Even though it is possible to set `spark.executor.resource.gpu.amount=1` in the Spark
   Configuration tab, Databricks overrides this to `spark.executor.resource.gpu.amount=N`
   (where N is the number of GPUs per node). This will result in failed executors when starting the
   cluster.

3. Parquet rebase mode is set to "LEGACY" by default.

   The following Spark configurations are set to `LEGACY` by default on Databricks:
   
   ```
   spark.sql.legacy.parquet.datetimeRebaseModeInWrite
   spark.sql.legacy.parquet.int96RebaseModeInWrite
   ```
   
   These settings will cause a CPU fallback for Parquet writes involving dates and timestamps.
   If you do not need `LEGACY` write semantics, set these configs to `EXCEPTION` which is
   the default value in Apache Spark 3.0 and higher.
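
   For example, in the cluster's Spark config:

   ```
   spark.sql.legacy.parquet.datetimeRebaseModeInWrite EXCEPTION
   spark.sql.legacy.parquet.int96RebaseModeInWrite EXCEPTION
   ```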

4. Databricks makes changes to the runtime without notification.

    Databricks makes changes to existing runtimes, applying patches, without notification.
    [Issue-3098](https://github.com/NVIDIA/spark-rapids/issues/3098) is one example of this.  We run
    regular integration tests on the Databricks environment to catch these issues and fix them once
    detected.

================================================
FILE: docs/get-started/xgboost-examples/csp/databricks/init.sh
================================================
#!/bin/bash
# Copyright (c) 2025-2026, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

sudo rm -f /databricks/jars/spark--maven-trees--ml--10.x--xgboost-gpu--ml.dmlc--xgboost4j-gpu_2.12--ml.dmlc__xgboost4j-gpu_2.12__1.5.2.jar
sudo rm -f /databricks/jars/spark--maven-trees--ml--10.x--xgboost-gpu--ml.dmlc--xgboost4j-spark-gpu_2.12--ml.dmlc__xgboost4j-spark-gpu_2.12__1.5.2.jar

sudo wget -O /databricks/jars/rapids-4-spark_2.12-26.02.0.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/26.02.0/rapids-4-spark_2.12-26.02.0.jar
sudo wget -O /databricks/jars/xgboost4j-gpu_2.12-1.7.1.jar https://repo1.maven.org/maven2/ml/dmlc/xgboost4j-gpu_2.12/1.7.1/xgboost4j-gpu_2.12-1.7.1.jar
sudo wget -O /databricks/jars/xgboost4j-spark-gpu_2.12-1.7.1.jar https://repo1.maven.org/maven2/ml/dmlc/xgboost4j-spark-gpu_2.12/1.7.1/xgboost4j-spark-gpu_2.12-1.7.1.jar
ls -ltr

mkdir -p /dbfs/FileStore/tables/
cd /dbfs/FileStore/tables/
# Note that this is just a dummy dataset for a quick hands-on run; refer to the instructions to download the full dataset:
# https://github.com/NVIDIA/spark-rapids-examples/blob/main/docs/get-started/xgboost-examples/dataset/mortgage.md
wget -O mortgage.zip https://rapidsai-data.s3.us-east-2.amazonaws.com/spark/mortgage.zip
ls
unzip -o mortgage.zip
pwd
ls -ltr mortgage/csv/*

================================================
FILE: docs/get-started/xgboost-examples/csp/dataproc/gcp.md
================================================
# Getting started with PySpark + XGBoost and the RAPIDS Accelerator on GCP Dataproc
 [Google Cloud Dataproc](https://cloud.google.com/dataproc) is Google Cloud's fully managed Apache
 Spark and Hadoop service. Please make sure to install the gcloud CLI by following
 this [guide](https://cloud.google.com/sdk/docs/install) before getting started.
 
## Create a Dataproc Cluster using T4s
* One 16-core master node and two 32-core worker nodes
* Two NVIDIA T4 GPUs on each worker node

```bash
export REGION=[Your Preferred GCP Region]
export GCS_BUCKET=[Your GCS Bucket]
export CLUSTER_NAME=[Your Cluster Name]
export NUM_GPUS=2
export NUM_WORKERS=2

gcloud dataproc clusters create $CLUSTER_NAME  \
    --region=$REGION \
    --image-version=2.0-ubuntu18 \
    --master-machine-type=n2-standard-16 \
    --num-workers=$NUM_WORKERS \
    --worker-accelerator=type=nvidia-tesla-t4,count=$NUM_GPUS \
    --worker-machine-type=n1-highmem-32 \
    --num-worker-local-ssds=4 \
    --initialization-actions=gs://goog-dataproc-initialization-actions-${REGION}/spark-rapids/spark-rapids.sh \
    --optional-components=JUPYTER,ZEPPELIN \
    --metadata=rapids-runtime=SPARK \
    --bucket=$GCS_BUCKET \
    --enable-component-gateway \
    --subnet=default
```

Explanation of parameters:
* NUM_GPUS = number of GPUs to attach to each worker node in the cluster
* NUM_WORKERS = number of Spark worker nodes in the cluster

This takes around 10-15 minutes to complete.  You can navigate to the Dataproc clusters tab in the
Google Cloud Console to see the progress.

![Dataproc Cluster](../../../../img/GCP/dataproc-cluster.png)

If you'd like to further accelerate init time to 4-5 minutes, create a custom Dataproc image using
[this](#build-custom-dataproc-image-to-accelerate-cluster-init-time) guide.


## Get Application Files, Jar and Dataset

Bash into the master node and make sure you have prepared the necessary packages and dataset by following this [guide](../../prepare-package-data/preparation-python.md).

Note: there is no Maven CLI on the master node, so we need to install it manually:
``` bash
gcloud compute ssh your-name@your-cluster-m --zone your-zone
sudo apt-get install maven -y
```

Then create a directory in HDFS and copy the dataset into it:

``` bash
[xgboost4j_spark_python]$ hadoop fs -mkdir /tmp/xgboost4j_spark_python
[xgboost4j_spark_python]$ hadoop fs -copyFromLocal ${SPARK_XGBOOST_DIR}/mortgage/* /tmp/xgboost4j_spark_python
```

## Preparing libraries
Please make sure to install the xgboost, cudf-cu11, numpy and scikit-learn libraries on all nodes before running the XGBoost application.
``` bash
pip install xgboost
pip install cudf-cu11 --extra-index-url=https://pypi.nvidia.com
pip install numpy
pip install scikit-learn
```
You can also create an isolated python environment by using [Virtualenv](https://virtualenv.pypa.io/en/latest/),
and then directly pass/unpack the archive file and enable the environment on executors
by leveraging the --archives option or spark.archives configuration.
``` bash
# create an isolated python environment and install libraries
python -m venv pyspark_venv
source pyspark_venv/bin/activate
pip install xgboost
pip install cudf-cu11 --extra-index-url=https://pypi.nvidia.com
pip install numpy
pip install scikit-learn
pip install venv-pack
venv-pack -o pyspark_venv.tar.gz

# enable archive python environment on executors
export PYSPARK_DRIVER_PYTHON=python # Do not set in cluster modes.
export PYSPARK_PYTHON=./environment/bin/python
spark-submit --archives pyspark_venv.tar.gz#environment app.py
```
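
Equivalently, the `spark.archives` configuration mentioned above replaces the `--archives` flag (available since Spark 3.1):

```
spark.archives pyspark_venv.tar.gz#environment
```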
## Run jupyter notebooks on Dataproc 

Bash into the master node and start up the notebook.
```
jupyter notebook --ip=0.0.0.0 --port=8124 --no-browser
```

If you want to access the notebook remotely from a local machine, please reserve an external static IP address first:
1. Access the IP addresses page through the navigation menu: `VPC network` -> `IP addresses`
![dataproc img2](../../../../img/GCP/dataproc-img2.png)
2. Click the `RESERVE EXTERNAL STATIC ADDRESS` button
![dataproc img3](../../../../img/GCP/dataproc-img3.png)
3. Attach the static address to the master node of your cluster
![dataproc img4](../../../../img/GCP/dataproc-img4.png)
4. Then you can access and run the notebook from a local browser using the reserved address.  
![dataproc img5](../../../../img/GCP/dataproc-img5.png)
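
With the address attached, the notebook started above is reachable at (the port matches the launch command):

```
http://<your-reserved-ip>:8124
```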

Then you can run the [notebook](../../../../../examples/XGBoost-Examples/mortgage/notebooks/python/mortgage-gpu.ipynb) and get the benchmark results.
![dataproc img6](../../../../img/GCP/dataproc-img6.png)

## Build custom dataproc image to accelerate cluster init time
In order to accelerate cluster init time to 3-4 minutes, we need to build a custom Dataproc image
that already has NVIDIA drivers and the CUDA toolkit installed, with RAPIDS deployed. The custom
image can also be used in an air-gapped environment. In this section, we will use [these instructions
from GCP](https://cloud.google.com/dataproc/docs/guides/dataproc-images) to create a custom image.

Download the [spark-rapids.sh](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/spark-rapids)
script to use as the customization script for the Dataproc image.
Google provides a `generate_custom_image.py` script that:
- Launches a temporary Compute Engine VM instance with the specified Dataproc base image.
- Then runs the customization script inside the VM instance to install custom packages and/or
update configurations.
- After the customization script finishes, it shuts down the VM instance and creates a Dataproc
  custom image from the disk of the VM instance.
- The temporary VM is then deleted, and the custom image is saved for use when creating Dataproc clusters.

With `spark-rapids.sh` downloaded, run Google's `generate_custom_image.py` script as shown
below. This step may take 20-25 minutes to complete.

```bash
git clone https://github.com/GoogleCloudDataproc/custom-images
cd custom-images

export CUSTOMIZATION_SCRIPT=/path/to/spark-rapids.sh
export ZONE=[Your Preferred GCP Zone]
export GCS_BUCKET=[Your GCS Bucket]
export IMAGE_NAME=sample-20-ubuntu18-gpu-t4
export DATAPROC_VERSION=2.0-ubuntu18
export GPU_NAME=nvidia-tesla-t4
export GPU_COUNT=1

python generate_custom_image.py \
    --image-name $IMAGE_NAME \
    --dataproc-version $DATAPROC_VERSION \
    --customization-script $CUSTOMIZATION_SCRIPT \
    --no-smoke-test \
    --zone $ZONE \
    --gcs-bucket $GCS_BUCKET \
    --machine-type n1-standard-4 \
    --accelerator type=$GPU_NAME,count=$GPU_COUNT \
    --disk-size 200 \
    --subnet default 
```

See [here](https://cloud.google.com/dataproc/docs/guides/dataproc-images#running_the_code) for more
details on `generate_custom_image.py` script arguments and
[here](https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-versions) for dataproc
version description.

The image `sample-20-ubuntu18-gpu-t4` is now ready and can be viewed in the GCP console under
`Compute Engine > Storage > Images`. The next step is to launch the cluster using this new image
and new initialization actions (that do not install NVIDIA drivers since we are already past that
step).

Move the new initialization actions to your own bucket, then launch the cluster:

```bash 
export REGION=[Your Preferred GCP Region]
export GCS_BUCKET=[Your GCS Bucket]
export CLUSTER_NAME=[Your Cluster Name]
export NUM_GPUS=1
export NUM_WORKERS=2

gcloud dataproc clusters create $CLUSTER_NAME  \
    --region=$REGION \
    --image=sample-20-ubuntu18-gpu-t4 \
    --master-machine-type=n1-standard-4 \
    --num-workers=$NUM_WORKERS \
    --worker-accelerator=type=nvidia-tesla-t4,count=$NUM_GPUS \
    --worker-machine-type=n1-standard-4 \
    --num-worker-local-ssds=1 \
    --optional-components=JUPYTER,ZEPPELIN \
    --metadata=rapids-runtime=SPARK \
    --bucket=$GCS_BUCKET \
    --enable-component-gateway \
    --subnet=default 
```

The new cluster should be up and running within 3-4 minutes!



================================================
FILE: docs/get-started/xgboost-examples/dataset/mortgage.md
================================================
# How to download the Mortgage dataset



## Steps to download the data

1. Go to the [Fannie Mae](https://capitalmarkets.fanniemae.com/credit-risk-transfer/single-family-credit-risk-transfer/fannie-mae-single-family-loan-performance-data) website
2. Click on [Single-Family Loan Performance Data](https://datadynamics.fanniemae.com/data-dynamics/?&_ga=2.181456292.2043790680.1657122341-289272350.1655822609#/reportMenu;category=HP)
    * Register as a new user if you are using the website for the first time
    * Use the credentials to login
3. Select [HP](https://datadynamics.fanniemae.com/data-dynamics/#/reportMenu;category=HP)
4. Click on  **Download Data** and choose *Single-Family Loan Performance Data*
5. You will find a tabular list of 'Acquisition and Performance' files sorted by year and quarter. Click on a file to download it, e.g. `2017Q1.zip`
6. Unzip the downloaded file to extract the csv file, e.g. `2017Q1.csv`
7. Copy only the csv files to a new folder for the ETL to read

## Notes
1. Refer to the [Loan Performance Data Tutorial](https://capitalmarkets.fanniemae.com/media/9066/display) for more details. 
2. Note that *Single-Family Loan Performance Data* has 2 components. However, the Mortgage ETL requires only the first one (the primary dataset)
    * Primary Dataset:  Acquisition and Performance Files
    * HARP Dataset
3. Use the [Resources](https://datadynamics.fanniemae.com/data-dynamics/#/resources/HP) section to know more about the dataset

================================================
FILE: docs/get-started/xgboost-examples/notebook/python-notebook.md
================================================
Get Started with pyspark+XGBoost with Jupyter Notebook
===================================================================

This is a getting started guide to XGBoost4J-Spark using a [Jupyter notebook](https://jupyter.org/). 
At the end of this guide, you will be able to run a sample notebook that runs on NVIDIA GPUs.

Before you begin, please ensure that you have set up a Spark cluster (Standalone or YARN).
Change the `--master` config according to your cluster architecture; for example, set `--master yarn` for Spark on YARN.

It is assumed that the `SPARK_MASTER` and `SPARK_HOME` environment variables are defined and point to the Spark Master URL (e.g. `spark://localhost:7077`),
and the home directory for Apache Spark respectively.

1. Make sure you have [Jupyter notebook installed](https://jupyter.org/install.html).

   If you install it with conda, please make sure your Python version is consistent.

2. Prepare packages and dataset.

    Make sure you have prepared the necessary packages and dataset by following this [guide](../prepare-package-data/preparation-python.md)

3. Launch the notebook:

   Note: For ETL jobs, Set `spark.task.resource.gpu.amount` to `1/spark.executor.cores`.

    For ETL:

    ``` bash
    PYSPARK_DRIVER_PYTHON=jupyter       \
    PYSPARK_DRIVER_PYTHON_OPTS=notebook \
    pyspark                             \
    --master ${SPARK_MASTER}            \
    --jars ${RAPIDS_JAR}\
    --py-files ${SAMPLE_ZIP}      \
    --conf spark.plugins=com.nvidia.spark.SQLPlugin \
    --conf spark.executor.resource.gpu.amount=1 \
    --conf spark.executor.cores=10 \
    --conf spark.task.resource.gpu.amount=0.1 \
    --conf spark.sql.cache.serializer=com.nvidia.spark.ParquetCachedBatchSerializer \
    --conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \
    --files $SPARK_HOME/examples/src/main/scripts/getGpusResources.sh
    ```

    For XGBoost:

    ``` bash
    PYSPARK_DRIVER_PYTHON=jupyter       \
    PYSPARK_DRIVER_PYTHON_OPTS=notebook \
    pyspark                             \
    --master ${SPARK_MASTER}            \
    --jars ${RAPIDS_JAR}\
    --py-files ${SAMPLE_ZIP}      \
    --conf spark.plugins=com.nvidia.spark.SQLPlugin \
    --conf spark.rapids.memory.gpu.pool=NONE \
    --conf spark.executor.resource.gpu.amount=1 \
    --conf spark.executor.cores=10 \
    --conf spark.task.resource.gpu.amount=1 \
    --conf spark.sql.execution.arrow.maxRecordsPerBatch=200000 \
    --conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \
    --files $SPARK_HOME/examples/src/main/scripts/getGpusResources.sh
    ```

4. Launch ETL Part 

- Mortgage ETL Notebook: [Python](../../../../examples/XGBoost-Examples/mortgage/notebooks/python/MortgageETL.ipynb)
- Taxi ETL Notebook: [Python](../../../../examples/XGBoost-Examples/taxi/notebooks/python/taxi-ETL.ipynb)
- Note: Agaricus does not have an ETL part.


================================================
FILE: docs/get-started/xgboost-examples/notebook/spylon.md
================================================
Get Started with XGBoost4J-Spark with Spylon Kernel Jupyter Notebook
===================================================================

This is a getting started guide to XGBoost4J-Spark using a [Spylon Kernel](https://pypi.org/project/spylon-kernel/) Jupyter notebook. 
At the end of this guide, the reader will be able to run a sample notebook that runs on NVIDIA GPUs.

Before you begin, please ensure that you have set up 
a [Spark Standalone Cluster](/docs/get-started/xgboost-examples/on-prem-cluster/standalone-scala.md).

It is assumed that the `SPARK_MASTER` and `SPARK_HOME` environment variables are defined and point to the Spark Master URL, 
and the home directory for Apache Spark respectively.

1. Install Jupyter Notebook with spylon-kernel.
   ``` bash
   # Install notebook and spylon-kernel (a Scala kernel for Jupyter Notebook), https://pypi.org/project/spylon-kernel/
   # Use spylon-kernel when you want to work with Spark in Scala with a bit of Python mixed in.
   pip3 install jupyter notebook spylon-kernel
   python -m spylon_kernel install
   # The latest ipykernel breaks nbconvert: https://github.com/ipython/ipykernel/issues/422
   pip3 install ipykernel==5.1.1
   ```
2. Start Jupyter Notebook. 
<!-- markdown-link-check-disable -->
You can access the web UI at http://your_ip:your_port with your password.
<!-- markdown-link-check-enable -->    
    ``` bash
    export JUPYTER_CONFIG_FILE=~/.jupyter/jupyter_notebook_config.py
    
    rm -rf `dirname $JUPYTER_CONFIG_FILE` && mkdir -p `dirname $JUPYTER_CONFIG_FILE` && echo """
    c.NotebookApp.ip='*'
    c.NotebookApp.password = 'your_hashed_password'
    c.NotebookApp.open_browser = False
    c.NotebookApp.port = your_port
    """ > $JUPYTER_CONFIG_FILE
 
    jupyter notebook --allow-root --notebook-dir=$WORKSPACE --config=$JUPYTER_CONFIG_FILE &
    ```
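
    The `your_hashed_password` value can be generated with `notebook.auth.passwd()` (or the `jupyter notebook password` command). The classic sha1 scheme it produces is roughly the following stdlib sketch; prefer the real helper for actual deployments:

    ``` python
    import hashlib
    import secrets

    def notebook_passwd(passphrase: str) -> str:
        # Classic Jupyter Notebook password format: 'sha1:<salt>:<digest>'
        # where digest = sha1(passphrase + salt). Sketch only.
        salt = secrets.token_hex(6)  # 12 hex characters
        digest = hashlib.sha1((passphrase + salt).encode("utf-8")).hexdigest()
        return ":".join(("sha1", salt, digest))

    print(notebook_passwd("your_password"))
    ```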
3. Prepare packages and dataset.

    Make sure you have prepared the necessary packages and dataset by following this [guide](/docs/get-started/xgboost-examples/prepare-package-data/preparation-scala.md)

4. Run scala notebook (e.g. [mortgage-gpu.ipynb](../../../../examples/XGBoost-Examples/mortgage/notebooks/scala/mortgage-gpu.ipynb))

    ``` bash
    # Suppose your Scala file is $WORKSPACE/mortgage-gpu.ipynb
 
    jupyter nbconvert --to notebook --stdout --execute $WORKSPACE/mortgage-gpu.ipynb
     
    # -------you will see output looks like ----------------
    # { 
    #   "cells": [
    #   {
    #    "cell_type": "code",
    #    "execution_count": 1,
    #    "id": "5ca1ae16",
    #    "metadata": {
    #     ........
    #     ........
    #     ........
    #   "language_info": {
    #    "codemirror_mode": "text/x-scala",
    #    "file_extension": ".scala",
    #    "help_links": [
    #     {
    #      "text": "MetaKernel Magics",
    #      "url": "https://metakernel.readthedocs.io/en/latest/source/README.html"
    #     }
    #    ],
    #    "mimetype": "text/x-scala",
    #    "name": "scala",
    #    "pygments_lexer": "scala",
    #    "version": "0.4.1"
    #   }
    #  },
    #  "nbformat": 4,
    #  "nbformat_minor": 5
    # }
    ```
    You can also run a Python notebook with the Spylon kernel
    ``` bash
    # restart Jupyter Notebook
  
    export PYSPARK_DRIVER_PYTHON=jupyter
    export PYSPARK_DRIVER_PYTHON_OPTS="notebook --allow-root --notebook-dir=$WORKSPACE --config=$JUPYTER_CONFIG_FILE"
    pyspark &
     
    # Suppose your python file is $WORKSPACE/mortgage-gpu.ipynb
    jupyter nbconvert --to notebook --stdout --execute $WORKSPACE/mortgage-gpu.ipynb
    ```
   
5. Launch ETL Part 
- Mortgage ETL Notebook: [Scala](../../../../examples/XGBoost-Examples/mortgage/notebooks/scala/mortgage-ETL.ipynb) or
  [Python](../../../../examples/XGBoost-Examples/mortgage/notebooks/python/MortgageETL.ipynb)
- Taxi ETL Notebook: [Scala](../../../../examples/XGBoost-Examples/taxi/notebooks/scala/taxi-ETL.ipynb) or
  [Python](../../../../examples/XGBoost-Examples/taxi/notebooks/python/taxi-ETL.ipynb)
- Note: Agaricus does not have an ETL part.
   
6. Launch XGBoost Part
- Mortgage XGBoost Notebook: [Scala](../../../../examples/XGBoost-Examples/mortgage/notebooks/scala/mortgage-gpu.ipynb) 
- Taxi XGBoost Notebook: [Scala](../../../../examples/XGBoost-Examples/taxi/notebooks/scala/taxi-gpu.ipynb)
- Agaricus XGBoost Notebook: [Scala](../../../../examples/XGBoost-Examples/agaricus/notebooks/scala/agaricus-gpu.ipynb) 

================================================
FILE: docs/get-started/xgboost-examples/notebook/toree.md
================================================
Get Started with XGBoost4J-Spark with Apache Toree Jupyter Notebook
===================================================================

This is a getting started guide to XGBoost4J-Spark using an [Apache Toree](https://toree.apache.org/) Jupyter notebook. 
At the end of this guide, you will be able to run a sample notebook that runs on NVIDIA GPUs.

Before you begin, please ensure that you have set up a Spark cluster (Standalone or YARN).
Change the `--master` config according to your cluster architecture; for example, set `--master yarn` for Spark on YARN.

It is assumed that the `SPARK_MASTER` and `SPARK_HOME` environment variables are defined and point to the Spark Master URL (e.g. `spark://localhost:7077`),
and the home directory for Apache Spark respectively.

1. Make sure you have jupyter notebook and [sbt](https://www.scala-sbt.org/1.x/docs/Installing-sbt-on-Linux.html) installed first.
2. Build Toree locally to support Scala 2.12, and install it.

    ``` bash
    # Download toree
    wget https://github.com/apache/incubator-toree/archive/refs/tags/v0.5.0-incubating-rc4.tar.gz
    tar -xvzf v0.5.0-incubating-rc4.tar.gz
    # Build the Toree pip package.
    cd incubator-toree-0.5.0-incubating-rc4
    make pip-release
    # Install Toree
    pip install dist/toree-pip/toree-0.5.0.tar.gz
    ```
3. Prepare packages and dataset.

    Make sure you have prepared the necessary packages and dataset by following this [guide](/docs/get-started/xgboost-examples/prepare-package-data/preparation-scala.md)

4. Install a new GPU-enabled kernel and launch the notebook

    Note: For ETL jobs, Set `spark.task.resource.gpu.amount` to `1/spark.executor.cores`.

    For ETL:
    ``` bash
    jupyter toree install                                \
    --spark_home=${SPARK_HOME}                             \
    --user                                          \
    --toree_opts='--nosparkcontext'                         \
    --kernel_name="ETL-Spark"                         \
    --spark_opts='--master ${SPARK_MASTER} \
      --jars ${RAPIDS_JAR},${SAMPLE_JAR}       \
      --conf spark.plugins=com.nvidia.spark.SQLPlugin  \
      --conf spark.executor.extraClassPath=${RAPIDS_JAR} \
      --conf spark.executor.cores=10 \
      --conf spark.task.resource.gpu.amount=0.1 \
      --conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \
      --files $SPARK_HOME/examples/src/main/scripts/getGpusResources.sh'
    ```

    For XGBoost:
     ``` bash
    jupyter toree install                                \
    --spark_home=${SPARK_HOME}                             \
    --user                                          \
    --toree_opts='--nosparkcontext'                         \
    --kernel_name="XGBoost-Spark"                         \
    --spark_opts='--master ${SPARK_MASTER} \
      --jars ${RAPIDS_JAR},${SAMPLE_JAR}       \
      --conf spark.plugins=com.nvidia.spark.SQLPlugin  \
      --conf spark.executor.extraClassPath=${RAPIDS_JAR} \
      --conf spark.rapids.memory.gpu.pool=NONE \
      --conf spark.executor.resource.gpu.amount=1 \
      --conf spark.executor.cores=10 \
      --conf spark.task.resource.gpu.amount=1 \
      --conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \
      --files $SPARK_HOME/examples/src/main/scripts/getGpusResources.sh'
    ```

    Launch the notebook:

    ``` bash
    jupyter notebook
    ```

5. Launch ETL Part 
- Mortgage ETL Notebook: [Scala](../../../../examples/XGBoost-Examples/mortgage/notebooks/scala/mortgage-ETL.ipynb)
- Taxi ETL Notebook: [Scala](../../../../examples/XGBoost-Examples/taxi/notebooks/scala/taxi-ETL.ipynb)
- Note: Agaricus does not have an ETL part.
   
6. Launch XGBoost Part
- Mortgage XGBoost Notebook: [Scala](../../../../examples/XGBoost-Examples/mortgage/notebooks/scala/mortgage-gpu.ipynb)
- Taxi XGBoost Notebook: [Scala](../../../../examples/XGBoost-Examples/taxi/notebooks/scala/taxi-gpu.ipynb)
- Agaricus XGBoost Notebook: [Scala](../../../../examples/XGBoost-Examples/agaricus/notebooks/scala/agaricus-gpu.ipynb)

================================================
FILE: docs/get-started/xgboost-examples/on-prem-cluster/kubernetes-scala.md
================================================
Get Started with XGBoost4J-Spark on Kubernetes
==============================================
This is a getting started guide to deploying the XGBoost4J-Spark package on a Kubernetes cluster. At the end of this guide,
the reader will be able to run a sample Apache Spark XGBoost application on an NVIDIA GPU Kubernetes cluster.

Prerequisites
-------------

* Apache Spark 3.2.0+ (e.g.: Spark 3.2.0)
* Hardware Requirements
  * NVIDIA Pascal™ GPU architecture or better
  * Multi-node clusters with homogenous GPU configuration
* Software Requirements
  * Ubuntu 20.04/22.04, CentOS 7, or Rocky Linux 8
  * CUDA 11.0+
  * NVIDIA driver compatible with your CUDA
  * NCCL 2.7.8+
* [Kubernetes cluster with NVIDIA GPUs](https://docs.nvidia.com/datacenter/cloud-native/kubernetes/install-k8s.html)
  * See official [Spark on Kubernetes](https://spark.apache.org/docs/latest/running-on-kubernetes.html#prerequisites) 
    instructions for detailed spark-specific cluster requirements
* kubectl installed and configured in the job submission environment
  * Required for managing jobs and retrieving logs

Build a GPU Spark Docker Image
------------------------------

Build a GPU Docker image with Spark resources in it. This Docker image must be accessible by each node in the Kubernetes cluster.

1. Locate your Spark installation. If you don't have one, you can [download](https://spark.apache.org/downloads.html) it from Apache and unpack it.
2. `export SPARK_HOME=<path to spark>`
3. [Download the Dockerfile](/dockerfile/Dockerfile) into `${SPARK_HOME}`. (Here CUDA 11.0 is used as an example in the Dockerfile;
   you may need to update it for other CUDA versions.)
4. __(OPTIONAL)__ install any additional library jars into the `${SPARK_HOME}/jars` directory.
    * Most public cloud file systems are not natively supported -- pulling data and jar files from S3, GCS, etc. require installing additional libraries.
5. Build and push the docker image.

``` bash
export SPARK_HOME=<path to spark>
export SPARK_DOCKER_IMAGE=<gpu spark docker image repo and name>
export SPARK_DOCKER_TAG=<spark docker image tag>

pushd ${SPARK_HOME}
wget https://github.com/NVIDIA/spark-rapids-examples/raw/branch-25.08/dockerfile/Dockerfile

# Optionally install additional jars into ${SPARK_HOME}/jars/

docker build . -t ${SPARK_DOCKER_IMAGE}:${SPARK_DOCKER_TAG}
docker push ${SPARK_DOCKER_IMAGE}:${SPARK_DOCKER_TAG}
popd
```

Get Jars and Dataset
-------------------------------

Make sure you have prepared the necessary packages and dataset by following this [guide](/docs/get-started/xgboost-examples/prepare-package-data/preparation-scala.md).

Make sure that data and jars are accessible by each node of the Kubernetes cluster 
via [Kubernetes volumes](https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes), 
on cluster filesystems like HDFS, or in [object stores like S3 and GCS](https://spark.apache.org/docs/2.3.0/cloud-integration.html). 
Note that using [application dependencies](https://spark.apache.org/docs/latest/running-on-kubernetes.html#dependency-management) from 
the submission client’s local file system is not yet supported.

#### Note: 
1. Mortgage and Taxi jobs have ETLs to generate the processed data. 
2. For convenience, a subset of the [Taxi](/datasets/) dataset is made available in this repo and can be readily used for launching the XGBoost job. Use [ETL](#etl) to generate larger datasets for training and testing. 
3. Agaricus does not have an ETL process; it is combined with XGBoost since there is just a filter operation.

Save Kubernetes Template Resources
----------------------------------

When using Spark on Kubernetes the driver and executor pods can be launched with pod templates. In the XGBoost4J-Spark use case,
these template yaml files are used to allocate and isolate specific GPUs to each pod. The following is a barebones template file to allocate 1 GPU per pod.

```yaml
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: gpu-example
      resources:
        limits:
          nvidia.com/gpu: 1
```

This 1 GPU template file should be sufficient for all XGBoost jobs because each executor should only run 1 task on a single GPU.
Save this yaml file on the machine you are submitting jobs from; 
you will need to provide its path as an argument in your spark-submit command. 
Without the template file a pod will see every GPU on the cluster node it is allocated on and can attempt
to execute using a GPU which is already in use -- causing undefined behavior and errors.
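One way to save the template and sanity-check it (the `/tmp` path here is only an example; any local path works as long as you pass the same path to spark-submit):

``` bash
# write the barebones 1-GPU template shown above to a local file
cat > /tmp/gpu_executor_template.yaml <<'EOF'
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: gpu-example
      resources:
        limits:
          nvidia.com/gpu: 1
EOF

# sanity check: the template should request exactly one GPU
grep -c 'nvidia.com/gpu: 1' /tmp/gpu_executor_template.yaml   # prints 1
```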

<span id="etl">Launch Mortgage or Taxi ETL Part</span>
---------------------------
Use the ETL app to process the raw Mortgage data. You can either split this ETLed data into training and evaluation sets, or run the ETL on different subsets of the dataset to produce separate training and evaluation datasets. 

Note: For ETL jobs, set `spark.task.resource.gpu.amount` to `1/spark.executor.cores`.
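As a quick arithmetic sketch (matching the `spark.executor.cores=10` used in the command below), the task GPU fraction is simply the reciprocal of the executor core count:

``` bash
# compute spark.task.resource.gpu.amount from the executor core count
EXECUTOR_CORES=10
TASK_GPU_AMOUNT=$(awk "BEGIN { print 1 / ${EXECUTOR_CORES} }")
echo "${TASK_GPU_AMOUNT}"   # 0.1
```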

Run spark-submit:

``` bash
${SPARK_HOME}/bin/spark-submit \
   --conf spark.plugins=com.nvidia.spark.SQLPlugin \
   --conf spark.executor.resource.gpu.amount=1 \
   --conf spark.executor.cores=10 \
   --conf spark.task.resource.gpu.amount=0.1 \
   --conf spark.rapids.sql.incompatibleDateFormats.enabled=true \
   --conf spark.rapids.sql.csv.read.double.enabled=true \
   --conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \
   --conf spark.sql.cache.serializer=com.nvidia.spark.ParquetCachedBatchSerializer \
   --files $SPARK_HOME/examples/src/main/scripts/getGpusResources.sh \
   --jars ${RAPIDS_JAR}                                           \
   --master <k8s://ip:port or k8s://URL>                                                                  \
   --deploy-mode ${SPARK_DEPLOY_MODE}                                             \
   --num-executors ${SPARK_NUM_EXECUTORS}                                         \
   --driver-memory ${SPARK_DRIVER_MEMORY}                                         \
   --executor-memory ${SPARK_EXECUTOR_MEMORY}                                     \
   --class com.nvidia.spark.examples.mortgage.ETLMain  \
   $SAMPLE_JAR \
   -format=csv \
   -dataPath="data::${SPARK_XGBOOST_DIR}/mortgage/input/" \
   -dataPath="out::${SPARK_XGBOOST_DIR}/mortgage/output/train/" \
   -dataPath="tmp::${SPARK_XGBOOST_DIR}/mortgage/output/tmp/"

# if generating eval data, change the data path to eval
# -dataPath="data::${SPARK_XGBOOST_DIR}/mortgage/input/"
# -dataPath="out::${SPARK_XGBOOST_DIR}/mortgage/output/eval/"
# -dataPath="tmp::${SPARK_XGBOOST_DIR}/mortgage/output/tmp/"
# if running the Taxi ETL benchmark, change the class and data path params to
# --class com.nvidia.spark.examples.taxi.ETLMain
# -dataPath="raw::${SPARK_XGBOOST_DIR}/taxi/your-path"
# -dataPath="out::${SPARK_XGBOOST_DIR}/taxi/your-path"
```

Launch XGBoost Part on GPU
---------------------------

Variables required to run spark-submit command:

``` bash
# Variables dependent on how data was made accessible to each node
# Make sure to include relevant spark-submit configuration arguments
# location where data was saved
export DATA_PATH=<path to data directory> 

# Variables independent of how data was made accessible to each node
# kubernetes master URL, used as the spark master for job submission
export SPARK_MASTER=<k8s://ip:port or k8s://URL>

# local path to the template file saved in the previous step
export TEMPLATE_PATH=${HOME}/gpu_executor_template.yaml

# spark docker image location
export SPARK_DOCKER_IMAGE=<spark docker image repo and name>
export SPARK_DOCKER_TAG=<spark docker image tag>

# kubernetes service account to launch the job with
export K8S_ACCOUNT=<kubernetes service account name>

# spark deploy mode, cluster mode recommended for spark on kubernetes
export SPARK_DEPLOY_MODE=cluster

# run a single executor for this example to limit the number of spark tasks and
# partitions to 1 as currently this number must match the number of input files
export SPARK_NUM_EXECUTORS=1

# spark driver memory
export SPARK_DRIVER_MEMORY=4g

# spark executor memory
export SPARK_EXECUTOR_MEMORY=8g

# example class to use
export EXAMPLE_CLASS=com.nvidia.spark.examples.mortgage.Main
# or change to com.nvidia.spark.examples.taxi.Main to run Taxi Xgboost benchmark
# or change to com.nvidia.spark.examples.agaricus.Main to run Agaricus Xgboost benchmark

# tree construction algorithm
export TREE_METHOD=gpu_hist
```

Run spark-submit:

``` bash
${SPARK_HOME}/bin/spark-submit                                                          \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.memory.gpu.pool=NONE \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.resource.gpu.amount=1 \
  --conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \
  --files $SPARK_HOME/examples/src/main/scripts/getGpusResources.sh \
  --jars ${RAPIDS_JAR}                           \
  --master ${SPARK_MASTER}                                                              \
  --deploy-mode ${SPARK_DEPLOY_MODE}                                                    \
  --class ${EXAMPLE_CLASS}                                                              \
  --conf spark.executor.instances=${SPARK_NUM_EXECUTORS}                                \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=${K8S_ACCOUNT}         \
  --conf spark.kubernetes.container.image=${SPARK_DOCKER_IMAGE}:${SPARK_DOCKER_TAG}     \
  --conf spark.kubernetes.driver.podTemplateFile=${TEMPLATE_PATH}                       \
  --conf spark.kubernetes.executor.podTemplateFile=${TEMPLATE_PATH}                     \
  ${SAMPLE_JAR}                                                                        \
  -dataPath=train::${SPARK_XGBOOST_DIR}/mortgage/output/train/                   \
  -dataPath=trans::${SPARK_XGBOOST_DIR}/mortgage/output/eval/                    \
  -format=parquet                                                                \
  -numWorkers=${SPARK_NUM_EXECUTORS}                                                    \
  -treeMethod=${TREE_METHOD}                                                            \
  -numRound=100                                                                         \
  -maxDepth=8                   
  
   # Please make sure to change the class and data path while running Taxi or Agaricus benchmark                                                       
                                                
```

Retrieve the logs using the driver's pod name, which spark-submit prints to `stdout`:
```
export POD_NAME=<kubernetes pod name>
kubectl logs -f ${POD_NAME}
```

In the driver log, you should see timings* (in seconds) and the accuracy metric (taking Mortgage as an example):
```
--------------
==> Benchmark: Elapsed time for [Mortgage GPU train csv stub Unknown Unknown Unknown]: 30.132s
--------------

--------------
==> Benchmark: Elapsed time for [Mortgage GPU transform csv stub Unknown Unknown Unknown]: 22.352s
--------------

--------------
==> Benchmark: Accuracy for [Mortgage GPU Accuracy csv stub Unknown Unknown Unknown]: 0.9869451418401349
--------------
```

\* Kubernetes logs may not be nicely formatted since `stdout` and `stderr` are not kept separately.

\* The timings in this Getting Started guide are only illustrative. 
Please see our [release announcement](https://medium.com/rapids-ai/nvidia-gpus-and-apache-spark-one-step-closer-2d99e37ac8fd) for official benchmarks.


================================================
FILE: docs/get-started/xgboost-examples/on-prem-cluster/standalone-python.md
================================================
Get Started with XGBoost4J-Spark on an Apache Spark Standalone Cluster
======================================================================
This is a getting started guide to XGBoost4J-Spark on an Apache Spark 3.2+ Standalone Cluster.
At the end of this guide, the user can run a sample Apache Spark Python application that runs on NVIDIA GPUs.

Prerequisites
-------------

* Apache Spark 3.2.0+ (e.g.: Spark 3.2.0)
* Hardware Requirements
  * NVIDIA Pascal™ GPU architecture or better
  * Multi-node clusters with homogeneous GPU configuration
* Software Requirements
  * Ubuntu 20.04, 22.04/CentOS7, Rocky Linux 8
  * CUDA 11.5+
  * NVIDIA driver compatible with your CUDA
  * NCCL 2.7.8+
  * Python 3.8 or 3.9
  * NumPy
  * XGBoost 1.7.0+
  * cudf-cu11  

The number of GPUs in each host dictates the number of Spark executors that can run there.
Additionally, cores per Spark executor and cores per Spark task must match, such that each executor can run 1 task at any given time.

For example, if each host has 4 GPUs, there should be 4 or fewer executors running on each host,
and each executor should run at most 1 task (e.g.: a total of 4 tasks running on 4 GPUs).
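The sizing rule above can be sketched as simple arithmetic (the 4-GPU host is just an example):

``` bash
# one executor per GPU, one task per executor
GPUS_PER_HOST=4
MAX_EXECUTORS_PER_HOST=${GPUS_PER_HOST}
TASKS_PER_EXECUTOR=1
echo $((MAX_EXECUTORS_PER_HOST * TASKS_PER_EXECUTOR))   # 4 concurrent GPU tasks per host
```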

In Spark Standalone mode, the default configuration is for an executor to take up all the cores assigned to each Spark Worker.
In this example, we will limit the number of cores to 1, to match our dataset.
Please see https://spark.apache.org/docs/latest/spark-standalone.html for more documentation regarding Standalone configuration.

We use the `SPARK_HOME` environment variable to point to the Apache Spark installation.
Here are the steps to enable GPU resource discovery for Spark 3.2+.

1. Copy the Spark config file from the template

    ``` bash
    cd ${SPARK_HOME}/conf/
    cp spark-defaults.conf.template spark-defaults.conf
    ```

2. Add the following configs to the file `spark-defaults.conf`.

   The number in the first config should **NOT** be larger than the actual number of the GPUs on current host.
   This example uses 1 as below for one GPU on the host.

    ```bash
    spark.worker.resource.gpu.amount 1
    spark.worker.resource.gpu.discoveryScript ${SPARK_HOME}/examples/src/main/scripts/getGpusResources.sh
    ```
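    You can run the discovery script by hand to confirm it works on the host. The script shipped with Spark calls `nvidia-smi` and prints a JSON resource description; on a one-GPU host the output looks roughly like this (illustrative only):

    ``` bash
    # illustrative output shape of getGpusResources.sh on a one-GPU host
    echo '{"name": "gpu", "addresses": ["0"]}'
    ```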
3. Install the XGBoost, cudf-cu11, NumPy, and scikit-learn libraries on all nodes before running the XGBoost application.

``` bash
pip install xgboost
pip install cudf-cu11 --extra-index-url=https://pypi.nvidia.com
pip install numpy
pip install scikit-learn
```
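A quick, hypothetical sanity check that these libraries resolve on a node (note that scikit-learn imports as `sklearn`, and `python` may be `python3` in your environment):

``` bash
# verify each library can be imported; prints ok/MISSING per package
for pkg in xgboost cudf numpy sklearn; do
    if python3 -c "import ${pkg}" 2>/dev/null; then
        echo "${pkg}: ok"
    else
        echo "${pkg}: MISSING"
    fi
done
```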

Get Application Files, Jar and Dataset
-------------------------------

Make sure you have prepared the necessary packages and dataset by following this [guide](../prepare-package-data/preparation-python.md).


#### Note: 
1. Mortgage and Taxi jobs have ETLs to generate the processed data.
2. For convenience, a subset of the [Taxi](/datasets/) dataset is made available in this repo and can be used directly to launch the XGBoost job. Use [ETL](standalone-python.md#launch-mortgage-or-taxi-etl-part) to generate larger datasets for training and testing.
3. Agaricus does not have an ETL process; it is combined with the XGBoost job since only a filter operation is needed.


Launch a Standalone Spark Cluster
---------------------------------

1. Copy required jars to `$SPARK_HOME/jars` folder.

    ``` bash
    cp ${RAPIDS_JAR} $SPARK_HOME/jars/
    ```

2. Start the Spark Master process.

    ``` bash
    ${SPARK_HOME}/sbin/start-master.sh
    ```

    Note the hostname or IP address of the Master host so that it can be given to each Worker process; in this example the Master and Worker will run on the same host.

3. Start a Spark slave process.

    ``` bash
    export SPARK_MASTER=spark://`hostname -f`:7077
    export SPARK_CORES_PER_WORKER=1

    ${SPARK_HOME}/sbin/start-slave.sh ${SPARK_MASTER} -c ${SPARK_CORES_PER_WORKER}
    ```

    Note that in this example the Master and Worker processes are both running on the same host. This is not a requirement, as long as all hosts that are used to run the Spark app have access to the dataset.

Launch Mortgage or Taxi ETL Part
---------------------------
Use the ETL app to process the raw Mortgage data. You can either split this ETLed data into training and evaluation sets, or run the ETL on different subsets of the dataset to produce separate training and evaluation datasets.

Note: For ETL jobs, set `spark.task.resource.gpu.amount` to `1/spark.executor.cores`.
### ETL on GPU
``` bash
${SPARK_HOME}/bin/spark-submit \
    --master spark://$HOSTNAME:7077 \
    --executor-memory 32G \
    --conf spark.executor.resource.gpu.amount=1 \
    --conf spark.executor.cores=10 \
    --conf spark.task.resource.gpu.amount=0.1 \
    --conf spark.plugins=com.nvidia.spark.SQLPlugin \
    --conf spark.rapids.sql.incompatibleDateFormats.enabled=true \
    --conf spark.rapids.sql.csv.read.double.enabled=true \
    --conf spark.sql.cache.serializer=com.nvidia.spark.ParquetCachedBatchSerializer \
    --py-files ${SAMPLE_ZIP} \
    main.py \
    --mainClass='com.nvidia.spark.examples.mortgage.etl_main' \
    --format=csv \
    --dataPath="data::${SPARK_XGBOOST_DIR}/mortgage/input/" \
    --dataPath="out::${SPARK_XGBOOST_DIR}/mortgage/output/train/" \
    --dataPath="tmp::${SPARK_XGBOOST_DIR}/mortgage/output/tmp/"

# if generating eval data, change the data path to eval
# --dataPath="data::${SPARK_XGBOOST_DIR}/mortgage/input/"
# --dataPath="out::${SPARK_XGBOOST_DIR}/mortgage/output/eval/"
# --dataPath="tmp::${SPARK_XGBOOST_DIR}/mortgage/output/tmp/"
# if running the Taxi ETL benchmark, change the main class and data path params to
# --mainClass='com.nvidia.spark.examples.taxi.etl_main'
# --dataPath="raw::${SPARK_XGBOOST_DIR}/taxi/your-path"
# --dataPath="out::${SPARK_XGBOOST_DIR}/taxi/your-path"
```
### ETL on CPU
```bash
${SPARK_HOME}/bin/spark-submit \
    --master spark://$HOSTNAME:7077 \
    --executor-memory 32G \
    --conf spark.executor.instances=1 \
    --py-files ${SAMPLE_ZIP} \
    main.py \
    --mainClass='com.nvidia.spark.examples.mortgage.etl_main' \
    --format=csv \
    --dataPath="data::${SPARK_XGBOOST_DIR}/mortgage/input/" \
    --dataPath="out::${SPARK_XGBOOST_DIR}/mortgage/output/train/" \
    --dataPath="tmp::${SPARK_XGBOOST_DIR}/mortgage/output/tmp/"

# if generating eval data, change the data path to eval
# --dataPath="data::${SPARK_XGBOOST_DIR}/mortgage/input/"
# --dataPath="out::${SPARK_XGBOOST_DIR}/mortgage/output/eval/"
# --dataPath="tmp::${SPARK_XGBOOST_DIR}/mortgage/output/tmp/"
# if running the Taxi ETL benchmark, change the main class and data path params to
# --mainClass='com.nvidia.spark.examples.taxi.etl_main'
# --dataPath="raw::${SPARK_XGBOOST_DIR}/taxi/your-path"
# --dataPath="out::${SPARK_XGBOOST_DIR}/taxi/your-path"
```

Launch XGBoost Part on GPU
---------------------------

Variables required to run spark-submit command:

``` bash
# this is the same master host we defined while launching the cluster
export SPARK_MASTER=spark://`hostname -f`:7077

# Currently the number of tasks and executors must match the number of input files.
# For this example, we will set these such that we have 1 executor, with 1 core per executor

## take up the whole worker
export SPARK_CORES_PER_EXECUTOR=${SPARK_CORES_PER_WORKER}

## run 1 executor
export SPARK_NUM_EXECUTORS=1

## cores/executor * num_executors, which in this case is also 1, limits
## the number of cores given to the application
export TOTAL_CORES=$((SPARK_CORES_PER_EXECUTOR * SPARK_NUM_EXECUTORS))

# spark driver memory
export SPARK_DRIVER_MEMORY=4g

# spark executor memory
export SPARK_EXECUTOR_MEMORY=8g

# example class to use
export EXAMPLE_CLASS=com.nvidia.spark.examples.mortgage.main
# or change to com.nvidia.spark.examples.taxi.main to run Taxi Xgboost benchmark
# or change to com.nvidia.spark.examples.agaricus.main to run Agaricus Xgboost benchmark

# tree construction algorithm
export TREE_METHOD=gpu_hist

# if you enable archive python environment
export PYSPARK_DRIVER_PYTHON=python
export PYSPARK_PYTHON=./environment/bin/python
```

Run spark-submit:

``` bash
${SPARK_HOME}/bin/spark-submit                                                  \
 --conf spark.plugins=com.nvidia.spark.SQLPlugin                       \
 --conf spark.rapids.memory.gpu.pool=NONE                     \
 --conf spark.executor.resource.gpu.amount=1                           \
 --conf spark.task.resource.gpu.amount=1                              \
 --master ${SPARK_MASTER}                                                       \
 --driver-memory ${SPARK_DRIVER_MEMORY}                                         \
 --executor-memory ${SPARK_EXECUTOR_MEMORY}                                     \
 --conf spark.cores.max=${TOTAL_CORES}                                          \
 --archives your_pyspark_venv.tar.gz#environment     #if you enabled archive python environment \
 --jars ${RAPIDS_JAR}    \
 --py-files ${SAMPLE_ZIP}                   \
 ${MAIN_PY}                                                     \
 --mainClass=${EXAMPLE_CLASS}                                                   \
 --dataPath=train::${SPARK_XGBOOST_DIR}/mortgage/output/train/      \
 --dataPath=trans::${SPARK_XGBOOST_DIR}/mortgage/output/eval/      \
 --format=parquet                                 \
 --numWorkers=${SPARK_NUM_EXECUTORS}                                            \
 --treeMethod=${TREE_METHOD}                                                    \
 --numRound=100                                                                 \
 --maxDepth=8

 # Change the format to csv if your input file is CSV format.
 # Please make sure to change the class and data path while running Taxi or Agaricus benchmark  
```

In the `stdout` log on the driver side, you should see timings<sup>*</sup> (in seconds) and the accuracy metric:

```
----------------------------------------------------------------------------------------------------
Training takes 14.65 seconds

----------------------------------------------------------------------------------------------------
Transformation takes 12.21 seconds

----------------------------------------------------------------------------------------------------
Accuracy is 0.9873692247091792
```

Launch XGBoost Part on CPU
---------------------------

If you are running this example after running the GPU example above, set these variables
so that both training and testing run exclusively on the CPU:

``` bash
# this is the same master host we defined while launching the cluster
export SPARK_MASTER=spark://`hostname -f`:7077

# Currently the number of tasks and executors must match the number of input files.
# For this example, we will set these such that we have 1 executor, with 1 core per executor

## take up the whole worker
export SPARK_CORES_PER_EXECUTOR=${SPARK_CORES_PER_WORKER}

## run 1 executor
export SPARK_NUM_EXECUTORS=1

## cores/executor * num_executors, which in this case is also 1, limits
## the number of cores given to the application
export TOTAL_CORES=$((SPARK_CORES_PER_EXECUTOR * SPARK_NUM_EXECUTORS))

# spark driver memory
export SPARK_DRIVER_MEMORY=4g

# spark executor memory
export SPARK_EXECUTOR_MEMORY=8g

# example class to use
export EXAMPLE_CLASS=com.nvidia.spark.examples.mortgage.main
# Please make sure to change the class while running Taxi or Agaricus benchmark    

# tree construction algorithm
export TREE_METHOD=hist

# if you enable archive python environment
export PYSPARK_DRIVER_PYTHON=python
export PYSPARK_PYTHON=./environment/bin/python
```

This is nearly the same command as for the GPU example, with the GPU-specific configurations removed:

``` bash
${SPARK_HOME}/bin/spark-submit                                                  \
 --master ${SPARK_MASTER}                                                       \
 --driver-memory ${SPARK_DRIVER_MEMORY}                                         \
 --executor-memory ${SPARK_EXECUTOR_MEMORY}                                     \
 --conf spark.cores.max=${TOTAL_CORES}                                          \
 --archives your_pyspark_venv.tar.gz#environment     #if you enabled archive python environment \
 --jars ${RAPIDS_JAR}     \
 --py-files ${SAMPLE_ZIP}                       \
 ${SPARK_PYTHON_ENTRYPOINT}                                                     \
 --mainClass=${EXAMPLE_CLASS}                                                   \
 --dataPath=train::${SPARK_XGBOOST_DIR}/mortgage/output/train/      \
 --dataPath=trans::${SPARK_XGBOOST_DIR}/mortgage/output/eval/         \
 --format=parquet                                                               \
 --numWorkers=${SPARK_NUM_EXECUTORS}                                            \
 --treeMethod=${TREE_METHOD}                                                    \
 --numRound=100                                                                 \
 --maxDepth=8

 # Change the format to csv if your input file is CSV format.
 # Please make sure to change the class and data path while running Taxi or Agaricus benchmark  
 
```

In the `stdout` log on the driver side, you should see timings<sup>*</sup> (in seconds) and the accuracy metric:

```
----------------------------------------------------------------------------------------------------
Training takes 225.7 seconds

----------------------------------------------------------------------------------------------------
Transformation takes 36.26 seconds

----------------------------------------------------------------------------------------------------
Accuracy is 0.9873709530950067
```
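Using the illustrative training times above and from the GPU run earlier, the ratio works out as follows (illustrative only, not an official benchmark):

``` bash
# CPU train time / GPU train time from the sample logs above
awk 'BEGIN { printf "~%.1fx faster training on GPU\n", 225.7 / 14.65 }'
```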

<sup>*</sup> The timings in this Getting Started guide are only illustrative.
Please see our [release announcement](https://medium.com/rapids-ai/nvidia-gpus-and-apache-spark-one-step-closer-2d99e37ac8fd) for official benchmarks.


================================================
FILE: docs/get-started/xgboost-examples/on-prem-cluster/standalone-scala.md
================================================
Get Started with XGBoost4J-Spark on an Apache Spark Standalone Cluster
======================================================================

This is a getting-started guide to XGBoost on an Apache Spark 3.2+ Standalone Cluster. At the end of this guide,
the user can run a sample Apache Spark application that runs on NVIDIA GPUs.

Prerequisites
-------------

* Apache Spark 3.2.0+ Standalone Cluster (e.g.: Spark 3.2.0)
* Hardware Requirements
  * NVIDIA Pascal™ GPU architecture or better
  * Multi-node clusters with homogeneous GPU configuration
* Software Requirements
  * Ubuntu 20.04, 22.04/CentOS7, Rocky Linux 8
  * CUDA 11.0+
  * NVIDIA driver compatible with your CUDA
  * NCCL 2.7.8+
  
The number of GPUs in each host dictates the number of Spark executors that can run there. Additionally,
cores per Spark executor and cores per Spark task must match, such that each executor can run 1 task at any given time.

For example, if each host has 4 GPUs, there should be 4 or fewer executors running on each host,
and each executor should run at most 1 task (e.g.: a total of 4 tasks running on 4 GPUs).

In Spark Standalone mode, the default configuration is for an executor to take up all the cores assigned to each Spark Worker.
In this example, we will limit the number of cores to 1, to match our dataset.
Please see https://spark.apache.org/docs/latest/spark-standalone.html for more documentation regarding Standalone configuration.

We use the `SPARK_HOME` environment variable to point to the Apache Spark installation.
Here are the steps to enable GPU resource discovery for Spark 3.2+.

1. Copy the Spark config file from the template.

    ``` bash
    cd ${SPARK_HOME}/conf/
    cp spark-defaults.conf.template spark-defaults.conf
    ```

2. Add the following configs to the file `spark-defaults.conf`.
  
    The number in the first config should **NOT** be larger than the actual number of GPUs on the host.
    This example uses 1 as below for one GPU on the host.

    ``` bash
    spark.worker.resource.gpu.amount 1
    spark.worker.resource.gpu.discoveryScript ${SPARK_HOME}/examples/src/main/scripts/getGpusResources.sh
    ```

Get Jars and Dataset
-------------------------------

Make sure you have prepared the necessary packages and dataset 
by following this [guide](/docs/get-started/xgboost-examples/prepare-package-data/preparation-scala.md).

#### Note: 
1. Mortgage and Taxi jobs have ETLs to generate the processed data. 
2. For convenience, a subset of the [Taxi](/datasets/) dataset is made available in this repo and can be used directly to launch the XGBoost job. Use [ETL](#etl) to generate larger datasets for training and testing. 
3. Agaricus does not have an ETL process; it is combined with the XGBoost job since only a filter operation is needed.


Launch a Standalone Spark Cluster
---------------------------------

1. Copy required jars to `$SPARK_HOME/jars` folder.

    ``` bash
    cp $RAPIDS_JAR $SPARK_HOME/jars/
    ```

2. Start the Spark Master process.

    ``` bash
    ${SPARK_HOME}/sbin/start-master.sh
    ```

    Note the hostname or IP address of the Master host so that it can be given to each Worker process;
    in this example the Master and Worker will run on the same host.

3. Start a Spark slave process.

    ``` bash
    export SPARK_MASTER=spark://`hostname -f`:7077
    export SPARK_CORES_PER_WORKER=1

    ${SPARK_HOME}/sbin/start-slave.sh ${SPARK_MASTER} -c ${SPARK_CORES_PER_WORKER} 
    ```

    Note that in this example the Master and Worker processes are both running on the same host. 
    This is not a requirement, as long as all hosts that are used to run the Spark app have access to the dataset.

<span id="etl">Launch Mortgage or Taxi ETL Part</span>
---------------------------

Use the ETL app to process the raw Mortgage data. You can either split this ETLed data into training and evaluation sets, or run the ETL on different subsets of the dataset to produce separate training and evaluation datasets.

Note: For ETL jobs, set `spark.task.resource.gpu.amount` to `1/spark.executor.cores`.

### ETL on GPU 
``` bash
${SPARK_HOME}/bin/spark-submit \
    --master spark://$HOSTNAME:7077 \
    --executor-memory 32G \
    --conf spark.executor.resource.gpu.amount=1 \
    --conf spark.executor.cores=10 \
    --conf spark.task.resource.gpu.amount=0.1 \
    --conf spark.plugins=com.nvidia.spark.SQLPlugin \
    --conf spark.rapids.sql.incompatibleDateFormats.enabled=true \
    --conf spark.rapids.sql.csv.read.double.enabled=true \
    --conf spark.sql.cache.serializer=com.nvidia.spark.ParquetCachedBatchSerializer \
    --class com.nvidia.spark.examples.mortgage.ETLMain  \
    $SAMPLE_JAR \
    -format=csv \
    -dataPath="data::${SPARK_XGBOOST_DIR}/mortgage/input/" \
    -dataPath="out::${SPARK_XGBOOST_DIR}/mortgage/output/train/" \
    -dataPath="tmp::${SPARK_XGBOOST_DIR}/mortgage/output/tmp/"

# if generating eval data, change the data path to eval 
# -dataPath="data::${SPARK_XGBOOST_DIR}/mortgage/input/"
# -dataPath="out::${SPARK_XGBOOST_DIR}/mortgage/output/eval/"
# -dataPath="tmp::${SPARK_XGBOOST_DIR}/mortgage/output/tmp/"
# if running the Taxi ETL benchmark, change the class and data path params to
# --class com.nvidia.spark.examples.taxi.ETLMain
# -dataPath="raw::${SPARK_XGBOOST_DIR}/taxi/your-path"
# -dataPath="out::${SPARK_XGBOOST_DIR}/taxi/your-path"
```

### ETL on CPU

```bash
${SPARK_HOME}/bin/spark-submit \
--master spark://$HOSTNAME:7077 \
--executor-memory 32G \
--conf spark.executor.instances=1 \
--conf spark.sql.broadcastTimeout=700 \
--class com.nvidia.spark.examples.mortgage.ETLMain  \
$SAMPLE_JAR \
-format=csv \
-dataPath="data::${SPARK_XGBOOST_DIR}/mortgage/input/" \
-dataPath="out::${SPARK_XGBOOST_DIR}/mortgage/output/train/" \
-dataPath="tmp::${SPARK_XGBOOST_DIR}/mortgage/output/tmp/"

# if generating eval data, change the data path to eval 
# -dataPath="data::${SPARK_XGBOOST_DIR}/mortgage/input/"
# -dataPath="out::${SPARK_XGBOOST_DIR}/mortgage/output/eval/"
# if running the Taxi ETL benchmark, change the class and data path params to
# --class com.nvidia.spark.examples.taxi.ETLMain
# -dataPath="raw::${SPARK_XGBOOST_DIR}/taxi/your-path"
# -dataPath="out::${SPARK_XGBOOST_DIR}/taxi/your-path"
```

Launch XGBoost Part on GPU
---------------------------

Variables required to run spark-submit command:

``` bash
# this is the same master host we defined while launching the cluster
export SPARK_MASTER=spark://`hostname -f`:7077

# Currently the number of tasks and executors must match the number of input files.
# For this example, we will set these such that we have 1 executor, with 1 core per executor

## take up the whole worker
export SPARK_CORES_PER_EXECUTOR=${SPARK_CORES_PER_WORKER}

## run 1 executor
export SPARK_NUM_EXECUTORS=1

## cores/executor * num_executors, which in this case is also 1, limits
## the number of cores given to the application
export TOTAL_CORES=$((SPARK_CORES_PER_EXECUTOR * SPARK_NUM_EXECUTORS))

# spark driver memory
export SPARK_DRIVER_MEMORY=4g

# spark executor memory
export SPARK_EXECUTOR_MEMORY=8g

# example class to use
export EXAMPLE_CLASS=com.nvidia.spark.examples.mortgage.Main
# or change to com.nvidia.spark.examples.taxi.Main to run Taxi Xgboost benchmark
# or change to com.nvidia.spark.examples.agaricus.Main to run Agaricus Xgboost benchmark

# tree construction algorithm
export TREE_METHOD=gpu_hist
```

Run spark-submit:

``` bash
${SPARK_HOME}/bin/spark-submit                                                  \
 --conf spark.plugins=com.nvidia.spark.SQLPlugin                       \
 --conf spark.rapids.memory.gpu.pool=NONE                     \
 --conf spark.executor.resource.gpu.amount=1                           \
 --conf spark.task.resource.gpu.amount=1                              \
 --master ${SPARK_MASTER}                                                       \
 --driver-memory ${SPARK_DRIVER_MEMORY}                                         \
 --executor-memory ${SPARK_EXECUTOR_MEMORY}                                     \
 --conf spark.cores.max=${TOTAL_CORES}                                          \
 --class ${EXAMPLE_CLASS}                                                       \
 ${SAMPLE_JAR}                                                                 \
 -dataPath=train::${SPARK_XGBOOST_DIR}/mortgage/output/train/      \
 -dataPath=trans::${SPARK_XGBOOST_DIR}/mortgage/output/eval/          \
 -format=parquet                                                                    \
 -numWorkers=${SPARK_NUM_EXECUTORS}                                             \
 -treeMethod=${TREE_METHOD}                                                     \
 -numRound=100                                                                  \
 -maxDepth=8                      
 # Please make sure to change the class and data path while running Taxi or Agaricus benchmark                                              
```

In the `stdout` log on the driver side, you should see timings<sup>*</sup> (in seconds) 
and the accuracy metric (taking Mortgage as an example):

```
--------------
==> Benchmark: Elapsed time for [Mortgage GPU train csv stub Unknown Unknown Unknown]: 26.572s
--------------

--------------
==> Benchmark: Elapsed time for [Mortgage GPU transform csv stub Unknown Unknown Unknown]: 10.323s
--------------

--------------
==> Benchmark: Accuracy for [Mortgage GPU Accuracy csv stub Unknown Unknown Unknown]: 0.9869227318579323
--------------
```

Launch XGBoost Part on CPU
---------------------------

If you are running this example after running the GPU example above, set these variables 
so that both training and testing run exclusively on the CPU:

``` bash
# this is the same master host we defined while launching the cluster
export SPARK_MASTER=spark://`hostname -f`:7077

# Currently the number of tasks and executors must match the number of input files.
# For this example, we will set these such that we have 1 executor, with 1 core per executor

## take up the whole worker
export SPARK_CORES_PER_EXECUTOR=${SPARK_CORES_PER_WORKER}

## run 1 executor
export SPARK_NUM_EXECUTORS=1

## cores/executor * num_executors, which in this case is also 1, limits
## the number of cores given to the application
export TOTAL_CORES=$((SPARK_CORES_PER_EXECUTOR * SPARK_NUM_EXECUTORS))

# spark driver memory
export SPARK_DRIVER_MEMORY=4g

# spark executor memory
export SPARK_EXECUTOR_MEMORY=8g

# example class to use
export EXAMPLE_CLASS=com.nvidia.spark.examples.mortgage.Main
# Please make sure to change the class while running Taxi or Agaricus benchmark     

# tree construction algorithm
export TREE_METHOD=hist
```

This is the same command as for the GPU example, repeated for convenience:

```bash
${SPARK_HOME}/bin/spark-submit                                                  \
 --master ${SPARK_MASTER}                                                       \
 --driver-memory ${SPARK_DRIVER_MEMORY}                                         \
 --executor-memory ${SPARK_EXECUTOR_MEMORY}                                     \
 --conf spark.cores.max=${TOTAL_CORES}                                          \
 --class ${EXAMPLE_CLASS}                                                       \
 ${SAMPLE_JAR}                                                                 \
 -dataPath=train::${SPARK_XGBOOST_DIR}/mortgage/output/train/      \
 -dataPath=trans::${SPARK_XGBOOST_DIR}/mortgage/output/eval/          \
 -format=parquet                                                                    \
 -numWorkers=${SPARK_NUM_EXECUTORS}                                             \
 -treeMethod=${TREE_METHOD}                                                     \
 -numRound=100                                                                  \
 -maxDepth=8

 # Please make sure to change the class and data path when running the Taxi or Agaricus benchmark
```

In the driver's `stdout` log, you should see timings<sup>*</sup> (in seconds) and the accuracy metric (taking Mortgage as an example):

```
--------------
==> Benchmark: Elapsed time for [Mortgage CPU train csv stub Unknown Unknown Unknown]: 305.535s
--------------

--------------
==> Benchmark: Elapsed time for [Mortgage CPU transform csv stub Unknown Unknown Unknown]: 52.867s
--------------

--------------
==> Benchmark: Accuracy for [Mortgage CPU Accuracy csv stub Unknown Unknown Unknown]: 0.9872234894511343
--------------
```

<sup>*</sup> The timings in this Getting Started guide are for illustrative purposes only.
Please see our [release announcement](https://medium.com/rapids-ai/nvidia-gpus-and-apache-spark-one-step-closer-2d99e37ac8fd) 
for official benchmarks.


================================================
FILE: docs/get-started/xgboost-examples/on-prem-cluster/yarn-python.md
================================================
Get Started with XGBoost4J-Spark on Apache Hadoop YARN
======================================================
This is a getting started guide to XGBoost4J-Spark on Apache Hadoop YARN supporting GPU scheduling.
At the end of this guide, the reader will be able to run a sample Apache Spark Python application that runs on NVIDIA GPUs.

Prerequisites
-------------

* Apache Spark 3.2.0+ running on YARN supporting GPU scheduling. (e.g.: Spark 3.2.0, Hadoop-Yarn 3.3.0)
* Hardware Requirements
  * NVIDIA Pascal™ GPU architecture or better
  * Multi-node clusters with homogenous GPU configuration
* Software Requirements
  * Ubuntu 20.04/22.04, CentOS 7, or Rocky Linux 8
  * CUDA 11.5+
  * NVIDIA driver compatible with your CUDA
  * NCCL 2.7.8+
  * Python 3.8 or 3.9
  * NumPy
  * XGBoost 1.7.0+
  * cudf-cu11
  
The number of GPUs per NodeManager dictates the number of Spark executors that can run in that NodeManager. 
Additionally, cores per Spark executor and cores per Spark task must match, such that each executor can run 1 task at any given time.

For example: if each NodeManager has 4 GPUs, there should be 4 or fewer executors running on each NodeManager, 
and each executor should run 1 task (e.g.: A total of 4 tasks running on 4 GPUs). In order to achieve this, 
you may need to adjust `spark.task.cpus` and `spark.executor.cores` to match (both set to 1 by default).

Additionally, we recommend adjusting `executor-memory` to divide host memory evenly amongst the number of GPUs in each NodeManager,
such that Spark will schedule as many executors as there are GPUs in each NodeManager.
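
As a concrete sketch of the sizing rules above (the numbers are illustrative assumptions, not recommendations: a NodeManager with 4 GPUs and 256 GB of host memory):

``` bash
# Illustrative sizing for one NodeManager with 4 GPUs and 256 GB of host memory.
NUM_GPUS_PER_NODE=4
HOST_MEM_GB=256

# One executor per GPU, and one task per executor
# (spark.executor.cores and spark.task.cpus must match).
NUM_EXECUTORS_PER_NODE=${NUM_GPUS_PER_NODE}
EXECUTOR_CORES=1
TASK_CPUS=1

# Divide host memory evenly among the executors.
EXECUTOR_MEM_GB=$((HOST_MEM_GB / NUM_GPUS_PER_NODE))

echo "executors=${NUM_EXECUTORS_PER_NODE} cores=${EXECUTOR_CORES} task.cpus=${TASK_CPUS} memory=${EXECUTOR_MEM_GB}g"
```

In practice, set `executor-memory` somewhat below the even split to leave headroom for YARN container overhead.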

We use the `SPARK_HOME` environment variable to point to the Apache Spark installation.
For how to enable GPU scheduling and isolation on YARN,
please refer to the [Hadoop documentation](https://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/UsingGpus.html).
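
As a minimal sketch of what the linked Hadoop guide describes (property names assumed from the Hadoop 3.x GPU documentation; consult the guide for the full setup, including cgroups isolation and the scheduler resource calculator):

``` xml
<!-- resource-types.xml: declare the GPU resource type -->
<property>
  <name>yarn.resource-types</name>
  <value>yarn.io/gpu</value>
</property>

<!-- yarn-site.xml (on each NodeManager): enable the GPU resource plugin -->
<property>
  <name>yarn.nodemanager.resource-plugins</name>
  <value>yarn.io/gpu</value>
</property>
```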

Please make sure to install the XGBoost, cudf-cu11, numpy, and scikit-learn libraries on all nodes before running an XGBoost application.
``` bash
pip install xgboost
pip install cudf-cu11 --extra-index-url=https://pypi.nvidia.com
pip install numpy
pip install scikit-learn
```
You can also create an isolated Python environment using [Virtualenv](https://virtualenv.pypa.io/en/latest/),
and then pass/unpack the archive file and enable the environment on executors
by leveraging the `--archives` option or the `spark.archives` configuration.
``` bash
# create an isolated python environment and install libraries
python -m venv pyspark_venv
source pyspark_venv/bin/activate
pip install xgboost
pip install cudf-cu11 --extra-index-url=https://pypi.nvidia.com
pip install numpy
pip install scikit-learn
pip install venv-pack
venv-pack -o pyspark_venv.tar.gz

# enable archive python environment on executors
export PYSPARK_DRIVER_PYTHON=python # Do not set in cluster modes.
export PYSPARK_PYTHON=./environment/bin/python
spark-submit --archives pyspark_venv.tar.gz#environment app.py
```

Get Application Files, Jar and Dataset
-------------------------------

Make sure you have prepared the necessary packages and dataset by following this [guide](../prepare-package-data/preparation-python.md)

Then create a directory in HDFS and run the commands below:

``` bash
[xgboost4j_spark_python]$ hadoop fs -mkdir /tmp/xgboost4j_spark_python
[xgboost4j_spark_python]$ hadoop fs -copyFromLocal ${SPARK_XGBOOST_DIR}/mortgage/* /tmp/xgboost4j_spark_python
```

Launch Mortgage or Taxi ETL Part
---------------------------

Use the ETL app to process the raw Mortgage data. You can either split the ETL output into training and evaluation sets, or run the ETL on different subsets of the dataset to produce separate training and evaluation datasets.

Note: for ETL jobs, set `spark.task.resource.gpu.amount` to `1/spark.executor.cores`.
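
For example, with the 10 executor cores used in the command below, the fraction works out as follows (a sketch; awk is used here only for the floating-point division):

``` bash
# spark.task.resource.gpu.amount = 1 / spark.executor.cores
EXECUTOR_CORES=10
GPU_AMOUNT=$(awk -v c="${EXECUTOR_CORES}" 'BEGIN { printf "%g", 1 / c }')
echo "spark.task.resource.gpu.amount=${GPU_AMOUNT}"
```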

``` bash
# location where data was downloaded
export DATA_PATH=hdfs:/tmp/xgboost4j_spark_python/

${SPARK_HOME}/bin/spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --conf spark.executor.cores=10 \
    --conf spark.task.resource.gpu.amount=0.1 \
    --conf spark.rapids.sql.incompatibleDateFormats.enabled=true \
    --conf spark.rapids.sql.csv.read.double.enabled=true \
    --conf spark.sql.cache.serializer=com.nvidia.spark.ParquetCachedBatchSerializer \
    --jars ${RAPIDS_JAR}\
    ${MAIN_PY} \
    --mainClass='com.nvidia.spark.examples.mortgage.etl_main' \
    --format=csv \
    --dataPath="data::${DATA_PATH}/mortgage/data/mortgage/input/" \
    --dataPath="out::${DATA_PATH}/mortgage/data/mortgage/output/train/" \
    --dataPath="tmp::${SPARK_XGBOOST_DIR}/mortgage/output/tmp/"

# if generating eval data, change the data path to eval
# --dataPath="data::${SPARK_XGBOOST_DIR}/mortgage/input/"
# --dataPath="out::${SPARK_XGBOOST_DIR}/mortgage/output/eval/"
# --dataPath="tmp::${SPARK_XGBOOST_DIR}/mortgage/output/tmp/"
# if running the Taxi ETL benchmark, change the mainClass and data path params to
# --mainClass='com.nvidia.spark.examples.taxi.etl_main'
# --dataPath="raw::${SPARK_XGBOOST_DIR}/taxi/your-path"
# --dataPath="out::${SPARK_XGBOOST_DIR}/taxi/your-path"
```

Launch XGBoost Part on GPU
---------------------------

Variables required to run spark-submit command:

``` bash
# location where data was downloaded
export DATA_PATH=hdfs:/tmp/xgboost4j_spark_python

# spark deploy mode (see Apache Spark documentation for more information)
export SPARK_DEPLOY_MODE=cluster

# run a single executor for this example to limit the number of spark tasks and
# partitions to 1 as currently this number must match the number of input files
export SPARK_NUM_EXECUTORS=1

# spark driver memory
export SPARK_DRIVER_MEMORY=4g

# spark executor memory
export SPARK_EXECUTOR_MEMORY=8g

# python entrypoint
export SPARK_PYTHON_ENTRYPOINT=${LIBS_PATH}/main.py

# example class to use
export EXAMPLE_CLASS=com.nvidia.spark.examples.mortgage.main
# or change to com.nvidia.spark.examples.taxi.main to run Taxi Xgboost benchmark
# or change to com.nvidia.spark.examples.agaricus.main to run Agaricus Xgboost benchmark

# tree construction algorithm
export TREE_METHOD=gpu_hist

# if you enabled the archived python environment
export PYSPARK_DRIVER_PYTHON=python # do not set this in cluster deploy mode
export PYSPARK_PYTHON=./environment/bin/python
```

Run spark-submit (omit the `--archives` line below if you did not create an archived Python environment):

``` bash
${SPARK_HOME}/bin/spark-submit                                                  \
 --conf spark.plugins=com.nvidia.spark.SQLPlugin                       \
 --conf spark.rapids.memory.gpu.pool=NONE                     \
 --conf spark.executor.resource.gpu.amount=1                           \
 --conf spark.task.resource.gpu.amount=1                              \
 --conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh        \
 --files ${SPARK_HOME}/examples/src/main/scripts/getGpusResources.sh            \
 --master yarn                                                                  \
 --deploy-mode ${SPARK_DEPLOY_MODE}                                             \
 --archives your_pyspark_venv.tar.gz#environment                                \
 --num-executors ${SPARK_NUM_EXECUTORS}                                         \
 --driver-memory ${SPARK_DRIVER_MEMORY}                                         \
 --executor-memory ${SPARK_EXECUTOR_MEMORY}                                     \
 --jars ${RAPIDS_JAR}        \
 --py-files ${SAMPLE_ZIP}                   \
 ${MAIN_PY}                                                     \
 --mainClass=${EXAMPLE_CLASS}                                                   \
 --dataPath=train::${DATA_PATH}/mortgage/output/train/      \
 --dataPath=trans::${DATA_PATH}/mortgage/output/eval/        \
 --format=parquet                                                                   \
 --numWorkers=${SPARK_NUM_EXECUTORS}                                            \
 --treeMethod=${TREE_METHOD}                                                    \
 --numRound=100                                                                 \
 --maxDepth=8

# Change the format to csv if your input files are in CSV format.
# Please make sure to change the class and data path when running the Taxi or Agaricus benchmark
```

In the `stdout` driver log, you should see timings<sup>*</sup> (in seconds), and the accuracy metric:

```
----------------------------------------------------------------------------------------------------
Training takes 10.75 seconds

----------------------------------------------------------------------------------------------------
Transformation takes 4.38 seconds

----------------------------------------------------------------------------------------------------
Accuracy is 0.997544753891
```

Launch XGBoost Part on CPU
---------------------------

If you are running this example after the GPU example above, set these variables so that both training and testing run exclusively on the CPU:

``` bash
# location where data was downloaded
export DATA_PATH=hdfs:/tmp/xgboost4j_spark_python/

# spark deploy mode (see Apache Spark documentation for more information)
export SPARK_DEPLOY_MODE=cluster

# run a single executor for this example to limit the number of spark tasks and
# partitions to 1 as currently this number must match the number of input files
export SPARK_NUM_EXECUTORS=1

# spark driver memory
export SPARK_DRIVER_MEMORY=4g

# spark executor memory
export SPARK_EXECUTOR_MEMORY=8g

# example class to use
export EXAMPLE_CLASS=com.nvidia.spark.examples.mortgage.main
# or change to com.nvidia.spark.examples.taxi.main to run Taxi Xgboost benchmark
# or change to com.nvidia.spark.examples.agaricus.main to run Agaricus Xgboost benchmark

# tree construction algorithm
export TREE_METHOD=hist

# if you enabled the archived python environment
export PYSPARK_DRIVER_PYTHON=python # do not set this in cluster deploy mode
export PYSPARK_PYTHON=./environment/bin/python
```

This is the same command as for the GPU example (omit the `--archives` line if you did not create an archived Python environment), repeated for convenience:

``` bash
${SPARK_HOME}/bin/spark-submit                                                  \
 --master yarn                                                                  \
 --archives your_pyspark_venv.tar.gz#environment                                \
 --deploy-mode ${SPARK_DEPLOY_MODE}                                             \
 --num-executors ${SPARK_NUM_EXECUTORS}                                         \
 --driver-memory ${SPARK_DRIVER_MEMORY}                                         \
 --executor-memory ${SPARK_EXECUTOR_MEMORY}                                     \
 --jars ${RAPIDS_JAR}        \
 --py-files ${SAMPLE_ZIP}                                  \
 ${MAIN_PY}                                                     \
 --mainClass=${EXAMPLE_CLASS}                                                   \
 --dataPath=train::${DATA_PATH}/mortgage/output/train/       \
 --dataPath=trans::${DATA_PATH}/mortgage/output/eval/         \
 --format=parquet                                                               \
 --numWorkers=${SPARK_NUM_EXECUTORS}                                            \
 --treeMethod=${TREE_METHOD}                                                    \
 --numRound=100                                                                 \
 --maxDepth=8
 
 # Please make sure to change the class and data path when running the Taxi or Agaricus benchmark
```

In the `stdout` driver log, you should see timings<sup>*</sup> (in seconds), and the accuracy metric:

```
----------------------------------------------------------------------------------------------------
Training takes 10.76 seconds

----------------------------------------------------------------------------------------------------
Transformation takes 1.25 seconds

----------------------------------------------------------------------------------------------------
Accuracy is 0.998526852335
```

<sup>*</sup> The timings in this Getting Started guide are for illustrative purposes only.
Please see our [release announcement](https://medium.com/rapids-ai/nvidia-gpus-and-apache-spark-one-step-closer-2d99e37ac8fd) for official benchmarks.


================================================
FILE: docs/get-started/xgboost-examples/on-prem-cluster/yarn-scala.md
================================================
Get Started with XGBoost4J-Spark on Apache Hadoop YARN
======================================================

This is a getting started guide to XGBoost4J-Spark on Apache Hadoop YARN supporting GPU scheduling. 
At the end of this guide, the reader will be able to run a sample Apache Spark application that runs on NVIDIA GPUs.

Prerequisites
-------------

* Apache Spark 3.2.0+ running on YARN supporting GPU scheduling. (e.g.: Spark 3.2.0, Hadoop-Yarn 3.3.0)
* Hardware Requirements
  * NVIDIA Pascal™ GPU architecture or better
  * Multi-node clusters with homogenous GPU configuration
* Software Requirements
  * Ubuntu 20.04/22.04, CentOS 7, or Rocky Linux 8
  * CUDA 11.0+
  * NVIDIA driver compatible with your CUDA
  * NCCL 2.7.8+

The number of GPUs per NodeManager dictates the number of Spark executors that can run in that NodeManager. 
Additionally, cores per Spark executor and cores per Spark task must match, such that each executor can run 1 task at any given time.

For example: if each NodeManager has 4 GPUs, there should be 4 or fewer executors running on each NodeManager, 
and each executor should run 1 task (e.g.: A total of 4 tasks running on 4 GPUs). In order to achieve this, 
you may need to adjust `spark.task.cpus` and `spark.executor.cores` to match (both set to 1 by default).
Additionally, we recommend adjusting `executor-memory` to divide host memory evenly amongst the number of GPUs in each NodeManager,
such that Spark will schedule as many executors as there are GPUs in each NodeManager.

We use the `SPARK_HOME` environment variable to point to the Apache Spark installation.
For how to enable GPU scheduling and isolation on YARN,
please refer to the [Hadoop documentation](https://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/UsingGpus.html).

Get Jars and Dataset
-------------------------------

Make sure you have prepared the necessary packages and dataset by following this [guide](/docs/get-started/xgboost-examples/prepare-package-data/preparation-scala.md)

#### Note:
1. Mortgage and Taxi jobs have ETLs to generate the processed data.
2. For convenience, a subset of the [Taxi](/datasets/) dataset is made available in this repo and can be readily used to launch the XGBoost job. Use the [ETL](#etl) to generate larger datasets for training and testing.
3. Agaricus does not have an ETL process; it is combined with XGBoost as there is just a filter operation.

Create a directory in HDFS and copy the dataset there:

``` bash
[xgboost4j_spark]$ hadoop fs -mkdir /tmp/xgboost4j_spark
[xgboost4j_spark]$ hadoop fs -copyFromLocal ${SPARK_XGBOOST_DIR}/mortgage/* /tmp/xgboost4j_spark
```

<span id="etl">Launch Mortgage or Taxi ETL Part</span>
---------------------------

Use the ETL app to process the raw Mortgage data. You can either split the ETL output into training and evaluation sets, or run the ETL on different subsets of the dataset to produce separate training and evaluation datasets.

Note: for ETL jobs, set `spark.task.resource.gpu.amount` to `1/spark.executor.cores`.


Run spark-submit:

``` bash
${SPARK_HOME}/bin/spark-submit \
   --conf spark.plugins=com.nvidia.spark.SQLPlugin \
   --conf spark.executor.resource.gpu.amount=1 \
   --conf spark.executor.cores=10 \
   --conf spark.task.resource.gpu.amount=0.1 \
   --conf spark.rapids.sql.incompatibleDateFormats.enabled=true \
   --conf spark.rapids.sql.csv.read.double.enabled=true \
   --conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \
   --conf spark.sql.cache.serializer=com.nvidia.spark.ParquetCachedBatchSerializer \
   --files $SPARK_HOME/examples/src/main/scripts/getGpusResources.sh \
   --jars ${RAPIDS_JAR}                                           \
   --master yarn                                                                  \
   --deploy-mode ${SPARK_DEPLOY_MODE}                                             \
   --num-executors ${SPARK_NUM_EXECUTORS}                                         \
   --driver-memory ${SPARK_DRIVER_MEMORY}                                         \
   --executor-memory ${SPARK_EXECUTOR_MEMORY}                                     \
   --class com.nvidia.spark.examples.mortgage.ETLMain  \
   $SAMPLE_JAR \
   -format=csv \
   -dataPath="data::${SPARK_XGBOOST_DIR}/mortgage/input/" \
   -dataPath="out::${SPARK_XGBOOST_DIR}/mortgage/output/train/" \
   -dataPath="tmp::${SPARK_XGBOOST_DIR}/mortgage/output/tmp/"

# if generating eval data, change the data path to eval 
# -dataPath="data::${SPARK_XGBOOST_DIR}/mortgage/input/"
# -dataPath="out::${SPARK_XGBOOST_DIR}/mortgage/output/eval/"
# -dataPath="tmp::${SPARK_XGBOOST_DIR}/mortgage/output/tmp/"
# if running Taxi ETL benchmark, change the class and data path params to
# -class com.nvidia.spark.examples.taxi.ETLMain  
# -dataPath="raw::${SPARK_XGBOOST_DIR}/taxi/your-path"
# -dataPath="out::${SPARK_XGBOOST_DIR}/taxi/your-path"
```

Launch XGBoost Part on GPU
---------------------------

Variables required to run spark-submit command:

``` bash
# location where data was downloaded 
export DATA_PATH=hdfs:/tmp/xgboost4j_spark/data

# spark deploy mode (see Apache Spark documentation for more information) 
export SPARK_DEPLOY_MODE=cluster

# run a single executor for this example to limit the number of spark tasks and
# partitions to 1 as currently this number must match the number of input files
export SPARK_NUM_EXECUTORS=1

# spark driver memory
export SPARK_DRIVER_MEMORY=4g

# spark executor memory
export SPARK_EXECUTOR_MEMORY=8g

# example class to use
export EXAMPLE_CLASS=com.nvidia.spark.examples.mortgage.Main
# or change to com.nvidia.spark.examples.taxi.Main to run Taxi Xgboost benchmark
# or change to com.nvidia.spark.examples.agaricus.Main to run Agaricus Xgboost benchmark

# tree construction algorithm
export TREE_METHOD=gpu_hist
```

Run spark-submit:

``` bash
${SPARK_HOME}/bin/spark-submit                                                  \
 --conf spark.plugins=com.nvidia.spark.SQLPlugin \
 --conf spark.rapids.memory.gpu.pool=NONE \
 --conf spark.executor.resource.gpu.amount=1 \
 --conf spark.task.resource.gpu.amount=1 \
 --conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \
 --files $SPARK_HOME/examples/src/main/scripts/getGpusResources.sh \
 --jars ${RAPIDS_JAR}                                           \
 --master yarn                                                                  \
 --deploy-mode ${SPARK_DEPLOY_MODE}                                             \
 --num-executors ${SPARK_NUM_EXECUTORS}                                         \
 --driver-memory ${SPARK_DRIVER_MEMORY}                                         \
 --executor-memory ${SPARK_EXECUTOR_MEMORY}                                     \
 --class ${EXAMPLE_CLASS}                                                       \
 ${SAMPLE_JAR}                                                                 \
 -dataPath=train::${SPARK_XGBOOST_DIR}/mortgage/output/train/                   \
 -dataPath=trans::${SPARK_XGBOOST_DIR}/mortgage/output/eval/                    \
 -format=parquet                                                                \
 -numWorkers=${SPARK_NUM_EXECUTORS}                                             \
 -treeMethod=${TREE_METHOD}                                                     \
 -numRound=100                                                                  \
 -maxDepth=8

 # Please make sure to change the class and data path when running the Taxi or Agaricus benchmark
```

In the `stdout` driver log, you should see timings<sup>*</sup> (in seconds) and the accuracy metric (taking Mortgage as an example):

```
--------------
==> Benchmark: Elapsed time for [Mortgage GPU train csv stub Unknown Unknown Unknown]: 29.642s
--------------

--------------
==> Benchmark: Elapsed time for [Mortgage GPU transform csv stub Unknown Unknown Unknown]: 21.272s
--------------

--------------
==> Benchmark: Accuracy for [Mortgage GPU Accuracy csv stub Unknown Unknown Unknown]: 0.9874184013493451
--------------
```

Launch XGBoost Part on CPU
---------------------------

If you are running this example after the GPU example above, set these variables so that both training and testing run exclusively on the CPU:

``` bash
# location where data was downloaded 
export DATA_PATH=hdfs:/tmp/xgboost4j_spark/data

# spark deploy mode (see Apache Spark documentation for more information) 
export SPARK_DEPLOY_MODE=cluster

# run a single executor for this example to limit the number of spark tasks and
# partitions to 1 as currently this number must match the number of input files
export SPARK_NUM_EXECUTORS=1

# spark driver memory
export SPARK_DRIVER_MEMORY=4g

# spark executor memory
export SPARK_EXECUTOR_MEMORY=8g

# example class to use
export EXAMPLE_CLASS=com.nvidia.spark.examples.mortgage.Main
# Please make sure to change the class when running the Taxi or Agaricus benchmark

# tree construction algorithm
export TREE_METHOD=hist
```

This is the same command as for the GPU example, repeated for convenience:

``` bash
${SPARK_HOME}/bin/spark-submit                                                  \
 --master yarn                                                                  \
 --deploy-mode ${SPARK_DEPLOY_MODE}                                             \
 --num-executors ${SPARK_NUM_EXECUTORS}                                         \
 --driver-memory ${SPARK_DRIVER_MEMORY}                                         \
 --executor-memory ${SPARK_EXECUTOR_MEMORY}                                     \
 --class ${EXAMPLE_CLASS}                                                       \
 ${SAMPLE_JAR}                                                                 \
 -dataPath=train::${SPARK_XGBOOST_DIR}/mortgage/output/train/                   \
 -dataPath=trans::${SPARK_XGBOOST_DIR}/mortgage/output/eval/                    \
 -format=parquet                                                                \
 -numWorkers=${SPARK_NUM_EXECUTORS}                                             \
 -treeMethod=${TREE_METHOD}                                                     \
 -numRound=100                                                                  \
 -maxDepth=8

 # Please make sure to change the class and data path when running the Taxi or Agaricus benchmark
```

In the `stdout` driver log, you should see timings<sup>*</sup> (in seconds) and the accuracy metric (taking Mortgage as an example):

```
--------------
==> Benchmark: Elapsed time for [Mortgage CPU train csv stub Unknown Unknown Unknown]: 286.398s
--------------

--------------
==> Benchmark: Elapsed time for [Mortgage CPU transform csv stub Unknown Unknown Unknown]: 49.836s
--------------

--------------
==> Benchmark: Accuracy for [Mortgage CPU Accuracy csv stub Unknown Unknown Unknown]: 0.9873709530950067
--------------
```

<sup>*</sup> The timings in this Getting Started guide are for illustrative purposes only.
Please see our [release announcement](https://medium.com/rapids-ai/nvidia-gpus-and-apache-spark-one-step-closer-2d99e37ac8fd) for official benchmarks.


================================================
FILE: docs/get-started/xgboost-examples/prepare-package-data/preparation-python.md
================================================
## Prepare packages and dataset for pyspark

For simplicity, export the locations of these packages. All examples assume the packages and dataset are placed in the `/opt/xgboost` directory:

### Download the jars

Download the RAPIDS Accelerator for Apache Spark plugin jar
  * [RAPIDS Spark Package](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/26.02.0/rapids-4-spark_2.12-26.02.0.jar)

### Build XGBoost Python Examples

Following this [guide](/docs/get-started/xgboost-examples/building-sample-apps/python.md), you can get *samples.zip* and *main.py* and copy them to `/opt/xgboost`

### Download dataset

You need to copy the dataset to `/opt/xgboost`. Use the following links to download the data.
1. [Mortgage dataset](/docs/get-started/xgboost-examples/dataset/mortgage.md)
2. [Taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
3. [Agaricus dataset](https://github.com/dmlc/xgboost/tree/master/demo/data)
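
Putting this together, the exports could look like the following (a sketch; the variable names match those used by the spark-submit commands in the cluster guides, and the layout assumes everything was placed under `/opt/xgboost` as described above):

``` bash
export SPARK_XGBOOST_DIR=/opt/xgboost
export RAPIDS_JAR=${SPARK_XGBOOST_DIR}/rapids-4-spark_2.12-26.02.0.jar
export SAMPLE_ZIP=${SPARK_XGBOOST_DIR}/samples.zip
export MAIN_PY=${SPARK_XGBOOST_DIR}/main.py
```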


================================================
FILE: docs/get-started/xgboost-examples/prepare-package-data/preparation-scala.md
================================================
## Prepare packages and dataset for scala

For simplicity, export the locations of these packages. All examples assume the packages and dataset are placed in the `/opt/xgboost` directory:

### Download the jars

1. Download the RAPIDS Accelerator for Apache Spark plugin jar
   * [RAPIDS Spark Package](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/26.02.0/rapids-4-spark_2.12-26.02.0.jar)

### Build XGBoost Scala Examples

Following this [guide](/docs/get-started/xgboost-examples/building-sample-apps/scala.md), you can get *sample_xgboost_apps-0.2.3-jar-with-dependencies.jar* and copy it to `/opt/xgboost`

### Download dataset

You need to copy the dataset to `/opt/xgboost`. Use the following links to download the data.
1. [Mortgage dataset](/docs/get-started/xgboost-examples/dataset/mortgage.md)
2. [Taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
3. [Agaricus dataset](https://github.com/dmlc/xgboost/tree/master/demo/data)
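
Putting this together, the exports could look like the following (a sketch; the variable names match those used by the spark-submit commands in the cluster guides, and the layout assumes everything was placed under `/opt/xgboost` as described above):

``` bash
export SPARK_XGBOOST_DIR=/opt/xgboost
export RAPIDS_JAR=${SPARK_XGBOOST_DIR}/rapids-4-spark_2.12-26.02.0.jar
export SAMPLE_JAR=${SPARK_XGBOOST_DIR}/sample_xgboost_apps-0.2.3-jar-with-dependencies.jar
```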


================================================
FILE: docs/trouble-shooting/xgboost-examples-trouble-shooting.md
================================================
## XGBoost

### 1. NCCL errors

XGBoost supports distributed GPU training, which depends on [NCCL2](https://developer.nvidia.com/nccl). NCCL auto-detects which network interfaces to use for inter-node communication. If some interfaces are up but unable to communicate between nodes, NCCL may try to use them anyway and therefore fail during the init functions or **even hang**.

To track NCCL errors, enable `NCCL_DEBUG` when submitting the Spark application:

``` bash
--conf spark.executorEnv.NCCL_DEBUG=INFO
```

Sometimes a node tries to connect to another node over an inappropriately selected interface, which may cause the XGBoost task to hang. To fix this kind of issue, specify an appropriate interface for the node via `NCCL_SOCKET_IFNAME`:

``` bash
--conf spark.executorEnv.NCCL_SOCKET_IFNAME=eth0
```
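
To see which interface names are available on a node (so you can pick a sensible value instead of the `eth0` shown above), you can list them from sysfs on Linux:

``` bash
# List the network interface names known to the kernel.
ls /sys/class/net
```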

================================================
FILE: examples/MIG-Support/README.md
================================================
# Multi-Instance GPU (MIG) support in Apache Hadoop YARN

There are multiple solutions for MIG scheduling on YARN that you can choose based on your environment and
deployment requirements:

- [YARN 3.3.0+ MIG GPU Plugin](/examples/MIG-Support/device-plugins/gpu-mig) for adding a Java-based plugin for MIG
on top of the Pluggable Device Framework
- [YARN 3.1.2 until YARN 3.3.0 MIG GPU Support](/examples/MIG-Support/resource-types/gpu-mig) for
patching and rebuilding YARN code base to support MIG devices.
- [YARN 3.1.2+ MIG GPU Support without modifying YARN / Device Plugin Code](/examples/MIG-Support/yarn-unpatched)
relying on installing nvidia CLI wrappers written in `bash`, but unlike the solutions above without
any Java code changes.

## Limitations and Caveats

Note that there are some common caveats for the solutions above.

### Single MIG GPU per Container

Please see the [MIG Application Considerations](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/#app-considerations)
and [CUDA Device Enumeration](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#cuda-visible-devices).

It is important to note that CUDA 11 only supports enumeration of a single MIG instance.
It is recommended that you configure YARN to only allow a single GPU to be requested. See
the YARN config `yarn.resource-types.nvidia/miggpu.maximum-allocation` for the
[Pluggable Device Framework](/examples/MIG-Support/device-plugins/gpu-mig) solution and
`yarn.resource-types.yarn.io/gpu.maximum-allocation` for the remainder of the MIG support options above, respectively.
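
A hypothetical `yarn-site.xml` fragment for the Pluggable Device Framework case (the property name is taken from the text above; substitute `yarn.resource-types.yarn.io/gpu.maximum-allocation` for the other options):

``` xml
<!-- Cap MIG GPU requests at one per container -->
<property>
  <name>yarn.resource-types.nvidia/miggpu.maximum-allocation</name>
  <value>1</value>
</property>
```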

### Metrics
Some metrics are not and cannot be broken down by MIG device. For example, `utilization` is the
aggregate utilization of the parent GPU, and there is no attribution of `temperature` to a
particular MIG device.

### GPU index / address as reported by Apache Spark in logs and UI

With YARN isolation using NVIDIA Container Runtime ensuring a single visible device
per Docker container running a Spark Executor, each Executor will see a disjoint list comprising
a single device.
Therefore, the user will end up observing index 0 being used by all executors. However, they refer
to different GPU/MIG instances. You can verify this by running something like the following on a
YARN worker node host OS:

```bash
for cid in $(sudo docker ps -q); do sudo docker exec $cid bash -c "printenv | grep VISIBLE; nvidia-smi -L"; done
NVIDIA_VISIBLE_DEVICES=3
GPU 0: NVIDIA A30 (UUID: GPU-05aa99be-b706-0dc1-ab62-dd12f2227b7d)
  MIG 1g.6gb      Device  0: (UUID: MIG-70dc024a-e8d7-587c-81dd-57ad493b1d91)
NVIDIA_VISIBLE_DEVICES=1
GPU 0: NVIDIA A30 (UUID: GPU-05aa99be-b706-0dc1-ab62-dd12f2227b7d)
  MIG 1c.2g.12gb  Device  0: (UUID: MIG-54cc2421-6f2d-59e9-b074-20707aadd71e)
NVIDIA_VISIBLE_DEVICES=2
GPU 0: NVIDIA A30 (UUID: GPU-05aa99be-b706-0dc1-ab62-dd12f2227b7d)
  MIG 1g.6gb      Device  0: (UUID: MIG-7e5552bf-d328-57a8-b091-0720d4530ffb)
NVIDIA_VISIBLE_DEVICES=0
GPU 0: NVIDIA A30 (UUID: GPU-05aa99be-b706-0dc1-ab62-dd12f2227b7d)
  MIG 1c.2g.12gb  Device  0: (UUID: MIG-e6af58f0-9af8-594f-825e-74d23e1a68c1)
```






================================================
FILE: examples/MIG-Support/device-plugins/gpu-mig/README.md
================================================
# NVIDIA GPU Plugin for YARN with MIG support for YARN 3.3.0+

This plugin adds support for GPUs with [MIG](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/) on YARN. The built-in YARN GPU plugin does not support MIG enabled GPUs.
This plugin also works with GPUs without MIG or with MIG disabled, but the limitations section still applies. It supports heterogeneous environments where
some GPUs have MIG enabled and others do not. If you are not using MIG enabled GPUs, you should use the built-in YARN GPU plugin.

## Compatibility

It works with Apache YARN 3.3.0+ versions that support the [Pluggable Device Framework](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/PluggableDeviceFramework.html). This plugin requires YARN to be configured with Docker using the NVIDIA Container Toolkit (nvidia-docker2).

## Limitations

Please see the [MIG Application Considerations](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/#app-considerations)
and [CUDA Device Enumeration](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#cuda-visible-devices).

It is important to note that CUDA 11 only supports enumeration of a single MIG instance. This means that this plugin
only supports 1 GPU per container, and by default the plugin will throw an exception if you request more.
It is recommended that you configure YARN to allow only a single GPU to be requested. See the YARN config:
```
 yarn.resource-types.nvidia/miggpu.maximum-allocation
```
See [YARN Resource Configuration](https://hadoop.apache.org/docs/r3.3.1/hadoop-yarn/hadoop-yarn-site/ResourceModel.html) for more details.
If you do not configure the maximum allocation and someone requests multiple GPUs, the default behavior is to throw an exception. The
exception the user sees is not very useful, as the real exception will be in the NodeManager logs. See the [Configuration](#configuration) section for options
to change this behavior.
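
For example, a cap of one `nvidia/miggpu` per container could be set in `yarn-site.xml` like this (a sketch following the YARN resource-types naming convention; verify the property against your YARN version):

```
<property>
  <name>yarn.resource-types.nvidia/miggpu.maximum-allocation</name>
  <value>1</value>
</property>
```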

## Building From Source

```
mvn package 
```

This will create a jar `target/yarn-gpu-mig-plugin-1.0.0.jar`. This jar can be installed on your YARN cluster as a plugin.

## Installation

These instructions assume YARN is already installed and configured with Docker enabled using the NVIDIA Container Toolkit (nvidia-docker2).
Enable and configure your [GPUs with MIG](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html) on all of the nodes it applies to.

Install the jar into your Hadoop cluster; see the [Test and Use Your Own Plugin](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/DevelopYourOwnDevicePlugin.html)
section, which recommends installing it in a location such as `$HADOOP_COMMON_HOME/share/hadoop/yarn`.

Configure the device plugin, see the YARN documentation on [Pluggable Device Framework](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/PluggableDeviceFramework.html).

After enabling the framework, enable the plugin in `yarn-site.xml`:

```
<property>
  <name>yarn.nodemanager.pluggable-device-framework.device-classes</name>
  <value>com.nvidia.spark.NvidiaGPUMigPluginForRuntimeV2</value>
</property>

```

Configure YARN to have the new resource type by modifying the `resource-types.xml` file to include:

```
<property>
  <name>yarn.resource-types</name>
  <value>nvidia/miggpu</value>
</property>
```

Restart YARN to pick up any configuration changes.

## Configuration

To change the behavior of throwing when the user allocates multiple GPUs, you can either set a config in `yarn-site.xml` or set
an environment variable when launching the Spark application. The environment variable takes precedence if both are set.
In either case, `true` means throw if a user requests multiple GPUs (this is the default); `false`
means it won't throw, and if the container is allocated multiple MIG devices from the same
GPU, it is up to the application to know how to use them.

Config for `yarn-site.xml`:
```
<property>
  <name>com.nvidia.spark.NvidiaGPUMigPluginForRuntimeV2.throwOnMultipleGPUs</name>
  <value>true</value>
</property>
```

Environment variable for Spark application:
```
--conf spark.executorEnv.NVIDIA_MIG_PLUGIN_THROW_ON_MULTIPLE_GPUS=true
```

## Using with Apache Spark on YARN
Spark supports [scheduling GPUs and other custom resources on YARN](http://spark.apache.org/docs/latest/running-on-yarn.html#resource-allocation-and-configuration-overview). There are 2 options for using this plugin with Spark to allocate GPUs with MIG support: 

- Use Spark 3.2.1 or newer and remap the standard Spark `gpu` resource (i.e.: `spark.executor.resource.gpu.amount`) to be the new MIG GPU resource type using:
```
--conf spark.yarn.resourceGpuDeviceName=nvidia/miggpu
```
This means users don't have to change their configs if they were already using the `gpu` resource type.

- Spark applications specify the `nvidia/miggpu` resource type instead of the `gpu` resource type. For this the user has to change the resource
type to `nvidia/miggpu`, update the discovery script, and specify an extra YARN config (`spark.yarn.executor.resource.nvidia/miggpu.amount`).
The command would be something like the following (update the amounts according to your setup):
```
 --conf spark.executor.resource.nvidia/miggpu.amount=1 --conf spark.executor.resource.nvidia/miggpu.discoveryScript=./getMIGGPUs --conf spark.task.resource.nvidia/miggpu.amount=0.25 --files ./getMIGGPUs --conf spark.yarn.executor.resource.nvidia/miggpu.amount=1
```
Note the getMIGGPUs discovery script is in the `scripts` directory in this repo. It simply changes the resource name returned to match
`nvidia/miggpu`.
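
The script's core transformation, joining the index lines from `nvidia-smi` into the JSON address array Spark expects, can be exercised without a GPU (a sketch using canned indices in place of real `nvidia-smi` output):

```shell
# Simulate `nvidia-smi --query-gpu=index --format=csv,noheader` output with
# canned indices, then join the lines into a JSON address array the same way
# the getMIGGPUs script does.
ADDRS=$(printf '0\n1\n2\n' | sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/","/g')
echo "{\"name\": \"nvidia/miggpu\", \"addresses\":[\"$ADDRS\"]}"
# prints: {"name": "nvidia/miggpu", "addresses":["0","1","2"]}
```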

## Testing
Run a Spark application using the [Rapids Accelerator for Apache Spark](https://nvidia.github.io/spark-rapids/) and request GPUs
from YARN and verify they use the MIG enabled GPUs.


================================================
FILE: examples/MIG-Support/device-plugins/gpu-mig/pom.xml
================================================
<?xml version="1.0" encoding="UTF-8"?>
<!--
  Copyright (c) 2021, NVIDIA CORPORATION.

  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.nvidia</groupId>
    <artifactId>yarn-gpu-mig-plugin</artifactId>
    <name>YARN Device Plugin that supports MIG</name>
    <description>The root project of the YARN Device Plugin that supports MIG</description>
    <version>1.0.0</version>
    <packaging>jar</packaging>

    <licenses>
        <license>
            <name>Apache License, Version 2.0</name>
            <url>https://www.apache.org/licenses/LICENSE-2.0.txt</url>
            <distribution>repo</distribution>
        </license>
    </licenses>

    <properties>
        <yarn.version>3.3.6</yarn.version>
        <java.version>1.8</java.version>
        <maven.compiler.version>3.8.1</maven.compiler.version>
        <maven.jar.plugin.version>3.2.0</maven.jar.plugin.version>
        <junit.version>4.13.1</junit.version>
        <mockito.core.version>3.4.6</mockito.core.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-yarn-server-nodemanager</artifactId>
            <version>${yarn.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>${junit.version}</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.mockito</groupId>
            <artifactId>mockito-core</artifactId>
            <version>${mockito.core.version}</version>
            <scope>test</scope>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>${maven.compiler.version}</version>
                <configuration>
                    <source>${java.version}</source>
                    <target>${java.version}</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-jar-plugin</artifactId>
                <version>${maven.jar.plugin.version}</version>
                <executions>
                    <execution>
                        <id>default-jar</id>
                        <phase>package</phase>
                        <goals>
                            <goal>jar</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>


================================================
FILE: examples/MIG-Support/device-plugins/gpu-mig/scripts/getMIGGPUs
================================================
#!/usr/bin/env bash

# Copyright (c) 2021, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# This script is a basic example script to get resource information about NVIDIA MIG GPUs.
# It works with the NVIDIA GPU Plugin for YARN with MIG support and is expected to be run
# in a container where the nvidia-docker-v2 plugin has taken care of mapping the MIG
# devices. This is the same as the Apache Spark script, except the resource name is changed
# to match the new plugin.
#
# It assumes the drivers are properly installed and the nvidia-smi command is available.
# It is not guaranteed to work on all setups so please test and customize as needed
# for your environment. It can be passed to Spark via the config
# spark.{driver/executor}.resource.nvidia/miggpu.discoveryScript to allow the driver or executor to discover
# the GPUs it was allocated. It assumes you are running within an isolated container where the
# GPUs are allocated exclusively to that driver or executor.
# It outputs a JSON formatted string in the form expected by the
# spark.{driver/executor}.resource.nvidia/miggpu.discoveryScript config.
#
# Example output: {"name": "nvidia/miggpu", "addresses":["0"]}

ADDRS=$(nvidia-smi --query-gpu=index --format=csv,noheader | sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/","/g')
echo {\"name\": \"nvidia/miggpu\", \"addresses\":[\"$ADDRS\"]}


================================================
FILE: examples/MIG-Support/device-plugins/gpu-mig/src/main/java/com/nvidia/spark/NvidiaGPUMigPluginForRuntimeV2.java
================================================
/*
 * Copyright (c) 2021, NVIDIA CORPORATION.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package com.nvidia.spark;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.Shell;
import org.apache.hadoop.yarn.exceptions.YarnException;
import org.apache.hadoop.yarn.server.nodemanager.api.deviceplugin.Device;
import org.apache.hadoop.yarn.server.nodemanager.api.deviceplugin.DevicePlugin;
import org.apache.hadoop.yarn.server.nodemanager.api.deviceplugin.DevicePluginScheduler;
import org.apache.hadoop.yarn.server.nodemanager.api.deviceplugin.DeviceRegisterRequest;
import org.apache.hadoop.yarn.server.nodemanager.api.deviceplugin.DeviceRuntimeSpec;
import org.apache.hadoop.yarn.server.nodemanager.api.deviceplugin.YarnRuntimeType;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/**
 * Nvidia GPU plugin supporting the Nvidia container runtime v2.
 * It supports discovering and allocating MIG devices. Currently, with CUDA 11,
 * only enumeration of a single MIG instance is supported. This means that
 * this plugin officially only supports 1 GPU per container and by default
 * will throw an exception if more are requested. The behavior of throwing
 * an exception is configurable by either setting the environment variable
 * {@code NVIDIA_MIG_PLUGIN_THROW_ON_MULTIPLE_GPUS} or by setting the YARN config
 * {@code com.nvidia.spark.NvidiaGPUMigPluginForRuntimeV2.throwOnMultipleGPUs}
 * to false.
 */
public class NvidiaGPUMigPluginForRuntimeV2 implements DevicePlugin,
        DevicePluginScheduler {
    public static final Logger LOG = LoggerFactory.getLogger(
            NvidiaGPUMigPluginForRuntimeV2.class);

    public static final String NV_RESOURCE_NAME = "nvidia/miggpu";

    private NvidiaCommandExecutor shellExecutor = new NvidiaCommandExecutor();

    private Map<String, String> environment = new HashMap<>();

    // If this environment is set, use it directly
    private static final String ENV_BINARY_PATH = "NVIDIA_SMI_PATH";

    private static final String DEFAULT_BINARY_NAME = "nvidia-smi";

    private static final String DEV_NAME_PREFIX = "nvidia";

    private static final String THROW_MULTI_CONF =
            "com.nvidia.spark.NvidiaGPUMigPluginForRuntimeV2.throwOnMultipleGPUs";

    private static final String THROW_MULTI_ENV = "NVIDIA_MIG_PLUGIN_THROW_ON_MULTIPLE_GPUS";

    private Boolean shouldThrowOnMultipleGPUFromConf =
        new Configuration().getBoolean(THROW_MULTI_CONF, true);
    private String shouldThrowOnMultipleGPUFromEnv = null;

    private String pathOfGpuBinary = null;

    // command should not run more than 10 sec.
    private static final int MAX_EXEC_TIMEOUT_MS = 10 * 1000;

    // When executable path not set, try to search default dirs.
    // By default search /usr/bin, /bin, and /usr/local/nvidia/bin (when
    // launched by nvidia-docker).
    private static final String[] DEFAULT_BINARY_SEARCH_DIRS = new String[]{
            "/usr/bin", "/bin", "/usr/local/nvidia/bin"};

    // device id -> mig id, populated during discovery and used when launching
    // containers
    private Map<Integer, String> migDevices = new HashMap<>();

    private String migInfoOutput = null;


    @Override
    public DeviceRegisterRequest getRegisterRequestInfo() throws Exception {
        return DeviceRegisterRequest.Builder.newInstance()
                .setResourceName(NV_RESOURCE_NAME).build();
    }

    @Override
    public Set<Device> getDevices() throws Exception {
        shellExecutor.searchBinary();
        TreeSet<Device> r = new TreeSet<>();
        String output;
        try {
            output = shellExecutor.getDeviceInfo();
            String[] lines = output.trim().split("\n");
            int id = 0;
            for (String oneLine : lines) {
                String[] tokensEachLine = oneLine.split(",");
                if (tokensEachLine.length != 3) {
                    throw new Exception("Cannot parse the output to get the MIG enabled info. "
                            + "output: " + oneLine + " expected index,pci.bus_id,mig.mode.current");
                }
                String minorNumber = tokensEachLine[0].trim();
                String busId = tokensEachLine[1].trim();
                String migMode = tokensEachLine[2].trim();
                String majorNumber = getMajorNumber(DEV_NAME_PREFIX
                        + minorNumber);

                if (majorNumber != null) {
                    if (migMode.equalsIgnoreCase("enabled")) {
                        if (migInfoOutput == null) {
                            // we get the mig info for all the GPUs on the host so only get it once
                            migInfoOutput = shellExecutor.getDeviceMigInfo();
                            if (migInfoOutput == null) {
                                throw new Exception("MIG device enabled but no device info found");
                            }
                        }
                        String[] linesMig = migInfoOutput.trim().split("\n");
                        Integer minorNumInt = Integer.parseInt(minorNumber);
                        Integer migDevCount = 0;
                        Integer numMigOutputLines = linesMig.length;
                        for (int idmig = 0; idmig < numMigOutputLines; idmig++) {
                            // first line should start with GPU
                            // GPU 0: NVIDIA A30 (UUID: GPU-e7076666-0544-e103-4f65-a047fc18269e)
                            // MIG 1g.6gb      Device  0: (UUID: MIG-de9876e2-eef7-5b5a-9701-db694ffe8a77)
                            if (linesMig[idmig].startsWith("GPU " + minorNumInt) && numMigOutputLines > (idmig + 1)) {
                                // process any MIG devices, this expects all the lines to be MIG devices until
                                // we find one that starts with GPU
                                String nextLine = linesMig[++idmig].trim();
                                String regex = "MIG (.+)Device\\s+(\\d+):\\s+\\(UUID:(.*)\\)";
                                Pattern pattern = Pattern.compile(regex);
                                while (nextLine.startsWith("MIG")) {
                                    Matcher matcher = pattern.matcher(nextLine);
                                    while (matcher.find()) {
                                        String devId = matcher.group(2);
                                        migDevices.put(id, devId);
                                        migDevCount++;
                                        r.add(Device.Builder.newInstance()
                                                .setId(id)
                                                .setMajorNumber(Integer.parseInt(majorNumber))
                                                .setMinorNumber(minorNumInt)
                                                .setBusID(busId)
                                                .setDevPath("/dev/" + DEV_NAME_PREFIX + minorNumber)
                                                .setHealthy(true)
                                                .setStatus(devId)
                                                .build());
                                        id++;
                                        if (++idmig < numMigOutputLines) {
                                            nextLine = linesMig[idmig].trim();
                                        } else {
                                            nextLine = "";
                                        }
                                    }
                                }
                                idmig = numMigOutputLines;
                            }
                        }
                        if (migDevCount < 1) {
                            throw new IOException("Error finding MIG devices on GPU with " +
                                "MIG enabled: " + migInfoOutput);
                        }
                        LOG.info("Added GPU " + majorNumber + ":" + minorNumInt +
                            " with MIG Enabled, found " + migDevCount + " MIG devices");
                    } else {
                        Integer majorNumInt = Integer.parseInt(majorNumber);
                        Integer minorNumInt = Integer.parseInt(minorNumber);
                        r.add(Device.Builder.newInstance()
                                .setId(id)
                                .setMajorNumber(majorNumInt)
                                .setMinorNumber(minorNumInt)
                                .setBusID(busId)
                                .setDevPath("/dev/" + DEV_NAME_PREFIX + minorNumber)
                                .setHealthy(true)
                                .build());
                        LOG.info("Added GPU " + majorNumInt + ":" + minorNumInt);
                        id++;
                    }
                }
            }
            return r;
        } catch (IOException e) {
            LOG.debug("Failed to get output from {}", pathOfGpuBinary);
            throw new YarnException(e);
        }
    }

    private Boolean shouldThrowOnMultipleGPUs() {
        // env setting takes highest priority if it is set
        if (shouldThrowOnMultipleGPUFromEnv != null) {
            return Boolean.parseBoolean(shouldThrowOnMultipleGPUFromEnv);
        }
        return shouldThrowOnMultipleGPUFromConf;
    }

    @Override
    public DeviceRuntimeSpec onDevicesAllocated(Set<Device> allocatedDevices,
                                                YarnRuntimeType yarnRuntime) throws Exception {
        LOG.debug("Generating runtime spec for allocated devices: {}, {}",
                allocatedDevices, yarnRuntime.getName());
        if (allocatedDevices.size() > 1 && shouldThrowOnMultipleGPUs()) {
            throw new YarnException("Allocating more than 1 GPU per container is" +
                    " not supported with use of MIG!");
        }
        if (yarnRuntime == YarnRuntimeType.RUNTIME_DOCKER) {
            String nvidiaRuntime = "nvidia";
            String nvidiaVisibleDevices = "NVIDIA_VISIBLE_DEVICES";
            StringBuffer gpuMinorNumbersSB = new StringBuffer();
            for (Device device : allocatedDevices) {
                Integer minorNum = device.getMinorNumber();
                Integer id = device.getId();
                if (migDevices.containsKey(id)) {
                    gpuMinorNumbersSB.append(minorNum + ":" + migDevices.get(id) + ",");
                } else {
                    gpuMinorNumbersSB.append(minorNum + ",");
                }
            }
            String minorNumbers = gpuMinorNumbersSB.toString();
            LOG.info("Nvidia Docker v2 assigned GPU: " + minorNumbers);
            String deviceStr = minorNumbers.substring(0, minorNumbers.length() - 1);
            return DeviceRuntimeSpec.Builder.newInstance()
                    .addEnv(nvidiaVisibleDevices, deviceStr)
                    .setContainerRuntime(nvidiaRuntime)
                    .build();
        }
        return null;
    }

    @Override
    public void onDevicesReleased(Set<Device> releasedDevices) throws Exception {
        // do nothing
    }

    // Get major number from device name.
    private String getMajorNumber(String devName) {
        String output = null;
        // output "major:minor" in hex
        try {
            LOG.debug("Get major numbers from /dev/{}", devName);
            output = shellExecutor.getMajorMinorInfo(devName);
            String[] strs = output.trim().split(":");
            output = Integer.toString(Integer.parseInt(strs[0], 16));
        } catch (IOException e) {
            String msg =
                    "Failed to get major number from reading /dev/" + devName;
            LOG.warn(msg);
        } catch (NumberFormatException e) {
            LOG.error("Failed to parse device major number from stat output");
            output = null;
        }
        return output;
    }

    @Override
    public Set<Device> allocateDevices(Set<Device> availableDevices, int count,
                                       Map<String, String> envs) {
        Set<Device> allocation = new TreeSet<>();
        String envShouldThrow = envs.get(THROW_MULTI_ENV);
        if (envShouldThrow != null) {
            shouldThrowOnMultipleGPUFromEnv = envShouldThrow;
        }
        // Only officially support 1 GPU per container so don't worry about topology
        // scheduling.
        basicSchedule(allocation, count, availableDevices);
        return allocation;
    }

    public void basicSchedule(Set<Device> allocation, int count,
                              Set<Device> availableDevices) {
        // Basic scheduling
        // allocate all available
        if (count == availableDevices.size()) {
            allocation.addAll(availableDevices);
            return;
        }
        int number = 0;
        for (Device d : availableDevices) {
            allocation.add(d);
            number++;
            if (number == count) {
                break;
            }
        }
    }

    /**
     * A shell wrapper class easy for test.
     */
    public class NvidiaCommandExecutor {

        public String getDeviceInfo() throws IOException {
            return Shell.execCommand(environment,
                    new String[]{pathOfGpuBinary, "--query-gpu=index,pci.bus_id,mig.mode.current",
                            "--format=csv,noheader"}, MAX_EXEC_TIMEOUT_MS);
        }

        public String getDeviceMigInfo() throws IOException {
            return Shell.execCommand(environment,
                    new String[]{pathOfGpuBinary, "-L"}, MAX_EXEC_TIMEOUT_MS);
        }

        public String getMajorMinorInfo(String devName) throws IOException {
            // output "major:minor" in hex
            Shell.ShellCommandExecutor shexec = new Shell.ShellCommandExecutor(
                    new String[]{"stat", "-c", "%t:%T", "/dev/" + devName});
            shexec.execute();
            return shexec.getOutput();
        }

        public void searchBinary() throws Exception {
            if (pathOfGpuBinary != null) {
                LOG.info("Skip searching, the NVIDIA gpu binary is already set: "
                        + pathOfGpuBinary);
                return;
            }
            // search env for the binary
            String envBinaryPath = System.getenv(ENV_BINARY_PATH);
            if (null != envBinaryPath) {
                if (new File(envBinaryPath).exists()) {
                    pathOfGpuBinary = envBinaryPath;
                    LOG.info("Use NVIDIA gpu binary: " + pathOfGpuBinary);
                    return;
                }
            }
            LOG.debug("Search binary..");
            // search if binary exists in default folders
            File binaryFile;
            boolean found = false;
            for (String dir : DEFAULT_BINARY_SEARCH_DIRS) {
                binaryFile = new File(dir, DEFAULT_BINARY_NAME);
                if (binaryFile.exists()) {
                    found = true;
                    pathOfGpuBinary = binaryFile.getAbsolutePath();
                    LOG.info("Found binary:" + pathOfGpuBinary);
                    break;
                }
            }
            if (!found) {
                LOG.error("No binary found from env variable: "
                        + ENV_BINARY_PATH + " or paths "
                        + java.util.Arrays.toString(DEFAULT_BINARY_SEARCH_DIRS));
                throw new Exception("No binary found for "
                        + NvidiaGPUMigPluginForRuntimeV2.class);
            }
        }
    }

    // visible for testing
    public void setPathOfGpuBinary(String pOfGpuBinary) {
        this.pathOfGpuBinary = pOfGpuBinary;
    }

    // visible for testing
    public void setShellExecutor(NvidiaCommandExecutor shellExecutor) {
        this.shellExecutor = shellExecutor;
    }

    // visible for testing
    public void setMigDevices(Map<Integer, String> migDevices) {
        this.migDevices = migDevices;
    }

    // visible for testing
    public void setShouldThrowOnMultipleGPUFromConf(Boolean shouldThrow) {
        this.shouldThrowOnMultipleGPUFromConf = shouldThrow;
    }
}


================================================
FILE: examples/MIG-Support/device-plugins/gpu-mig/src/test/java/com/nvidia/spark/TestNvidiaGPUMigPluginForRuntimeV2.java
================================================
/*
 * Copyright (c) 2021, NVIDIA CORPORATION.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package com.nvidia.spark;

import org.apache.hadoop.yarn.server.nodemanager.api.deviceplugin.Device;
import org.apache.hadoop.yarn.server.nodemanager.api.deviceplugin.DeviceRuntimeSpec;
import org.apache.hadoop.yarn.server.nodemanager.api.deviceplugin.YarnRuntimeType;
import org.junit.Assert;
import org.junit.Test;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

/**
 * Test case for NvidiaGPUMigPluginForRuntimeV2 device plugin.
 */
public class TestNvidiaGPUMigPluginForRuntimeV2 {

    private static final Logger LOG =
            LoggerFactory.getLogger(TestNvidiaGPUMigPluginForRuntimeV2.class);

    @Test
    public void testGetNvidiaDevices() throws Exception {
        NvidiaGPUMigPluginForRuntimeV2.NvidiaCommandExecutor mockShell =
                mock(NvidiaGPUMigPluginForRuntimeV2.NvidiaCommandExecutor.class);
        String deviceInfoShellOutput =
                "0, 00000000:04:00.0, [N/A]\n" +
                "1, 00000000:82:00.0, Enabled";
        String majorMinorNumber0 = "c3:0";
        String majorMinorNumber1 = "c3:1";
        String deviceMigInfoShellOutput =
                "GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-aa72194b-fdd4-24b0-f659-17c929f46267)\n" +
                "  MIG 1g.10gb     Device  0: (UUID: MIG-aa2c982c-48a9-5046-b7f8-aa4732879e02)\n" +
                "GPU 1: NVIDIA A100 80GB PCIe (UUID: GPU-aa7153bf-c0ba-00ef-cdce-f861c34172f6)\n" +
                "  MIG 1g.10gb     Device  0: (UUID: MIG-aa59d467-ba39-5d0a-a085-66af03246526)\n" +
                "  MIG 1g.10gb     Device  1: (UUID: MIG-aad5cb29-8e6f-510a-8352-8e18f483dc74)" +
        when(mockShell.getDeviceInfo()).thenReturn(deviceInfoShellOutput);
        when(mockShell.getDeviceMigInfo()).thenReturn(deviceMigInfoShellOutput);
        when(mockShell.getMajorMinorInfo("nvidia0"))
                .thenReturn(majorMinorNumber0);
        when(mockShell.getMajorMinorInfo("nvidia1"))
                .thenReturn(majorMinorNumber1);
        NvidiaGPUMigPluginForRuntimeV2 plugin = new NvidiaGPUMigPluginForRuntimeV2();
        plugin.setShellExecutor(mockShell);
        plugin.setPathOfGpuBinary("/fake/nvidia-smi");

        Set<Device> expectedDevices = new TreeSet<>();
        expectedDevices.add(Device.Builder.newInstance()
                .setId(0).setHealthy(true)
                .setBusID("00000000:04:00.0")
                .setDevPath("/dev/nvidia0")
                .setMajorNumber(195)
                .setStatus("0")
                .setMinorNumber(0).build());
        expectedDevices.add(Device.Builder.newInstance()
                .setId(1).setHealthy(true)
                .setBusID("00000000:82:00.0")
                .setDevPath("/dev/nvidia1")
                .setMajorNumber(195)
                .setStatus("0")
                .setMinorNumber(1).build());
        expectedDevices.add(Device.Builder.newInstance()
                .setId(2).setHealthy(true)
                .setBusID("00000000:82:00.0")
                .setDevPath("/dev/nvidia1")
                .setMajorNumber(195)
                .setStatus("1")
                .setMinorNumber(1).build());
        Set<Device> devices = plugin.getDevices();
        Assert.assertEquals(expectedDevices, devices);
    }

    @Test(expected = Exception.class)
    public void testOnDeviceAllocatedMultiGPU() throws Exception {
        NvidiaGPUMigPluginForRuntimeV2 plugin = new NvidiaGPUMigPluginForRuntimeV2();
        Set<Device> allocatedDevices = new TreeSet<>();

        DeviceRuntimeSpec spec = plugin.onDevicesAllocated(allocatedDevices,
                YarnRuntimeType.RUNTIME_DEFAULT);
        Assert.assertNull(spec);

        // allocate one device
        allocatedDevices.add(Device.Builder.newInstance()
                .setId(0).setHealthy(true)
                .setBusID("00000000:04:00.0")
                .setDevPath("/dev/nvidia0")
                .setMajorNumber(195)
                .setMinorNumber(0).build());
        spec = plugin.onDevicesAllocated(allocatedDevices,
                YarnRuntimeType.RUNTIME_DOCKER);
        Assert.assertEquals("nvidia", spec.getContainerRuntime());
        Assert.assertEquals("0", spec.getEnvs().get("NVIDIA_VISIBLE_DEVICES"));

        // allocate a second device, which should trigger the expected exception
        allocatedDevices.add(Device.Builder.newInstance()
                .setId(0).setHealthy(true)
                .setBusID("00000000:82:00.0")
                .setDevPath("/dev/nvidia1")
                .setMajorNumber(195)
                .setMinorNumber(1).build());
        spec = plugin.onDevicesAllocated(allocatedDevices,
                YarnRuntimeType.RUNTIME_DOCKER);
    }

    @Test
    public void testMultiGPUsEnvPrecedence() throws Exception {
        NvidiaGPUMigPluginForRuntimeV2 plugin = new NvidiaGPUMigPluginForRuntimeV2();
        Set<Device> allocatedDevices = new TreeSet<>();

        DeviceRuntimeSpec spec = plugin.onDevicesAllocated(allocatedDevices,
                YarnRuntimeType.RUNTIME_DEFAULT);
        Assert.assertNull(spec);

        // allocate one device
        allocatedDevices.add(Device.Builder.newInstance()
                .setId(0).setHealthy(true)
                .setBusID("00000000:04:00.0")
                .setDevPath("/dev/nvidia0")
                .setMajorNumber(195)
                .setMinorNumber(0).build());

        // two devices allowed
        allocatedDevices.add(Device.Builder.newInstance()
                .setId(0).setHealthy(true)
                .setBusID("00000000:82:00.0")
                .setDevPath("/dev/nvidia1")
                .setMajorNumber(195)
                .setMinorNumber(1).build());

        // test that the env variable takes precedence over the conf
        plugin.setShouldThrowOnMultipleGPUFromConf(true);
        Map<String, String> envs = new HashMap<>();
        envs.put("NVIDIA_MIG_PLUGIN_THROW_ON_MULTIPLE_GPUS", "false");
        // note the allocated devices doesn't matter here, just the env passed in
        plugin.allocateDevices(allocatedDevices, 2, envs);
        spec = plugin.onDevicesAllocated(allocatedDevices,
                YarnRuntimeType.RUNTIME_DOCKER);
        Assert.assertEquals("nvidia", spec.getContainerRuntime());
        Assert.assertEquals("0,1", spec.getEnvs().get("NVIDIA_VISIBLE_DEVICES"));
    }

    @Test
    public void testMultiGPUsConf() throws Exception {
        NvidiaGPUMigPluginForRuntimeV2 plugin = new NvidiaGPUMigPluginForRuntimeV2();
        Set<Device> allocatedDevices = new TreeSet<>();

        DeviceRuntimeSpec spec = plugin.onDevicesAllocated(allocatedDevices,
                YarnRuntimeType.RUNTIME_DEFAULT);
        Assert.assertNull(spec);

        // allocate one device
        allocatedDevices.add(Device.Builder.newInstance()
                .setId(0).setHealthy(true)
                .setBusID("00000000:04:00.0")
                .setDevPath("/dev/nvidia0")
                .setMajorNumber(195)
                .setMinorNumber(0).build());

        // two devices allowed
        allocatedDevices.add(Device.Builder.newInstance()
                .setId(0).setHealthy(true)
                .setBusID("00000000:82:00.0")
                .setDevPath("/dev/nvidia1")
                .setMajorNumber(195)
                .setMinorNumber(1).build());

        // test that the conf setting allows multiple GPUs
        plugin.setShouldThrowOnMultipleGPUFromConf(false);
        spec = plugin.onDevicesAllocated(allocatedDevices,
                YarnRuntimeType.RUNTIME_DOCKER);
        Assert.assertEquals("nvidia", spec.getContainerRuntime());
        Assert.assertEquals("0,1", spec.getEnvs().get("NVIDIA_VISIBLE_DEVICES"));
    }

    @Test
    public void testOnDeviceAllocatedMig() throws Exception {
        NvidiaGPUMigPluginForRuntimeV2 plugin = new NvidiaGPUMigPluginForRuntimeV2();
        Set<Device> allocatedDevices = new TreeSet<>();

        DeviceRuntimeSpec spec = plugin.onDevicesAllocated(allocatedDevices,
                YarnRuntimeType.RUNTIME_DEFAULT);
        Assert.assertNull(spec);

        Map<Integer, String> testMigDevices = new HashMap<>();
        testMigDevices.put(0, "0");
        plugin.setMigDevices(testMigDevices);

        // allocate one device
        allocatedDevices.add(Device.Builder.newInstance()
                .setId(0).setHealthy(true)
                .setBusID("00000000:04:00.0")
                .setDevPath("/dev/nvidia0")
                .setMajorNumber(195)
                .setMinorNumber(0).build());
        spec = plugin.onDevicesAllocated(allocatedDevices,
                YarnRuntimeType.RUNTIME_DOCKER);
        Assert.assertEquals("nvidia", spec.getContainerRuntime());
        Assert.assertEquals("0:0", spec.getEnvs().get("NVIDIA_VISIBLE_DEVICES"));
    }

    @Test
    public void testOnDeviceAllocatedNoMig() throws Exception {
        NvidiaGPUMigPluginForRuntimeV2 plugin = new NvidiaGPUMigPluginForRuntimeV2();
        Set<Device> allocatedDevices = new TreeSet<>();

        DeviceRuntimeSpec spec = plugin.onDevicesAllocated(allocatedDevices,
                YarnRuntimeType.RUNTIME_DEFAULT);
        Assert.assertNull(spec);

        // allocate one device
        allocatedDevices.add(Device.Builder.newInstance()
                .setId(0).setHealthy(true)
                .setBusID("00000000:04:00.0")
                .setDevPath("/dev/nvidia0")
                .setMajorNumber(195)
                .setMinorNumber(0).build());
        spec = plugin.onDevicesAllocated(allocatedDevices,
                YarnRuntimeType.RUNTIME_DOCKER);
        Assert.assertEquals("nvidia", spec.getContainerRuntime());
        Assert.assertEquals("0", spec.getEnvs().get("NVIDIA_VISIBLE_DEVICES"));
    }
}


================================================
FILE: examples/MIG-Support/resource-types/gpu-mig/README.md
================================================
# NVIDIA GPU Support for YARN with MIG, for YARN 3.1.2 up to (but not including) YARN 3.3.0

This adds support for GPUs with [MIG](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/) on YARN versions prior to
YARN 3.3.0, which do not support the pluggable device framework. Use the [GPU Plugin for YARN with MIG support](../../device-plugins/gpu-mig/README.md)
for YARN 3.3.0 and newer versions. The built-in YARN GPU plugin does not support MIG-enabled GPUs. This patch
works with GPUs without MIG or with MIG disabled, but the limitations section still applies. It supports heterogeneous
environments where some GPUs have MIG enabled and some do not. This requires patching YARN and rebuilding it.

## Compatibility

Requires YARN 3.1.2 or newer that supports GPU scheduling. See the [supported versions](#supported-versions) section below for specific versions supported.
MIG support requires YARN to be configured with Docker and using the NVIDIA Container Toolkit (nvidia-docker2).

## Limitations

Please see the [MIG Application Considerations](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/#app-considerations)
and [CUDA Device Enumeration](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#cuda-visible-devices).

It is important to note that CUDA 11 only supports enumeration of a single MIG instance. This means that with this patch
and MIG support enabled, only 1 GPU per container is supported, and by default an exception is thrown if you request more.
It is recommended that you configure YARN to allow only a single GPU to be requested. See the YARN config:
```
 yarn.resource-types.yarn.io/gpu.maximum-allocation
```
See [YARN Resource Configuration](https://hadoop.apache.org/docs/r3.1.2/hadoop-yarn/hadoop-yarn-site/ResourceModel.html) for more details.
If you do not configure the maximum allocation and someone requests multiple GPUs, the default behavior is to throw an exception.
See the [Configuration](#configuration) section for options if it throws an exception.
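For example, the maximum allocation could be capped at one GPU in `yarn-site.xml` (a sketch; adjust the value to your deployment policy):

```
<property>
  <name>yarn.resource-types.yarn.io/gpu.maximum-allocation</name>
  <value>1</value>
</property>
```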

## Supported Versions
There are different patches available depending on the YARN version you are using:

- YARN 3.1.2 use patch `yarn312MIG.patch`
- YARN versions 3.1.3 to 3.1.5 (git hash cd7c34f9b4005d27886f73e58bef88e706fcccf9 since 3.1.5 was not released when this was tested) use `yarn313to315MIG.patch`
- YARN 3.2.0: no patch is currently available; backport the YARN 3.2.1 patch or contact us.
- YARN 3.2.1 and 3.2.3 use patch `yarn321to323MIG.patch`

## Building
Apply the patch to your YARN version and build it like you would normally for your deployment.

For example:
```
patch -p1 < yarn312MIG.patch
mvn clean package -Pdist -Dtar -DskipTests
```

Run unit tests:
```
mvn test -Pdist -Dtar -Dtest=TestGpuDiscoverer
mvn test -Pdist -Dtar -Dtest=TestNvidiaDockerV2CommandPlugin
```

## Installation

These instructions assume YARN is already installed and configured with GPU Scheduling enabled using Docker and the NVIDIA Container Toolkit (nvidia-docker2).
See [Using GPU on YARN](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/UsingGpus.html) if you need more information. 

Enable and configure your [GPUs with MIG](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html) on all of the nodes it applies to.

Install the new YARN version built with the patch on your YARN Cluster.

Enable the MIG GPU support in the Hadoop configuration files:

```
<property>
  <name>yarn.nodemanager.resource-plugins.gpu.use-mig-enabled</name>
  <value>true</value>
</property>
```

Restart YARN if needed to pick up any configuration changes.

## Configuration

The default behavior of the GPU resource plugin on YARN is to use `auto` discovery mode for GPUs on each NodeManager.
It also allows you to manually allow certain GPU devices. This configuration was extended to support MIG devices.
The `yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices` configuration can be used to manually specify devices.
A GPU device is identified by its index, its minor device number, and optionally a MIG device index. A common approach to get
the minor device numbers of GPUs (and optionally the MIG device indices) is to run `nvidia-smi -q` and search for the `Minor Number` output.
The format is `index:minor_number[:mig_index][,index:minor_number...]`. An example of manual specification is
`0:0,1:1:0,1:1:1,2:2`, which allows the YARN NodeManager to manage GPU devices with indices 0/1/2 and minor numbers 0/1/2,
where GPU index 1 has 2 MIG-enabled devices with indices 0/1.
```
<property>
  <name>yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices</name>
  <value>0:0,1:1:0,1:1:1,2:2</value>
</property>
```
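The parsing of this `index:minor_number[:mig_index]` format can be sketched as follows. This is a minimal illustration, not the actual YARN `GpuDiscoverer` code; the class and method names here are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of parsing the allowed-gpu-devices format. Names are
// illustrative, not the actual YARN GpuDiscoverer classes.
public class AllowedGpuSpecParser {

    // One allowed-device entry; migIndex is -1 when no MIG index was given.
    static final class Entry {
        final int index;
        final int minorNumber;
        final int migIndex;

        Entry(int index, int minorNumber, int migIndex) {
            this.index = index;
            this.minorNumber = minorNumber;
            this.migIndex = migIndex;
        }
    }

    static List<Entry> parse(String spec) {
        List<Entry> entries = new ArrayList<>();
        for (String s : spec.split(",")) {
            String[] kv = s.trim().split(":");
            if (kv.length != 2 && kv.length != 3) {
                throw new IllegalArgumentException(
                        "expected index:minor_number[:mig_index], got: " + s);
            }
            // A third field marks a MIG device on that GPU index.
            int mig = (kv.length == 3) ? Integer.parseInt(kv[2]) : -1;
            entries.add(new Entry(Integer.parseInt(kv[0]),
                    Integer.parseInt(kv[1]), mig));
        }
        return entries;
    }

    public static void main(String[] args) {
        // The example from this README: GPU index 1 exposes two MIG devices.
        List<Entry> entries = parse("0:0,1:1:0,1:1:1,2:2");
        for (Entry e : entries) {
            System.out.println("index=" + e.index + " minor=" + e.minorNumber
                    + " mig=" + e.migIndex);
        }
    }
}
```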

The behavior of throwing when a user allocates multiple GPUs can be controlled by setting an environment variable
when the Spark application is launched. Setting it to `true` means an exception is thrown if a user requests multiple GPUs (this is the default); `false`
means it will not throw, and if the container is allocated multiple MIG devices from the same GPU, it is up to the
application to know how to use them.

Environment variable for Spark application:
```
--conf spark.executorEnv.NVIDIA_MIG_PLUGIN_THROW_ON_MULTIPLE_GPUS=false
```

## Testing
Run a Spark application using the [Rapids Accelerator for Apache Spark](https://nvidia.github.io/spark-rapids/) and request GPUs
from YARN and verify they use the MIG enabled GPUs.
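For example, such an application could be launched with a command along these lines (a sketch only; the jar paths are placeholders, and depending on your Spark version a GPU discovery script may also be required on YARN):

```
spark-submit \
  --master yarn \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.resource.gpu.amount=1 \
  --jars /path/to/rapids-4-spark.jar \
  your-app.jar
```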


================================================
FILE: examples/MIG-Support/resource-types/gpu-mig/yarn312MIG.patch
================================================
diff --git a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
index 36fafefdbc4..e37d0a3a685 100644
--- a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
+++ b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
@@ -1574,6 +1574,10 @@ public static boolean isAclEnabled(Configuration conf) {
   @Private
   public static final String AUTOMATICALLY_DISCOVER_GPU_DEVICES = "auto";
 
+  @Private
+  public static final String USE_MIG_ENABLED_GPUS =
+          NM_GPU_RESOURCE_PREFIX + "use-mig-enabled";
+
   /**
    * This setting controls where to how to invoke GPU binaries
    */
diff --git a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/gpu/AssignedGpuDevice.java b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/gpu/AssignedGpuDevice.java
index 26fd9050742..e84b920dcee 100644
--- a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/gpu/AssignedGpuDevice.java
+++ b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/gpu/AssignedGpuDevice.java
@@ -34,6 +34,12 @@ public AssignedGpuDevice(int index, int minorNumber,
     this.containerId = containerId.toString();
   }
 
+  public AssignedGpuDevice(int index, int minorNumber,
+                           int migIndex, ContainerId containerId) {
+    super(index, minorNumber, migIndex);
+    this.containerId = containerId.toString();
+  }
+
   public String getContainerId() {
     return containerId;
   }
@@ -49,6 +55,7 @@ public boolean equals(Object obj) {
     }
     AssignedGpuDevice other = (AssignedGpuDevice) obj;
     return index == other.index && minorNumber == other.minorNumber
+        && migDeviceIndex == other.migDeviceIndex
         && containerId.equals(other.containerId);
   }
 
@@ -68,12 +75,16 @@ public int compareTo(Object obj) {
     if (0 != result) {
       return result;
     }
-    return containerId.compareTo(other.containerId);
+    result = containerId.compareTo(other.containerId);
+    if (0 != result) {
+      return result;
+    }
+    return Integer.compare(migDeviceIndex, other.migDeviceIndex);
   }
 
   @Override
   public int hashCode() {
     final int prime = 47;
-    return prime * (prime * index + minorNumber) + containerId.hashCode();
+    return prime * (prime * index + minorNumber + migDeviceIndex) + containerId.hashCode();
   }
 }
diff --git a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/gpu/GpuDevice.java b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/gpu/GpuDevice.java
index bce1d9fa480..3cb42d3c58f 100644
--- a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/gpu/GpuDevice.java
+++ b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/gpu/GpuDevice.java
@@ -26,6 +26,7 @@
 public class GpuDevice implements Serializable, Comparable {
   protected int index;
   protected int minorNumber;
+  protected int migDeviceIndex = -1;
   private static final long serialVersionUID = -6812314470754667710L;
 
   public GpuDevice(int index, int minorNumber) {
@@ -33,6 +34,12 @@ public GpuDevice(int index, int minorNumber) {
     this.minorNumber = minorNumber;
   }
 
+  public GpuDevice(int index, int minorNumber, int migIndex) {
+    this.index = index;
+    this.minorNumber = minorNumber;
+    this.migDeviceIndex = migIndex;
+  }
+
   public int getIndex() {
     return index;
   }
@@ -41,13 +48,17 @@ public int getMinorNumber() {
     return minorNumber;
   }
 
+  public int getMIGIndex() {
+    return migDeviceIndex;
+  }
+
   @Override
   public boolean equals(Object obj) {
     if (obj == null || !(obj instanceof GpuDevice)) {
       return false;
     }
     GpuDevice other = (GpuDevice) obj;
-    return index == other.index && minorNumber == other.minorNumber;
+    return index == other.index && minorNumber == other.minorNumber && migDeviceIndex == other.migDeviceIndex;
   }
 
   @Override
@@ -62,17 +73,21 @@ public int compareTo(Object obj) {
     if (0 != result) {
       return result;
     }
-    return Integer.compare(minorNumber, other.minorNumber);
+    result = Integer.compare(minorNumber, other.minorNumber);
+    if (0 != result) {
+      return result;
+    }
+    return Integer.compare(migDeviceIndex, other.migDeviceIndex);
   }
 
   @Override
   public int hashCode() {
     final int prime = 47;
-    return prime * index + minorNumber;
+    return prime * index + minorNumber + migDeviceIndex;
   }
 
   @Override
   public String toString() {
-    return "(index=" + index + ",minor_number=" + minorNumber + ")";
+    return "(index=" + index + ",minor_number=" + minorNumber + ",mig_index=" + migDeviceIndex + ")";
   }
 }
diff --git a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/gpu/GpuDiscoverer.java b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/gpu/GpuDiscoverer.java
index 6e3cf1315ce..55f7379d4cc 100644
--- a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/gpu/GpuDiscoverer.java
+++ b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/gpu/GpuDiscoverer.java
@@ -30,6 +30,7 @@
 import org.apache.hadoop.yarn.server.nodemanager.webapp.dao.gpu.GpuDeviceInformation;
 import org.apache.hadoop.yarn.server.nodemanager.webapp.dao.gpu.GpuDeviceInformationParser;
 import org.apache.hadoop.yarn.server.nodemanager.webapp.dao.gpu.PerGpuDeviceInformation;
+import org.apache.hadoop.yarn.server.nodemanager.webapp.dao.gpu.PerGpuMigDevice;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
@@ -149,6 +150,10 @@ public synchronized GpuDeviceInformation getGpuDeviceInformation()
         YarnConfiguration.NM_GPU_ALLOWED_DEVICES,
         YarnConfiguration.AUTOMATICALLY_DISCOVER_GPU_DEVICES);
 
+    Boolean useMIGEnabledGPUs = conf.getBoolean(
+            YarnConfiguration.USE_MIG_ENABLED_GPUS, false);
+    LOG.info("Use MIG enabled is: " + useMIGEnabledGPUs);
+
     List<GpuDevice> gpuDevices = new ArrayList<>();
 
     if (allowedDevicesStr.equals(
@@ -171,21 +176,45 @@ public synchronized GpuDeviceInformation getGpuDeviceInformation()
              i++) {
           List<PerGpuDeviceInformation> gpuInfos =
               lastDiscoveredGpuInformation.getGpus();
-          gpuDevices.add(new GpuDevice(i, gpuInfos.get(i).getMinorNumber()));
+          if (useMIGEnabledGPUs &&
+              gpuInfos.get(i).getMIGMode().getCurrentMigMode().equalsIgnoreCase("enabled")) {
+            LOG.info("GPU id " + i + " has MIG mode enabled.");
+            for (PerGpuMigDevice dev: gpuInfos.get(i).getMIGDevices()) {
+              gpuDevices.add(new GpuDevice(i, gpuInfos.get(i).getMinorNumber(), dev.getMigDeviceIndex()));
+            }
+          } else {
+            gpuDevices.add(new GpuDevice(i, gpuInfos.get(i).getMinorNumber()));
+          }
         }
+        LOG.info("Discovered GPU devices: " + gpuDevices);
       }
     } else{
       for (String s : allowedDevicesStr.split(",")) {
         if (s.trim().length() > 0) {
           String[] kv = s.trim().split(":");
-          if (kv.length != 2) {
-            throw new YarnException(
-                "Illegal format, it should be index:minor_number format, now it="
-                    + s);
+          if (useMIGEnabledGPUs) {
+            if (kv.length != 2 && kv.length != 3) {
+              throw new YarnException(
+                      "Illegal format, it should be index:minor_number or index:minor_number:mig_device_id" +
+                              " format, now it=" + s);
+            }
+            if (kv.length == 3) {
+              // assumes this is MIG enabled device
+              gpuDevices.add(
+                      new GpuDevice(Integer.parseInt(kv[0]), Integer.parseInt(kv[1]), Integer.parseInt(kv[2])));
+            } else {
+              gpuDevices.add(
+                      new GpuDevice(Integer.parseInt(kv[0]), Integer.parseInt(kv[1])));
+            }
+          } else {
+            if (kv.length != 2) {
+              throw new YarnException(
+                      "Illegal format, it should be index:minor_number format, now it="
+                              + s);
+            }
+            gpuDevices.add(
+                    new GpuDevice(Integer.parseInt(kv[0]), Integer.parseInt(kv[1])));
           }
-
-          gpuDevices.add(
-              new GpuDevice(Integer.parseInt(kv[0]), Integer.parseInt(kv[1])));
         }
       }
       LOG.info("Allowed GPU devices:" + gpuDevices);
diff --git a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/gpu/GpuDockerCommandPluginFactory.java b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/gpu/GpuDockerCommandPluginFactory.java
index 051afd6c561..996cb58ac45 100644
--- a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/gpu/GpuDockerCommandPluginFactory.java
+++ b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/gpu/GpuDockerCommandPluginFactory.java
@@ -36,7 +36,7 @@ public static DockerCommandPlugin createGpuDockerCommandPlugin(
     }
     // nvidia-docker2
     if (impl.equals(YarnConfiguration.NVIDIA_DOCKER_V2)) {
-      return new NvidiaDockerV2CommandPlugin();
+      return new NvidiaDockerV2CommandPlugin(conf);
     }
 
     throw new YarnException(
diff --git a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/gpu/NvidiaDockerV2CommandPlugin.java b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/gpu/NvidiaDockerV2CommandPlugin.java
index ff25eb6ced6..c2cc0e5a2d1 100644
--- a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/gpu/NvidiaDockerV2CommandPlugin.java
+++ b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/gpu/NvidiaDockerV2CommandPlugin.java
@@ -21,7 +21,9 @@
 import com.google.common.annotations.VisibleForTesting;
 import org.apache.commons.logging.Log;
 import org.apache.commons.logging.LogFactory;
+import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.yarn.api.records.ResourceInformation;
+import org.apache.hadoop.yarn.conf.YarnConfiguration;
 import org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container;
 import org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ResourceMappings;
 import org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources
│   │               │   ├── conftest.py
│   │               │   ├── data_gen.py
│   │               │   ├── rapids_udf_test.py
│   │               │   ├── spark_init_internal.py
│   │               │   └── spark_session.py
│   │               └── scala/
│   │                   └── com/
│   │                       └── nvidia/
│   │                           └── spark/
│   │                               └── rapids/
│   │                                   └── udf/
│   │                                       └── scala/
│   │                                           ├── URLDecode.scala
│   │                                           └── URLEncode.scala
│   ├── XGBoost-Examples/
│   │   ├── .gitignore
│   │   ├── README.md
│   │   ├── agaricus/
│   │   │   ├── .gitignore
│   │   │   ├── notebooks/
│   │   │   │   ├── python/
│   │   │   │   │   └── agaricus-gpu.ipynb
│   │   │   │   └── scala/
│   │   │   │       └── agaricus-gpu.ipynb
│   │   │   ├── pom.xml
│   │   │   ├── python/
│   │   │   │   └── com/
│   │   │   │       ├── __init__.py
│   │   │   │       └── nvidia/
│   │   │   │           ├── __init__.py
│   │   │   │           └── spark/
│   │   │   │               ├── __init__.py
│   │   │   │               └── examples/
│   │   │   │                   ├── __init__.py
│   │   │   │                   └── agaricus/
│   │   │   │                       ├── __init__.py
│   │   │   │                       └── main.py
│   │   │   └── scala/
│   │   │       └── src/
│   │   │           └── com/
│   │   │               └── nvidia/
│   │   │                   └── spark/
│   │   │                       └── examples/
│   │   │                           └── agaricus/
│   │   │                               └── Main.scala
│   │   ├── aggregator/
│   │   │   └── .gitignore
│   │   ├── app-parameters/
│   │   │   ├── supported_xgboost_parameters_python.md
│   │   │   └── supported_xgboost_parameters_scala.md
│   │   ├── assembly/
│   │   │   └── assembly-no-scala.xml
│   │   ├── main.py
│   │   ├── mortgage/
│   │   │   ├── .gitignore
│   │   │   ├── notebooks/
│   │   │   │   ├── python/
│   │   │   │   │   ├── MortgageETL+XGBoost.ipynb
│   │   │   │   │   ├── MortgageETL.ipynb
│   │   │   │   │   ├── cv-mortgage-gpu.ipynb
│   │   │   │   │   └── mortgage-gpu.ipynb
│   │   │   │   └── scala/
│   │   │   │       ├── mortgage-ETL.ipynb
│   │   │   │       ├── mortgage-gpu.ipynb
│   │   │   │       └── mortgage_gpu_crossvalidation.ipynb
│   │   │   ├── pom.xml
│   │   │   ├── python/
│   │   │   │   └── com/
│   │   │   │       ├── __init__.py
│   │   │   │       └── nvidia/
│   │   │   │           ├── __init__.py
│   │   │   │           └── spark/
│   │   │   │               ├── __init__.py
│   │   │   │               └── examples/
│   │   │   │                   ├── __init__.py
│   │   │   │                   └── mortgage/
│   │   │   │                       ├── __init__.py
│   │   │   │                       ├── consts.py
│   │   │   │                       ├── cross_validator_main.py
│   │   │   │                       ├── etl.py
│   │   │   │                       ├── etl_main.py
│   │   │   │                       └── main.py
│   │   │   └── scala/
│   │   │       └── src/
│   │   │           └── com/
│   │   │               └── nvidia/
│   │   │                   └── spark/
│   │   │                       └── examples/
│   │   │                           └── mortgage/
│   │   │                               ├── CrossValidationMain.scala
│   │   │                               ├── ETLMain.scala
│   │   │                               ├── Main.scala
│   │   │                               ├── Mortgage.scala
│   │   │                               └── XGBoostETL.scala
│   │   ├── pack_pyspark_example.sh
│   │   ├── pom.xml
│   │   ├── taxi/
│   │   │   ├── .gitignore
│   │   │   ├── notebooks/
│   │   │   │   ├── python/
│   │   │   │   │   ├── cv-taxi-gpu.ipynb
│   │   │   │   │   ├── taxi-ETL.ipynb
│   │   │   │   │   └── taxi-gpu.ipynb
│   │   │   │   └── scala/
│   │   │   │       ├── taxi-ETL.ipynb
│   │   │   │       ├── taxi-gpu.ipynb
│   │   │   │       └── taxi_gpu_crossvalidation.ipynb
│   │   │   ├── pom.xml
│   │   │   ├── python/
│   │   │   │   └── com/
│   │   │   │       ├── __init__.py
│   │   │   │       └── nvidia/
│   │   │   │           ├── __init__.py
│   │   │   │           └── spark/
│   │   │   │               ├── __init__.py
│   │   │   │               └── examples/
│   │   │   │                   ├── __init__.py
│   │   │   │                   └── taxi/
│   │   │   │                       ├── __init__.py
│   │   │   │                       ├── consts.py
│   │   │   │                       ├── cross_validator_main.py
│   │   │   │                       ├── etl_main.py
│   │   │   │                       ├── main.py
│   │   │   │                       └── pre_process.py
│   │   │   └── scala/
│   │   │       └── src/
│   │   │           └── com/
│   │   │               └── nvidia/
│   │   │                   └── spark/
│   │   │                       └── examples/
│   │   │                           └── taxi/
│   │   │                               ├── CrossValidationMain.scala
│   │   │                               ├── ETLMain.scala
│   │   │                               ├── Main.scala
│   │   │                               └── Taxi.scala
│   │   └── utility/
│   │       ├── .gitignore
│   │       ├── pom.xml
│   │       ├── python/
│   │       │   └── com/
│   │       │       ├── __init__.py
│   │       │       └── nvidia/
│   │       │           ├── __init__.py
│   │       │           └── spark/
│   │       │               ├── __init__.py
│   │       │               └── examples/
│   │       │                   ├── __init__.py
│   │       │                   ├── main.py
│   │       │                   └── utility/
│   │       │                       ├── __init__.py
│   │       │                       ├── args.py
│   │       │                       └── utils.py
│   │       └── scala/
│   │           └── src/
│   │               └── com/
│   │                   └── nvidia/
│   │                       └── spark/
│   │                           └── examples/
│   │                               └── utility/
│   │                                   ├── Benchmark.scala
│   │                                   ├── SparkSetup.scala
│   │                                   ├── Vectorize.scala
│   │                                   └── XGBoostArgs.scala
│   └── spark-connect-gpu/
│       ├── client/
│       │   ├── Dockerfile
│       │   ├── README.md
│       │   ├── docker-compose.yaml
│       │   ├── nds/
│       │   │   ├── nds.ipynb
│       │   │   └── query_0.sql
│       │   ├── notebook/
│       │   │   ├── README.md
│       │   │   ├── spark-connect-gpu-etl-ml.ipynb
│       │   │   └── work/
│       │   │       ├── csv_raw_schema.ddl
│       │   │       └── name_mapping.csv
│       │   ├── python/
│       │   │   ├── batch-job.ipynb
│       │   │   └── batch-job.py
│       │   ├── requirements.txt
│       │   └── scala/
│       │       ├── .gitignore
│       │       ├── pom.xml
│       │       ├── run.sh
│       │       ├── scala-run.ipynb
│       │       └── src/
│       │           └── main/
│       │               └── scala/
│       │                   └── connect.scala
│       └── server/
│           ├── README.md
│           ├── docker-compose.yaml
│           ├── proxy-service/
│           │   ├── Dockerfile
│           │   └── nginx.conf
│           ├── spark-connect-server/
│           │   ├── Dockerfile
│           │   ├── requirements.txt
│           │   ├── spark-defaults.conf
│           │   └── spark-env.sh
│           ├── spark-master/
│           │   ├── Dockerfile
│           │   └── spark-env.sh
│           └── spark-worker/
│               ├── Dockerfile
│               ├── requirements.txt
│               └── spark-env.sh
├── scripts/
│   ├── README.md
│   ├── building/
│   │   └── python_build.sh
│   ├── csp-startup-scripts/
│   │   ├── README.md
│   │   └── emr/
│   │       ├── cgroup-bootstrap-action-emr6.sh
│   │       ├── cgroup-bootstrap-action-emr7.sh
│   │       ├── config-emr6.json
│   │       ├── config-emr7.json
│   │       └── emr-spark-plugin-startup.py
│   ├── encoding/
│   │   └── python/
│   │       ├── .gitignore
│   │       ├── com/
│   │       │   ├── __init__.py
│   │       │   └── nvidia/
│   │       │       ├── __init__.py
│   │       │       └── spark/
│   │       │           ├── __init__.py
│   │       │           └── encoding/
│   │       │               ├── __init__.py
│   │       │               ├── criteo/
│   │       │               │   ├── __init__.py
│   │       │               │   ├── common.py
│   │       │               │   ├── one_hot_cpu_main.py
│   │       │               │   └── target_cpu_main.py
│   │       │               ├── main.py
│   │       │               └── utility/
│   │       │                   ├── __init__.py
│   │       │                   ├── args.py
│   │       │                   └── utils.py
│   │       └── main.py
│   └── encoding-sample/
│       ├── repartition.py
│       ├── run.sh
│       └── truncate-model.py
└── tools/
    ├── databricks/
    │   ├── README.md
    │   ├── [RAPIDS Accelerator for Apache Spark] Profiling Tool Notebook Template.ipynb
    │   └── [RAPIDS Accelerator for Apache Spark] Qualification Tool Notebook Template.ipynb
    └── emr/
        ├── README.md
        ├── [RAPIDS Accelerator for Apache Spark] Profiling Tool Notebook Template.ipynb
        └── [RAPIDS Accelerator for Apache Spark] Qualification Tool Notebook Template.ipynb
SYMBOL INDEX (420 symbols across 47 files)

FILE: examples/MIG-Support/device-plugins/gpu-mig/src/main/java/com/nvidia/spark/NvidiaGPUMigPluginForRuntimeV2.java
  class NvidiaGPUMigPluginForRuntimeV2 (line 52) | public class NvidiaGPUMigPluginForRuntimeV2 implements DevicePlugin,
    method getRegisterRequestInfo (line 97) | @Override
    method getDevices (line 103) | @Override
    method shouldThrowOnMultipleGPUs (line 202) | private Boolean shouldThrowOnMultipleGPUs() {
    method onDevicesAllocated (line 210) | @Override
    method onDevicesReleased (line 243) | @Override
    method getMajorNumber (line 249) | private String getMajorNumber(String devName) {
    method allocateDevices (line 268) | @Override
    method basicSchedule (line 282) | public void basicSchedule(Set<Device> allocation, int count,
    class NvidiaCommandExecutor (line 303) | public class NvidiaCommandExecutor {
      method getDeviceInfo (line 305) | public String getDeviceInfo() throws IOException {
      method getDeviceMigInfo (line 311) | public String getDeviceMigInfo() throws IOException {
      method getMajorMinorInfo (line 316) | public String getMajorMinorInfo(String devName) throws IOException {
      method searchBinary (line 324) | public void searchBinary() throws Exception {
    method setPathOfGpuBinary (line 363) | public void setPathOfGpuBinary(String pOfGpuBinary) {
    method setShellExecutor (line 368) | public void setShellExecutor(NvidiaCommandExecutor shellExecutor) {
    method setMigDevices (line 373) | public void setMigDevices(Map<Integer, String> migDevices) {
    method setShouldThrowOnMultipleGPUFromConf (line 378) | public void setShouldThrowOnMultipleGPUFromConf(Boolean shouldThrow) {

FILE: examples/MIG-Support/device-plugins/gpu-mig/src/test/java/com/nvidia/spark/TestNvidiaGPUMigPluginForRuntimeV2.java
  class TestNvidiaGPUMigPluginForRuntimeV2 (line 37) | public class TestNvidiaGPUMigPluginForRuntimeV2 {
    method testGetNvidiaDevices (line 42) | @Test
    method testOnDeviceAllocatedMultiGPU (line 93) | @Test(expected = Exception.class)
    method testMultiGPUsEnvPrecedence (line 125) | @Test
    method testMultiGPUsConf (line 162) | @Test
    method testOnDeviceAllocatedMig (line 195) | @Test
    method testOnDeviceAllocatedNoMig (line 221) | @Test

FILE: examples/ML+DL-Examples/Spark-DL/dl_inference/server_utils.py
  function _find_ports (line 41) | def _find_ports(num_ports: int, start_port: int = 7000) -> List[int]:
  function _get_valid_vllm_parameters_task (line 55) | def _get_valid_vllm_parameters_task() -> Set[str]:
  function _start_triton_server_task (line 74) | def _start_triton_server_task(
  function _start_vllm_server_task (line 158) | def _start_vllm_server_task(
  function _stop_server_task (line 225) | def _stop_server_task(
  class ServerManager (line 255) | class ServerManager:
    method __init__ (line 270) | def __init__(self, model_name: str, model_path: Optional[str] = None):
    method _get_num_executors (line 284) | def _get_num_executors(self) -> int:
    method host_to_http_url (line 299) | def host_to_http_url(self) -> Dict[str, str]:
    method _get_node_rdd (line 310) | def _get_node_rdd(self) -> RDD:
    method _use_stage_level_scheduling (line 316) | def _use_stage_level_scheduling(self, rdd: RDD) -> RDD:
    method start_servers (line 354) | def start_servers(
    method stop_servers (line 399) | def stop_servers(
  class TritonServerManager (line 443) | class TritonServerManager(ServerManager):
    method __init__ (line 459) | def __init__(self, model_name: str, model_path: Optional[str] = None):
    method host_to_grpc_url (line 463) | def host_to_grpc_url(self) -> Dict[str, str]:
    method start_servers (line 474) | def start_servers(
  class VLLMServerManager (line 499) | class VLLMServerManager(ServerManager):
    method __init__ (line 517) | def __init__(self, model_name: str, model_path: str = None):
    method _get_valid_vllm_parameters (line 521) | def _get_valid_vllm_parameters(self) -> List[str]:
    method _validate_vllm_kwargs (line 526) | def _validate_vllm_kwargs(self, kwargs: Dict[str, Any]):
    method start_servers (line 545) | def start_servers(
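The `server_utils.py` index above lists `_find_ports(num_ports, start_port=7000)`, used to locate free ports before launching Triton or vLLM servers. A minimal sketch of how such a helper might probe for free ports using only the standard library (the bind-and-retry strategy is an assumption, not the repo's actual implementation):

```python
import socket
from typing import List

def find_ports(num_ports: int, start_port: int = 7000) -> List[int]:
    """Scan sequential ports from start_port, returning the first
    num_ports that can be bound (i.e., are currently free)."""
    ports: List[int] = []
    port = start_port
    while len(ports) < num_ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(("localhost", port))
                ports.append(port)
            except OSError:
                pass  # port already in use; try the next one
        port += 1
    return ports
```

A server manager can then hand each executor a distinct port without collisions, at the cost of a small race window between probing and actual server startup.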

FILE: examples/SQL+DF-Examples/customer-churn/notebooks/python/churn/augment.py
  function get_currency_type (line 33) | def get_currency_type():
  function _register_session (line 52) | def _register_session(s):
  function _get_uniques (line 56) | def _get_uniques(ct):
  function register_options (line 95) | def register_options(**kwargs):
  function load_supplied_data (line 100) | def load_supplied_data(session, input_file):
  function replicate_df (line 150) | def replicate_df(df, duplicates):
  function examine_categoricals (line 163) | def examine_categoricals(df, columns=None):
  function billing_events (line 189) | def billing_events(df):
  function resolve_path (line 244) | def resolve_path(name):
  function write_df (line 254) | def write_df(df, name, skip_replication=False, partition_by=None):
  function customer_meta (line 275) | def customer_meta(df):
  function phone_features (line 330) | def phone_features(df):
  function internet_features (line 341) | def internet_features(df):
  function account_features (line 370) | def account_features(df):
  function debug_augmentation (line 393) | def debug_augmentation(df):

FILE: examples/SQL+DF-Examples/customer-churn/notebooks/python/churn/eda.py
  function isnumeric (line 20) | def isnumeric(data_type):
  function percent_true (line 25) | def percent_true(df, cols):
  function cardinalities (line 30) | def cardinalities(df, cols):
  function likely_unique (line 40) | def likely_unique(counts):
  function likely_categoricals (line 45) | def likely_categoricals(counts):
  function unique_values (line 49) | def unique_values(df, cols):
  function unique_values_array (line 55) | def unique_values_array(df, cols):
  function unique_values_driver (line 69) | def unique_values_driver(df, cols):
  function approx_ecdf (line 72) | def approx_ecdf(df, cols):
  function gen_summary (line 83) | def gen_summary(df, output_prefix=""):
  function losses_by_month (line 125) | def losses_by_month(be):
  function output_reports (line 129) | def output_reports(df, be=None, report_prefix=""):
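The churn EDA module indexes an `approx_ecdf(df, cols)` helper. That function operates on Spark DataFrames; as a rough illustration of the underlying idea, here is a plain-Python analogue that approximates an empirical CDF with nearest-rank quantiles (the exact quantile method used by the repo is an assumption):

```python
def approx_ecdf_values(values, quantiles=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Approximate the empirical CDF of a numeric sequence by
    reporting the value at each requested quantile (nearest rank)."""
    xs = sorted(values)
    n = len(xs)
    if n == 0:
        return {}
    return {q: xs[min(n - 1, int(q * (n - 1) + 0.5))] for q in quantiles}
```

On Spark, the same summary is typically obtained via `DataFrame.approxQuantile`, which trades exactness for a single distributed pass.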

FILE: examples/SQL+DF-Examples/customer-churn/notebooks/python/churn/etl.py
  function register_options (line 29) | def register_options(**kwargs):
  function _register_session (line 34) | def _register_session(s):
  function _register_views (line 38) | def _register_views(lvars, *names):
  function withsession (line 43) | def withsession(df_arg=0):
  function read_df (line 51) | def read_df(session, fn):
  function find_customers (line 61) | def find_customers(billing_events_df):
  function customers (line 68) | def customers():
  function join_billing_data (line 72) | def join_billing_data(billing_events_df):
  function join_phone_features (line 117) | def join_phone_features(phone_features_df):
  function untidy_feature (line 152) | def untidy_feature(df, feature):
  function chained_join (line 158) | def chained_join(column, base_df, dfs, how="leftouter"):
  function resolve_nullable_column (line 166) | def resolve_nullable_column(df, col, null_val="No"):
  function resolve_dependent_column (line 170) | def resolve_dependent_column(
  function join_internet_features (line 184) | def join_internet_features(internet_features_df):
  function join_account_features (line 247) | def join_account_features(account_features_df):
  function process_account_meta (line 271) | def process_account_meta(account_meta_df, usecal=None):
  function forcefloat (line 302) | def forcefloat(c):
  function join_wide_table (line 306) | def join_wide_table(customer_billing, customer_phone_features, customer_...
  function cast_and_coalesce_wide_data (line 346) | def cast_and_coalesce_wide_data(wd):
  function write_df (line 374) | def write_df(df, name):
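Among the churn ETL helpers is `chained_join(column, base_df, dfs, how="leftouter")`, which folds a sequence of feature tables onto a base table. A plain-Python analogue of that fold over lists of dicts (the Spark version joins DataFrames; this sketch only illustrates the reduce-over-joins shape):

```python
from functools import reduce

def chained_join(key, base, tables):
    """Left-join a base list of row-dicts with several lookup tables
    on `key`, folding the joins left to right."""
    def join_one(rows, table):
        lookup = {r[key]: r for r in table}
        return [
            {**row, **{k: v for k, v in lookup.get(row[key], {}).items() if k != key}}
            for row in rows
        ]
    return reduce(join_one, tables, base)
```

Folding with `reduce` keeps the join order explicit and lets any number of feature tables be attached without nesting calls by hand.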

FILE: examples/UDF-Examples/RAPIDS-accelerated-UDFs/conftest.py
  function pytest_addoption (line 15) | def pytest_addoption(parser):

FILE: examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/cpp/benchmarks/cosine_similarity/cosine_similarity_benchmark.cpp
  function cosine_similarity_bench_args (line 26) | static void cosine_similarity_bench_args(benchmark::internal::Benchmark* b)
  function BM_cosine_similarity (line 45) | static void BM_cosine_similarity(benchmark::State& state)
  class CosineSimilarity (line 82) | class CosineSimilarity : public native_udf::benchmark {

FILE: examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/cpp/benchmarks/fixture/benchmark_fixture.hpp
  type native_udf (line 23) | namespace native_udf {
    function make_cuda (line 27) | inline auto make_cuda() { return std::make_shared<rmm::mr::cuda_memory...
    function make_pool (line 29) | inline auto make_pool()
    class benchmark (line 66) | class benchmark : public ::benchmark::Fixture {
      method SetUp (line 68) | virtual void SetUp(const ::benchmark::State& state)
      method TearDown (line 74) | virtual void TearDown(const ::benchmark::State& state)
      method SetUp (line 82) | virtual void SetUp(::benchmark::State& st) { SetUp(const_cast<const ...
      method TearDown (line 83) | virtual void TearDown(::benchmark::State& st)

FILE: examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/cpp/benchmarks/synchronization/synchronization.hpp
  class cuda_event_timer (line 69) | class cuda_event_timer {
    method cuda_event_timer (line 87) | cuda_event_timer() = delete;

FILE: examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/cpp/src/CosineSimilarityJni.cpp
  function throw_java_exception (line 38) | void throw_java_exception(JNIEnv* env, char const* class_name, char cons...
  function JNIEXPORT (line 59) | JNIEXPORT jlong JNICALL

FILE: examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/cpp/src/StringWordCountJni.cpp
  function throw_java_exception (line 36) | void throw_java_exception(JNIEnv* env, char const* class_name, char cons...
  function JNIEXPORT (line 55) | JNIEXPORT jlong JNICALL

FILE: examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/java/com/nvidia/spark/rapids/udf/hive/DecimalFraction.java
  class DecimalFraction (line 40) | public class DecimalFraction extends GenericUDF implements RapidsUDF {
    method getDisplayString (line 43) | @Override
    method initialize (line 48) | @Override
    method evaluate (line 67) | @Override
    method evaluateColumnar (line 83) | @Override

FILE: examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/java/com/nvidia/spark/rapids/udf/hive/StringWordCount.java
  class StringWordCount (line 36) | public class StringWordCount extends UDF implements RapidsUDF {
    method evaluate (line 40) | public Integer evaluate(String str) {
    method evaluateColumnar (line 61) | @Override
    method countWords (line 85) | private static native long countWords(long stringsView);

FILE: examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/java/com/nvidia/spark/rapids/udf/hive/URLDecode.java
  class URLDecode (line 34) | public class URLDecode extends UDF implements RapidsUDF {
    method evaluate (line 37) | public String evaluate(String s) {
    method evaluateColumnar (line 53) | @Override

FILE: examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/java/com/nvidia/spark/rapids/udf/hive/URLEncode.java
  class URLEncode (line 40) | public class URLEncode extends GenericUDF implements RapidsUDF {
    method getDisplayString (line 45) | @Override
    method initialize (line 51) | @Override
    method evaluate (line 74) | @Override
    method evaluateColumnar (line 95) | @Override

FILE: examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/java/com/nvidia/spark/rapids/udf/java/CosineSimilarity.java
  class CosineSimilarity (line 29) | public class CosineSimilarity
    method call (line 33) | @Override
    method magnitude (line 53) | private double magnitude(WrappedArray<Float> v) {
    method evaluateColumnar (line 63) | @Override
    method cosineSimilarity (line 81) | private static native long cosineSimilarity(long vectorView1, long vec...

FILE: examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/java/com/nvidia/spark/rapids/udf/java/DecimalFraction.java
  class DecimalFraction (line 31) | public class DecimalFraction implements UDF1<BigDecimal, BigDecimal>, Ra...
    method call (line 33) | @Override
    method evaluateColumnar (line 42) | @Override

FILE: examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/java/com/nvidia/spark/rapids/udf/java/NativeUDFExamplesLoader.java
  class NativeUDFExamplesLoader (line 24) | public class NativeUDFExamplesLoader {
    method ensureLoaded (line 28) | public static synchronized void ensureLoaded() {

FILE: examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/java/com/nvidia/spark/rapids/udf/java/URLDecode.java
  class URLDecode (line 34) | public class URLDecode implements UDF1<String, String>, RapidsUDF {
    method call (line 36) | @Override
    method evaluateColumnar (line 53) | @Override

FILE: examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/java/com/nvidia/spark/rapids/udf/java/URLEncode.java
  class URLEncode (line 33) | public class URLEncode implements UDF1<String, String>, RapidsUDF {
    method call (line 35) | @Override
    method evaluateColumnar (line 52) | @Override

FILE: examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/python/asserts.py
  function _assert_equal (line 28) | def _assert_equal(cpu, gpu, float_check, path):
  function assert_equal (line 100) | def assert_equal(cpu, gpu):
  function _has_incompat_conf (line 109) | def _has_incompat_conf(conf):
  class _RowCmp (line 113) | class _RowCmp(object):
    method __init__ (line 115) | def __init__(self, wrapped):
    method cmp (line 132) | def cmp(self, other):
    method __lt__ (line 159) | def __lt__(self, other):
    method __gt__ (line 162) | def __gt__(self, other):
    method __eq__ (line 165) | def __eq__(self, other):
    method __le__ (line 168) | def __le__(self, other):
    method __ge__ (line 171) | def __ge__(self, other):
    method __ne__ (line 174) | def __ne__(self, other):
  function _prep_func_for_compare (line 177) | def _prep_func_for_compare(func, mode):
  function _prep_incompat_conf (line 216) | def _prep_incompat_conf(conf):
  function _assert_gpu_and_cpu_writes_are_equal (line 224) | def _assert_gpu_and_cpu_writes_are_equal(
  function assert_gpu_and_cpu_writes_are_equal_collect (line 258) | def assert_gpu_and_cpu_writes_are_equal_collect(write_func, read_func, b...
  function assert_gpu_and_cpu_writes_are_equal_iterator (line 267) | def assert_gpu_and_cpu_writes_are_equal_iterator(write_func, read_func, ...
  function assert_gpu_fallback_write (line 276) | def assert_gpu_fallback_write(write_func,
  function assert_cpu_and_gpu_are_equal_collect_with_capture (line 312) | def assert_cpu_and_gpu_are_equal_collect_with_capture(func,
  function assert_cpu_and_gpu_are_equal_sql_with_capture (line 343) | def assert_cpu_and_gpu_are_equal_sql_with_capture(df_fun,
  function assert_gpu_fallback_collect (line 361) | def assert_gpu_fallback_collect(func,
  function assert_gpu_sql_fallback_collect (line 386) | def assert_gpu_sql_fallback_collect(df_fun, cpu_fallback_class_name, tab...
  function _assert_gpu_and_cpu_are_equal (line 398) | def _assert_gpu_and_cpu_are_equal(func,
  function run_with_cpu (line 438) | def run_with_cpu(func,
  function run_with_cpu_and_gpu (line 464) | def run_with_cpu_and_gpu(func,
  function assert_gpu_and_cpu_are_equal_collect (line 499) | def assert_gpu_and_cpu_are_equal_collect(func, conf={}, is_cpu_first=True):
  function assert_gpu_and_cpu_are_equal_iterator (line 507) | def assert_gpu_and_cpu_are_equal_iterator(func, conf={}, is_cpu_first=Tr...
  function assert_gpu_and_cpu_row_counts_equal (line 515) | def assert_gpu_and_cpu_row_counts_equal(func, conf={}, is_cpu_first=True):
  function assert_gpu_and_cpu_are_equal_sql (line 523) | def assert_gpu_and_cpu_are_equal_sql(df_fun, table_name, sql, conf=None,...
  function assert_py4j_exception (line 549) | def assert_py4j_exception(func, error_message):
  function assert_gpu_and_cpu_error (line 560) | def assert_gpu_and_cpu_error(df_fun, conf, error_message):
  function with_cpu_sql (line 572) | def with_cpu_sql(df_fun, table_name, sql, conf=None, debug=False):
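The `asserts.py` index above centers on comparing CPU and GPU query results with a float tolerance (`_assert_equal(cpu, gpu, float_check, path)`). A minimal sketch of the row-comparison idea, using `math.isclose` for floats and treating NaN as equal to NaN (the tolerance policy here is an assumption, not the repo's exact rules):

```python
import math

def rows_equal(cpu_row, gpu_row, rel_tol=1e-6):
    """Compare two result rows field by field, allowing a relative
    tolerance for floats so CPU/GPU rounding differences pass."""
    if len(cpu_row) != len(gpu_row):
        return False
    for c, g in zip(cpu_row, gpu_row):
        if isinstance(c, float) and isinstance(g, float):
            if math.isnan(c) and math.isnan(g):
                continue  # treat NaN == NaN for result comparison
            if not math.isclose(c, g, rel_tol=rel_tol):
                return False
        elif c != g:
            return False
    return True
```

Exact equality on floats would produce spurious failures, since GPU kernels may legitimately reorder floating-point reductions.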

FILE: examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/python/conftest.py
  function get_float_check (line 22) | def get_float_check():
  function is_incompat (line 30) | def is_incompat():
  function should_sort_on_spark (line 36) | def should_sort_on_spark():
  function should_sort_locally (line 39) | def should_sort_locally():
  function is_allowing_any_non_gpu (line 45) | def is_allowing_any_non_gpu():
  function get_non_gpu_allowed (line 48) | def get_non_gpu_allowed():
  function get_validate_execs_in_gpu_plan (line 51) | def get_validate_execs_in_gpu_plan():
  function runtime_env (line 56) | def runtime_env():
  function is_apache_runtime (line 59) | def is_apache_runtime():
  function is_databricks_runtime (line 62) | def is_databricks_runtime():
  function is_emr_runtime (line 65) | def is_emr_runtime():
  function is_dataproc_runtime (line 68) | def is_dataproc_runtime():
  function is_nightly_run (line 74) | def is_nightly_run():
  function is_at_least_precommit_run (line 77) | def is_at_least_precommit_run():
  function skip_unless_nightly_tests (line 80) | def skip_unless_nightly_tests(description):
  function skip_unless_precommit_tests (line 86) | def skip_unless_precommit_tests(description):
  function get_limit (line 96) | def get_limit():
  function _get_limit_from_mark (line 99) | def _get_limit_from_mark(mark):
  function pytest_runtest_setup (line 105) | def pytest_runtest_setup(item):
  function pytest_configure (line 182) | def pytest_configure(config):
  function pytest_collection_modifyitems (line 195) | def pytest_collection_modifyitems(config, items):
  function std_input_path (line 228) | def std_input_path(request):
  function spark_tmp_path (line 236) | def spark_tmp_path(request):
  class TmpTableFactory (line 252) | class TmpTableFactory:
    method __init__ (line 253) | def __init__(self, base_id):
    method get (line 257) | def get(self):
  function spark_tmp_table_factory (line 263) | def spark_tmp_table_factory(request):
  function _get_jvm_session (line 273) | def _get_jvm_session(spark):
  function _get_jvm (line 276) | def _get_jvm(spark):
  function spark_jvm (line 279) | def spark_jvm():
  class MortgageRunner (line 282) | class MortgageRunner:
    method __init__ (line 283) | def __init__(self, mortgage_format, mortgage_acq_path, mortgage_perf_p...
    method do_test_query (line 288) | def do_test_query(self, spark):
  function mortgage (line 306) | def mortgage(request):
  function enable_cudf_udf (line 319) | def enable_cudf_udf(request):
  function enable_rapids_udf_example_native (line 326) | def enable_rapids_udf_example_native(request):
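The `conftest.py` index includes a `TmpTableFactory` with `__init__(self, base_id)` and `get(self)`, backing the `spark_tmp_table_factory` fixture. A guess at the pattern (the exact naming scheme is an assumption): hand out unique temp-table names derived from a per-test base id so parallel tests never collide.

```python
import itertools

class TmpTableFactory:
    """Produce unique temporary table names from a base id,
    e.g. 'tbl_0', 'tbl_1', ... (naming scheme assumed)."""
    def __init__(self, base_id: str):
        self._base_id = base_id
        self._counter = itertools.count()

    def get(self) -> str:
        return f"{self._base_id}_{next(self._counter)}"
```

A fixture can construct one factory per test with a base id derived from the test node, making every `get()` call collision-free across the whole run.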

FILE: examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/python/data_gen.py
  class DataGen (line 30) | class DataGen:
    method __repr__ (line 33) | def __repr__(self):
    method __hash__ (line 38) | def __hash__(self):
    method __eq__ (line 41) | def __eq__(self, other):
    method __ne__ (line 44) | def __ne__(self, other):
    method __init__ (line 47) | def __init__(self, data_type, nullable=True, special_cases =[]):
    method copy_special_case (line 69) | def copy_special_case(self, special_case, weight=1.0):
    method with_special_case (line 76) | def with_special_case(self, special_case, weight=1.0):
    method get_types (line 91) | def get_types(self):
    method start (line 95) | def start(self, rand):
    method _start (line 99) | def _start(self, rand, gen_func):
    method gen (line 119) | def gen(self, force_no_nulls=False):
    method contains_ts (line 129) | def contains_ts(self):
  class ConvertGen (line 133) | class ConvertGen(DataGen):
    method __init__ (line 135) | def __init__(self, child_gen, func, data_type=None, nullable=True):
    method __repr__ (line 142) | def __repr__(self):
    method start (line 145) | def start(self, rand):
  class StringGen (line 153) | class StringGen(DataGen):
    method __init__ (line 155) | def __init__(self, pattern="(.|\n){1,30}", flags=0, charset=sre_yield....
    method with_special_pattern (line 159) | def with_special_pattern(self, pattern, flags=0, charset=sre_yield.CHA...
    method start (line 171) | def start(self, rand):
  class ByteGen (line 181) | class ByteGen(DataGen):
    method __init__ (line 183) | def __init__(self, nullable=True, min_val = BYTE_MIN, max_val = BYTE_M...
    method start (line 188) | def start(self, rand):
  class ShortGen (line 193) | class ShortGen(DataGen):
    method __init__ (line 195) | def __init__(self, nullable=True, min_val = SHORT_MIN, max_val = SHORT...
    method start (line 201) | def start(self, rand):
  class IntegerGen (line 206) | class IntegerGen(DataGen):
    method __init__ (line 208) | def __init__(self, nullable=True, min_val = INT_MIN, max_val = INT_MAX,
    method start (line 214) | def start(self, rand):
  class DecimalGen (line 217) | class DecimalGen(DataGen):
    method __init__ (line 219) | def __init__(self, precision=None, scale=None, nullable=True, special_...
    method __repr__ (line 232) | def __repr__(self):
    method start (line 235) | def start(self, rand):
  class LongGen (line 245) | class LongGen(DataGen):
    method __init__ (line 247) | def __init__(self, nullable=True, min_val = LONG_MIN, max_val = LONG_M...
    method start (line 253) | def start(self, rand):
  class LongRangeGen (line 256) | class LongRangeGen(DataGen):
    method __init__ (line 258) | def __init__(self, nullable=False, start_val=0, direction="inc"):
    method start (line 275) | def start(self, rand):
  class RepeatSeqGen (line 279) | class RepeatSeqGen(DataGen):
    method __init__ (line 281) | def __init__(self, child, length):
    method __repr__ (line 289) | def __repr__(self):
    method _loop_values (line 292) | def _loop_values(self):
    method start (line 297) | def start(self, rand):
  class SetValuesGen (line 303) | class SetValuesGen(DataGen):
    method __init__ (line 305) | def __init__(self, data_type, data):
    method __repr__ (line 310) | def __repr__(self):
    method start (line 313) | def start(self, rand):
  class FloatGen (line 324) | class FloatGen(DataGen):
    method __init__ (line 326) | def __init__(self, nullable=True,
    method _fixup_nans (line 338) | def _fixup_nans(self, v):
    method start (line 343) | def start(self, rand):
  class DoubleGen (line 359) | class DoubleGen(DataGen):
    method __init__ (line 361) | def __init__(self, min_exp=DOUBLE_MIN_EXP, max_exp=DOUBLE_MAX_EXP, no_...
    method make_from (line 388) | def make_from(sign, exp, fraction):
    method _fixup_nans (line 397) | def _fixup_nans(self, v):
    method start (line 402) | def start(self, rand):
  class BooleanGen (line 417) | class BooleanGen(DataGen):
    method __init__ (line 419) | def __init__(self, nullable=True):
    method start (line 422) | def start(self, rand):
  class StructGen (line 425) | class StructGen(DataGen):
    method __init__ (line 427) | def __init__(self, children, nullable=True, special_cases=[]):
    method __repr__ (line 438) | def __repr__(self):
    method start (line 441) | def start(self, rand):
    method contains_ts (line 449) | def contains_ts(self):
  class DateGen (line 452) | class DateGen(DataGen):
    method __init__ (line 454) | def __init__(self, start=None, end=None, nullable=True):
    method _guess_leap_year (line 491) | def _guess_leap_year(t):
    method _to_days_since_epoch (line 501) | def _to_days_since_epoch(self, val):
    method _from_days_since_epoch (line 504) | def _from_days_since_epoch(self, days):
    method start (line 507) | def start(self, rand):
  class TimestampGen (line 512) | class TimestampGen(DataGen):
    method __init__ (line 514) | def __init__(self, start=None, end=None, nullable=True):
    method _to_ms_since_epoch (line 542) | def _to_ms_since_epoch(self, val):
    method _from_ms_since_epoch (line 545) | def _from_ms_since_epoch(self, ms):
    method start (line 548) | def start(self, rand):
    method contains_ts (line 553) | def contains_ts(self):
  class ArrayGen (line 556) | class ArrayGen(DataGen):
    method __init__ (line 558) | def __init__(self, child_gen, min_length=0, max_length=20, nullable=Tr...
    method __repr__ (line 565) | def __repr__(self):
    method start (line 568) | def start(self, rand):
    method contains_ts (line 577) | def contains_ts(self):
  class MapGen (line 580) | class MapGen(DataGen):
    method __init__ (line 582) | def __init__(self, key_gen, value_gen, min_length=0, max_length=20, nu...
    method __repr__ (line 591) | def __repr__(self):
    method start (line 594) | def start(self, rand):
    method contains_ts (line 602) | def contains_ts(self):
  class NullGen (line 606) | class NullGen(DataGen):
    method __init__ (line 608) | def __init__(self):
    method start (line 611) | def start(self, rand):
  function skip_if_not_utc (line 616) | def skip_if_not_utc():
  function gen_df (line 620) | def gen_df(spark, data_gen, length=2048, seed=0, num_slices=None):
  function _mark_as_lit (line 642) | def _mark_as_lit(data, data_type):
  function _gen_scalars_common (line 675) | def _gen_scalars_common(data_gen, count, seed=0):
  function gen_scalars (line 689) | def gen_scalars(data_gen, count, seed=0, force_no_nulls=False):
  function gen_scalar (line 697) | def gen_scalar(data_gen, seed=0, force_no_nulls=False):
  function gen_scalar_values (line 702) | def gen_scalar_values(data_gen, count, seed=0, force_no_nulls=False):
  function gen_scalar_value (line 707) | def gen_scalar_value(data_gen, seed=0, force_no_nulls=False):
  function debug_df (line 712) | def debug_df(df, path = None, file_format = 'json', num_parts = 1):
  function print_params (line 737) | def print_params(data_gen):
  function idfn (line 740) | def idfn(val):
  function meta_idfn (line 744) | def meta_idfn(meta):
  function three_col_df (line 749) | def three_col_df(spark, a_gen, b_gen, c_gen, length=2048, seed=0, num_sl...
  function two_col_df (line 753) | def two_col_df(spark, a_gen, b_gen, length=2048, seed=0, num_slices=None):
  function binary_op_df (line 757) | def binary_op_df(spark, gen, length=2048, seed=0, num_slices=None):
  function unary_op_df (line 760) | def unary_op_df(spark, gen, length=2048, seed=0, num_slices=None):
  function to_cast_string (line 764) | def to_cast_string(spark_type):
  function get_null_lit_string (line 795) | def get_null_lit_string(spark_type):
  function _convert_to_sql (line 802) | def _convert_to_sql(spark_type, data):
  function gen_scalars_for_sql (line 829) | def gen_scalars_for_sql(data_gen, count, seed=0, force_no_nulls=False):
  function copy_and_update (line 961) | def copy_and_update(conf, *more_confs):
  function append_unique_int_col_to_df (line 999) | def append_unique_int_col_to_df(spark, dataframe):
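The `data_gen.py` listing above outlines a family of generators built on a common `DataGen` base: `start(rand)` installs a seeded value function via `_start`, and `gen()` draws one value, honoring nullability and special cases. The real signatures are truncated above, so the following is only a minimal pure-Python sketch of that pattern (the 10% null rate and the simplified `IntegerGen` bounds are illustrative assumptions, not the repository's actual values).

```python
import random

class DataGen:
    """Sketch of the generator base: start() installs a value
    function, gen() draws one value, honoring nullability."""
    def __init__(self, nullable=True):
        self.nullable = nullable
        self._gen_func = None

    def _start(self, rand, gen_func):
        # Nullable gens occasionally emit None (rate is an assumption here).
        if self.nullable:
            self._gen_func = lambda: None if rand.random() < 0.1 else gen_func()
        else:
            self._gen_func = gen_func

    def start(self, rand):
        raise NotImplementedError

    def gen(self):
        if self._gen_func is None:
            raise RuntimeError("start(rand) must be called before gen()")
        return self._gen_func()

class IntegerGen(DataGen):
    """Uniform 32-bit-style integer generator, mirroring IntegerGen above."""
    def __init__(self, nullable=True, min_val=-(2**31), max_val=2**31 - 1):
        super().__init__(nullable)
        self._min, self._max = min_val, max_val

    def start(self, rand):
        self._start(rand, lambda: rand.randint(self._min, self._max))

rand = random.Random(0)          # fixed seed, as in gen_df(seed=0)
gen = IntegerGen(nullable=False, min_val=0, max_val=9)
gen.start(rand)
values = [gen.gen() for _ in range(5)]
```

In the repository these generators feed `gen_df`, which materializes the drawn rows into a Spark DataFrame; the sketch stops at plain Python values.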

FILE: examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/python/rapids_udf_test.py
  function drop_udf (line 24) | def drop_udf(spark, udfname):
  function skip_if_no_hive (line 27) | def skip_if_no_hive(spark):
  function load_hive_udf_or_skip_test (line 31) | def load_hive_udf_or_skip_test(spark, udfname, udfclass):
  function test_hive_simple_udf (line 35) | def test_hive_simple_udf():
  function test_hive_generic_udf (line 46) | def test_hive_generic_udf():
  function test_hive_simple_udf_native (line 65) | def test_hive_simple_udf_native():
  function load_java_udf_or_skip_test (line 76) | def load_java_udf_or_skip_test(spark, udfname, udfclass, udf_return_type...
  function test_java_url_decode (line 80) | def test_java_url_decode():
  function test_java_url_encode (line 86) | def test_java_url_encode():
  function test_java_decimal_fraction (line 92) | def test_java_decimal_fraction():
  function test_java_cosine_similarity_reasonable_range (line 108) | def test_java_cosine_similarity_reasonable_range():
  function test_java_cosine_similarity_with_nans (line 120) | def test_java_cosine_similarity_with_nans():
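The `load_hive_udf_or_skip_test` / `load_java_udf_or_skip_test` helpers above follow a load-or-skip pattern: when the UDF class is not available on the cluster, the test is skipped rather than failed. A standalone sketch of that pattern, using a hypothetical `available_udfs` registry in place of a live Spark session:

```python
import unittest

def load_udf_or_skip(available_udfs, udfname):
    # Mirror of load_*_udf_or_skip_test: skip the test rather than
    # fail when the UDF jar/class is not present in the environment.
    if udfname not in available_udfs:
        raise unittest.SkipTest(f'UDF {udfname} not available')
    return available_udfs[udfname]

# With an empty registry the helper raises SkipTest instead of asserting.
try:
    load_udf_or_skip({}, 'urldecode')
    skipped = False
except unittest.SkipTest:
    skipped = True
```

Under pytest, `unittest.SkipTest` (or `pytest.skip`) marks the test as skipped, so missing cluster-side UDF jars do not show up as failures.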

FILE: examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/python/spark_init_internal.py
  function _spark__init (line 26) | def _spark__init():
  function _handle_derby_dir (line 59) | def _handle_derby_dir(sb, driver_opts, wid):
  function _handle_event_log_dir (line 66) | def _handle_event_log_dir(sb, wid):
  function get_spark_i_know_what_i_am_doing (line 100) | def get_spark_i_know_what_i_am_doing():
  function spark_version (line 109) | def spark_version():

FILE: examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/python/spark_session.py
  function _from_scala_map (line 20) | def _from_scala_map(scala_map):
  function is_tz_utc (line 34) | def is_tz_utc(spark=_spark):
  function _set_all_confs (line 44) | def _set_all_confs(conf):
  function reset_spark_session_conf (line 49) | def reset_spark_session_conf():
  function _check_for_proper_return_values (line 60) | def _check_for_proper_return_values(something):
  function with_spark_session (line 65) | def with_spark_session(func, conf={}):
  function _add_job_description (line 75) | def _add_job_description(conf):
  function with_cpu_session (line 82) | def with_cpu_session(func, conf={}):
  function with_gpu_session (line 88) | def with_gpu_session(func, conf={}):
  function is_before_spark_311 (line 105) | def is_before_spark_311():
  function is_before_spark_320 (line 108) | def is_before_spark_320():
  function is_before_spark_330 (line 111) | def is_before_spark_330():
  function is_databricks91_or_later (line 114) | def is_databricks91_or_later():

FILE: examples/XGBoost-Examples/agaricus/python/com/nvidia/spark/examples/agaricus/main.py
  function main (line 28) | def main(args, xgboost_args):

FILE: examples/XGBoost-Examples/mortgage/python/com/nvidia/spark/examples/mortgage/cross_validator_main.py
  function main (line 25) | def main(args, xgboost_args):

FILE: examples/XGBoost-Examples/mortgage/python/com/nvidia/spark/examples/mortgage/etl.py
  function load_data (line 25) | def load_data(spark, paths, schema, args, extra_csv_opts={}):
  function prepare_rawDf (line 40) | def prepare_rawDf(spark, args):
  function extract_perf_columns (line 50) | def extract_perf_columns(rawDf):
  function prepare_performance (line 87) | def prepare_performance(spark, args, rawDf):
  function extract_acq_columns (line 179) | def extract_acq_columns(rawDf):
  function prepare_acquisition (line 213) | def prepare_acquisition(spark, args, rawDf):
  function extract_paths (line 218) | def extract_paths(paths, prefix):
  function etl (line 226) | def etl(spark, args):

FILE: examples/XGBoost-Examples/mortgage/python/com/nvidia/spark/examples/mortgage/etl_main.py
  function main (line 21) | def main(args, xgboost_args):

FILE: examples/XGBoost-Examples/mortgage/python/com/nvidia/spark/examples/mortgage/main.py
  function main (line 24) | def main(args, xgboost_args):

FILE: examples/XGBoost-Examples/taxi/python/com/nvidia/spark/examples/taxi/cross_validator_main.py
  function main (line 24) | def main(args, xgboost_args):

FILE: examples/XGBoost-Examples/taxi/python/com/nvidia/spark/examples/taxi/etl_main.py
  function main (line 22) | def main(args, xgboost_args):

FILE: examples/XGBoost-Examples/taxi/python/com/nvidia/spark/examples/taxi/main.py
  function main (line 23) | def main(args, xgboost_args):

FILE: examples/XGBoost-Examples/taxi/python/com/nvidia/spark/examples/taxi/pre_process.py
  function pre_process (line 23) | def pre_process(data_frame):
  function drop_useless (line 36) | def drop_useless(data_frame):
  function encode_categories (line 46) | def encode_categories(data_frame):
  function fill_na (line 52) | def fill_na(data_frame):
  function remove_invalid (line 55) | def remove_invalid(data_frame):
  function convert_datetime (line 68) | def convert_datetime(data_frame):
  function add_h_distance (line 82) | def add_h_distance(data_frame):
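The name `add_h_distance` in the taxi `pre_process.py` listing suggests a haversine (great-circle) distance column derived from pickup and dropoff coordinates. A standalone sketch of the haversine formula on plain floats, assuming coordinates in degrees (the column-level Spark version would express the same arithmetic with `pyspark.sql.functions`):

```python
import math

def haversine(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

# JFK airport to Times Square: roughly 20 km as the crow flies.
d = haversine(40.6413, -73.7781, 40.7580, -73.9855)
```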

FILE: examples/XGBoost-Examples/utility/python/com/nvidia/spark/examples/main.py
  function main (line 20) | def main():

FILE: examples/XGBoost-Examples/utility/python/com/nvidia/spark/examples/utility/args.py
  function _to_bool (line 23) | def _to_bool(literal):
  function _to_ratio_pair (line 27) | def _to_ratio_pair(literal):  # e.g., '80:20'
  function _validate_args (line 44) | def _validate_args(args):
  function _attach_derived_args (line 62) | def _attach_derived_args(args):
  function _inspect_xgb_parameters (line 69) | def _inspect_xgb_parameters() -> typing.Dict[str, type]:
  function parse_arguments (line 91) | def parse_arguments():
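The `args.py` listing includes small argparse type converters; the `_to_ratio_pair` docstring fragment shows the expected input shape (`'80:20'`). A sketch of what such converters plausibly do (exact validation rules are an assumption; the repository's names are underscored):

```python
def to_bool(literal):
    # Accept a small set of spellings; anything else is a parse error.
    lowered = literal.lower()
    if lowered in ('true', '1', 'yes'):
        return True
    if lowered in ('false', '0', 'no'):
        return False
    raise ValueError(f'not a boolean literal: {literal!r}')

def to_ratio_pair(literal):
    # '80:20' -> [0.8, 0.2]; the two parts must sum to 100.
    parts = [int(p) for p in literal.split(':')]
    if len(parts) != 2 or sum(parts) != 100:
        raise ValueError(f'not a ratio pair: {literal!r}')
    return [p / 100.0 for p in parts]

pair = to_ratio_pair('80:20')
flag = to_bool('True')
```

Registered as argparse `type=` callables, these turn a malformed CLI value into a clean usage error instead of a deep stack trace later.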

FILE: examples/XGBoost-Examples/utility/python/com/nvidia/spark/examples/utility/utils.py
  function merge_dicts (line 27) | def merge_dicts(dict_x, dict_y):
  function show_sample (line 33) | def show_sample(args, data_frame, label):
  function vectorize_data_frame (line 38) | def vectorize_data_frame(data_frame, label):
  function vectorize_data_frames (line 48) | def vectorize_data_frames(data_frames, label):
  function with_benchmark (line 52) | def with_benchmark(phrase, action):
  function check_classification_accuracy (line 61) | def check_classification_accuracy(data_frame, label):
  function check_regression_accuracy (line 69) | def check_regression_accuracy(data_frame, label):
  function prepare_data (line 77) | def prepare_data(spark, args, schema, dataPath):
  function extract_paths (line 86) | def extract_paths(paths, prefix):
  function transform_data (line 91) | def transform_data(
  function valid_input_data (line 104) | def valid_input_data(spark, args, raw_schema, final_schema):
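Two of the `utils.py` helpers above are self-describing enough to sketch: `merge_dicts` combines parameter dicts (used to layer XGBoost args over defaults), and `with_benchmark` times an action and reports the elapsed wall-clock time. A minimal version, assuming later dicts win on key collisions:

```python
import time

def merge_dicts(dict_x, dict_y):
    # Later dict wins on key collisions (an assumption about precedence).
    result = dict(dict_x)
    result.update(dict_y)
    return result

def with_benchmark(phrase, action):
    # Run a zero-arg callable, report its wall-clock time, return its result.
    start = time.time()
    result = action()
    print(f'{phrase} took {time.time() - start:.2f} s')
    return result

merged = merge_dicts({'eta': 0.1, 'max_depth': 6}, {'max_depth': 8})
total = with_benchmark('sum', lambda: sum(range(100)))
```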

FILE: examples/spark-connect-gpu/client/python/batch-job.py
  function explain (line 33) | def explain(dataframe):

FILE: scripts/csp-startup-scripts/emr/emr-spark-plugin-startup.py
  function upload_file_to_s3 (line 25) | def upload_file_to_s3(file_name, bucket_name, object_name=None):
  function create_emr_cluster (line 56) | def create_emr_cluster(release_label, key_name, service_role, subnet_id,...

FILE: scripts/encoding/python/com/nvidia/spark/encoding/criteo/common.py
  function customize_reader (line 17) | def customize_reader(reader):
  function customize_writer (line 21) | def customize_writer(writer):

FILE: scripts/encoding/python/com/nvidia/spark/encoding/criteo/one_hot_cpu_main.py
  function index (line 22) | def index(df, column):
  function expand (line 28) | def expand(indexer, df, column):
  function main (line 37) | def main(args):
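`one_hot_cpu_main.py` exposes an `index`/`expand` pair, which matches the usual two-step one-hot flow: first map each distinct category to an index (Spark's `StringIndexer`), then expand the index into 0/1 columns (`OneHotEncoder`). A pure-Python sketch of the same idea on a list of values:

```python
def one_hot(values):
    # index(): assign each distinct value a stable index;
    # expand(): turn each value into a 0/1 vector over the categories.
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    return [[1 if index[v] == i else 0 for i in range(len(categories))]
            for v in values]

encoded = one_hot(['b', 'a', 'b'])
```

Note the Spark `OneHotEncoder` drops the last category by default to avoid collinearity; this sketch keeps all columns for clarity.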

FILE: scripts/encoding/python/com/nvidia/spark/encoding/criteo/target_cpu_main.py
  function get_dict_df (line 25) | def get_dict_df(train_df, target_col, label_col):
  function encode_df (line 32) | def encode_df(original_df, dict_df, col_name):
  function main (line 39) | def main(args):
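`target_cpu_main.py`'s `get_dict_df`/`encode_df` pair suggests classic target (mean) encoding: build a per-category mean of the label from the training set, then join that mean back onto each row. A standalone sketch on rows-as-dicts (the DataFrame version would use `groupBy(...).avg(...)` plus a join):

```python
from collections import defaultdict

def target_encode(rows, target_col, label_col):
    """Mean-label lookup per category: {category_value: mean(label)}."""
    sums, counts = defaultdict(float), defaultdict(int)
    for row in rows:
        sums[row[target_col]] += row[label_col]
        counts[row[target_col]] += 1
    return {k: sums[k] / counts[k] for k in sums}

rows = [
    {'color': 'red',  'label': 1},
    {'color': 'red',  'label': 0},
    {'color': 'blue', 'label': 1},
]
encoding = target_encode(rows, 'color', 'label')
```

In practice target encoding is computed on the training split only and then applied to evaluation data, to avoid leaking labels.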

FILE: scripts/encoding/python/com/nvidia/spark/encoding/main.py
  function main (line 19) | def main():

FILE: scripts/encoding/python/com/nvidia/spark/encoding/utility/args.py
  function _to_bool (line 21) | def _to_bool(literal):
  function _to_str_list (line 24) | def _to_str_list(literal):
  function _validate_args (line 32) | def _validate_args(args):
  function parse_arguments (line 48) | def parse_arguments():

FILE: scripts/encoding/python/com/nvidia/spark/encoding/utility/utils.py
  function load_data (line 18) | def load_data(spark, paths, args, customize=None):
  function save_data (line 25) | def save_data(data_frame, path, args, customize=None):
  function load_model (line 33) | def load_model(model_class, path):
  function load_models (line 36) | def load_models(model_class, paths):
  function save_model (line 39) | def save_model(model, path, args):
  function save_dict (line 43) | def save_dict(mean_dict, target_path):
  function load_dict (line 50) | def load_dict(dict_path):
  function load_dict_df (line 57) | def load_dict_df(spark, dict_df_path):
Condensed preview — 277 files, each showing path, character count, and a content snippet.
[
  {
    "path": ".github/ISSUE_TEMPLATE/bug_report.md",
    "chars": 628,
    "preview": "---\nname: Bug report\nabout: Create a report to help us improve\ntitle: ''\nlabels: ''\nassignees: GaryShen2008\n\n---\n\n**Desc"
  },
  {
    "path": ".github/workflows/add-to-project.yml",
    "chars": 1024,
    "preview": "# Copyright (c) 2024-2025, NVIDIA CORPORATION.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
  },
  {
    "path": ".github/workflows/license-header-check.yml",
    "chars": 1613,
    "preview": "# Copyright (c) 2024-2025, NVIDIA CORPORATION.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
  },
  {
    "path": ".github/workflows/markdown-links-check/markdown-links-check-config.json",
    "chars": 508,
    "preview": "{\n  \"ignorePatterns\": [\n    {\n      \"pattern\": \"/docs\"\n    },\n    {\n      \"pattern\": \"/datasets\"\n    },\n    {\n      \"pat"
  },
  {
    "path": ".github/workflows/markdown-links-check.yml",
    "chars": 1277,
    "preview": "# Copyright (c) 2022-2025, NVIDIA CORPORATION.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
  },
  {
    "path": ".github/workflows/shell-check.yml",
    "chars": 1394,
    "preview": "# Copyright (c) 2025, NVIDIA CORPORATION.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may "
  },
  {
    "path": ".github/workflows/signoff-check.yml",
    "chars": 1065,
    "preview": "# Copyright (c) 2021-2024, NVIDIA CORPORATION.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
  },
  {
    "path": ".gitignore",
    "chars": 226,
    "preview": "*#*#\n*.#*\n*.iml\n*.ipr\n*.iws\n*.pyc\n*.pyo\n*.swp\n*~\n.DS_Store\n.cache\n.classpath\n.ensime\n.ensime_cache/\n.ensime_lucene\n.gene"
  },
  {
    "path": "CONTRIBUTING.md",
    "chars": 2650,
    "preview": "# Contributing to Spark Examples\n\n### Sign your work\n\nWe require that all contributors sign-off on their commits. \n\nThis"
  },
  {
    "path": "LICENSE",
    "chars": 11348,
    "preview": "                                 Apache License\n                           Version 2.0, January 2004\n                   "
  },
  {
    "path": "README.md",
    "chars": 4895,
    "preview": "# spark-rapids-examples\n\nThis is the [RAPIDS Accelerator for Apache Spark](https://nvidia.github.io/spark-rapids/) examp"
  },
  {
    "path": "dockerfile/Dockerfile",
    "chars": 2788,
    "preview": "# Copyright (c) 2019-2023, NVIDIA CORPORATION. All rights reserved.\n# Licensed to the Apache Software Foundation (ASF) u"
  },
  {
    "path": "dockerfile/gpu_executor_template.yaml",
    "chars": 717,
    "preview": "# Copyright (c) 2024, NVIDIA CORPORATION.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may "
  },
  {
    "path": "docs/get-started/xgboost-examples/building-sample-apps/python.md",
    "chars": 607,
    "preview": "# Build XGBoost Python Examples\n\n## Build\n\nFollow these steps to package the Python zip file:\n\n``` bash\ngit clone https:"
  },
  {
    "path": "docs/get-started/xgboost-examples/building-sample-apps/scala.md",
    "chars": 789,
    "preview": "# Build XGBoost Scala Examples\n\nThe examples rely on [XGBoost](https://github.com/dmlc/xgboost).\n\n## Build\n\nFollow these"
  },
  {
    "path": "docs/get-started/xgboost-examples/csp/aws/ec2.md",
    "chars": 6987,
    "preview": "# Get Started with XGBoost4J-Spark 3.0 on AWS EC2\n\nThis is a getting started guide to Spark 3.2+ on AWS EC2. At the end "
  },
  {
    "path": "docs/get-started/xgboost-examples/csp/databricks/databricks.md",
    "chars": 8750,
    "preview": "Get Started with XGBoost4J-Spark on Databricks\n======================================================\n\nThis is a getting"
  },
  {
    "path": "docs/get-started/xgboost-examples/csp/databricks/init.sh",
    "chars": 1831,
    "preview": "#!/bin/bash\n# Copyright (c) 2025-2026, NVIDIA CORPORATION.\n#\n# Licensed under the Apache License, Version 2.0 (the \"Lice"
  },
  {
    "path": "docs/get-started/xgboost-examples/csp/dataproc/gcp.md",
    "chars": 7969,
    "preview": "# Getting started pyspark+xgboost with RAPIDS Accelerator on GCP Dataproc\n [Google Cloud Dataproc](https://cloud.google."
  },
  {
    "path": "docs/get-started/xgboost-examples/dataset/mortgage.md",
    "chars": 1477,
    "preview": "# How to download the Mortgage dataset\n\n\n\n## Steps to download the data\n\n1. Go to the [Fannie Mae](https://capitalmarket"
  },
  {
    "path": "docs/get-started/xgboost-examples/notebook/python-notebook.md",
    "chars": 2907,
    "preview": "Get Started with pyspark+XGBoost with Jupyter Notebook\n================================================================="
  },
  {
    "path": "docs/get-started/xgboost-examples/notebook/spylon.md",
    "chars": 4514,
    "preview": "Get Started with XGBoost4J-Spark with Spylon Kernel Jupyter Notebook\n==================================================="
  },
  {
    "path": "docs/get-started/xgboost-examples/notebook/toree.md",
    "chars": 4090,
    "preview": "Get Started with XGBoost4J-Spark with Apache Toree Jupyter Notebook\n===================================================="
  },
  {
    "path": "docs/get-started/xgboost-examples/on-prem-cluster/kubernetes-scala.md",
    "chars": 11597,
    "preview": "Get Started with XGBoost4J-Spark on Kubernetes\n==============================================\nThis is a getting started "
  },
  {
    "path": "docs/get-started/xgboost-examples/on-prem-cluster/standalone-python.md",
    "chars": 13668,
    "preview": "Get Started with XGBoost4J-Spark on an Apache Spark Standalone Cluster\n================================================="
  },
  {
    "path": "docs/get-started/xgboost-examples/on-prem-cluster/standalone-scala.md",
    "chars": 12769,
    "preview": "Get Started with XGBoost4J-Spark on an Apache Spark Standalone Cluster\n================================================="
  },
  {
    "path": "docs/get-started/xgboost-examples/on-prem-cluster/yarn-python.md",
    "chars": 11978,
    "preview": "Get Started with XGBoost4J-Spark on Apache Hadoop YARN\n======================================================\nThis is a "
  },
  {
    "path": "docs/get-started/xgboost-examples/on-prem-cluster/yarn-scala.md",
    "chars": 11254,
    "preview": "Get Started with XGBoost4J-Spark on Apache Hadoop YARN\n======================================================\n\nThis is a"
  },
  {
    "path": "docs/get-started/xgboost-examples/prepare-package-data/preparation-python.md",
    "chars": 954,
    "preview": "## Prepare packages and dataset for pyspark\n\nFor simplicity export the location to these jars. All examples assume the p"
  },
  {
    "path": "docs/get-started/xgboost-examples/prepare-package-data/preparation-scala.md",
    "chars": 978,
    "preview": "## Prepare packages and dataset for scala\n\nFor simplicity export the location to these jars. All examples assume the pac"
  },
  {
    "path": "docs/trouble-shooting/xgboost-examples-trouble-shooting.md",
    "chars": 863,
    "preview": "## XGBoost\n\n### 1. NCCL errors\n\nXGBoost supports distributed GPU training which depends on NCCL2 available at [this link"
  },
  {
    "path": "examples/MIG-Support/README.md",
    "chars": 3094,
    "preview": "# Multi-Instance GPU (MIG) support in Apache Hadoop YARN\n\nThere are multiple solutions for MIG scheduling on YARN that y"
  },
  {
    "path": "examples/MIG-Support/device-plugins/gpu-mig/README.md",
    "chars": 5837,
    "preview": "# NVIDIA GPU Plugin for YARN with MIG support for YARN 3.3.0+\n\nThis plugin adds support for GPUs with [MIG](https://docs"
  },
  {
    "path": "examples/MIG-Support/device-plugins/gpu-mig/pom.xml",
    "chars": 3508,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!--\n  Copyright (c) 2021, NVIDIA CORPORATION.\n\n  Licensed under the Apache Licen"
  },
  {
    "path": "examples/MIG-Support/device-plugins/gpu-mig/scripts/getMIGGPUs",
    "chars": 1851,
    "preview": "#!/usr/bin/env bash\n\n# Copyright (c) 2021, NVIDIA CORPORATION.\n#\n# Licensed under the Apache License, Version 2.0 (the \""
  },
  {
    "path": "examples/MIG-Support/device-plugins/gpu-mig/src/main/java/com/nvidia/spark/NvidiaGPUMigPluginForRuntimeV2.java",
    "chars": 16948,
    "preview": "/*\n * Copyright (c) 2021, NVIDIA CORPORATION.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * y"
  },
  {
    "path": "examples/MIG-Support/device-plugins/gpu-mig/src/test/java/com/nvidia/spark/TestNvidiaGPUMigPluginForRuntimeV2.java",
    "chars": 10408,
    "preview": "/*\n * Copyright (c) 2021, NVIDIA CORPORATION.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * y"
  },
  {
    "path": "examples/MIG-Support/resource-types/gpu-mig/README.md",
    "chars": 5369,
    "preview": "# NVIDIA Support for GPU for YARN with MIG support for YARN 3.1.2 until YARN 3.3.0\n\nThis adds support for GPUs with [MIG"
  },
  {
    "path": "examples/MIG-Support/resource-types/gpu-mig/yarn312MIG.patch",
    "chars": 31308,
    "preview": "diff --git a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration"
  },
  {
    "path": "examples/MIG-Support/resource-types/gpu-mig/yarn313to315MIG.patch",
    "chars": 35066,
    "preview": "diff --git a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration"
  },
  {
    "path": "examples/MIG-Support/resource-types/gpu-mig/yarn321to323MIG.patch",
    "chars": 36485,
    "preview": "diff --git a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration"
  },
  {
    "path": "examples/MIG-Support/yarn-unpatched/README.md",
    "chars": 4276,
    "preview": "# MIG Support for Spark on YARN using unmodified versions of Apache Hadoop 3.1.2+\n\nThis document describes a solution fo"
  },
  {
    "path": "examples/MIG-Support/yarn-unpatched/scripts/mig2gpu.sh",
    "chars": 14044,
    "preview": "#!/bin/bash\n\n# Copyright (c) 2022, NVIDIA CORPORATION.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\""
  },
  {
    "path": "examples/MIG-Support/yarn-unpatched/scripts/nvidia-container-cli-wrapper.sh",
    "chars": 3616,
    "preview": "#!/bin/bash\n\n# Copyright (c) 2022-2025, NVIDIA CORPORATION.\n#\n# Licensed under the Apache License, Version 2.0 (the \"Lic"
  },
  {
    "path": "examples/MIG-Support/yarn-unpatched/scripts/nvidia-smi",
    "chars": 1596,
    "preview": "#!/bin/bash\n\n# Copyright (c) 2022, NVIDIA CORPORATION.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\""
  },
  {
    "path": "examples/ML+DL-Examples/Optuna-Spark/README.md",
    "chars": 11982,
    "preview": "<img src=\"http://developer.download.nvidia.com/notebooks/dlsw-notebooks/tensorrt_torchtrt_efficientnet/nvidia_logo.png\" "
  },
  {
    "path": "examples/ML+DL-Examples/Optuna-Spark/optuna-examples/databricks/init_optuna.sh",
    "chars": 3101,
    "preview": "#!/bin/bash\n#\n# Copyright (c) 2025-2026, NVIDIA CORPORATION.\n#\n# Licensed under the Apache License, Version 2.0 (the \"Li"
  },
  {
    "path": "examples/ML+DL-Examples/Optuna-Spark/optuna-examples/databricks/start_cluster.sh",
    "chars": 2706,
    "preview": "#!/bin/bash\n#\n# Copyright (c) 2025-2026, NVIDIA CORPORATION.\n#\n# Licensed under the Apache License, Version 2.0 (the \"Li"
  },
  {
    "path": "examples/ML+DL-Examples/Optuna-Spark/optuna-examples/optuna-dataframe.ipynb",
    "chars": 40367,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n "
  },
  {
    "path": "examples/ML+DL-Examples/Optuna-Spark/optuna-examples/optuna-joblibspark.ipynb",
    "chars": 31857,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n "
  },
  {
    "path": "examples/ML+DL-Examples/Spark-DL/dl_inference/README.md",
    "chars": 10160,
    "preview": "# Deep Learning Inference on Spark\n\nExample notebooks demonstrating **distributed deep learning inference** using the [p"
  },
  {
    "path": "examples/ML+DL-Examples/Spark-DL/dl_inference/databricks/README.md",
    "chars": 2574,
    "preview": "# Spark DL Inference on Databricks\n\n**Note**: fields in \\<brackets\\> require user inputs.  \nMake sure you are in [this]("
  },
  {
    "path": "examples/ML+DL-Examples/Spark-DL/dl_inference/databricks/setup/init_spark_dl.sh",
    "chars": 972,
    "preview": "#!/bin/bash\n# Copyright (c) 2025, NVIDIA CORPORATION.\n\nset -euxo pipefail\n\n# install requirements\nsudo /databricks/pytho"
  },
  {
    "path": "examples/ML+DL-Examples/Spark-DL/dl_inference/databricks/setup/start_cluster.sh",
    "chars": 3020,
    "preview": "#!/bin/bash\n# Copyright (c) 2025, NVIDIA CORPORATION.\n\nset -eo pipefail\n\nif [ $# -lt 1 ] || [ $# -gt 2 ]; then\n    echo "
  },
  {
    "path": "examples/ML+DL-Examples/Spark-DL/dl_inference/dataproc/README.md",
    "chars": 3194,
    "preview": "# Spark DL Inference on Dataproc\n\n## Setup\n\n**Note**: fields in \\<brackets\\> require user inputs.  \nMake sure you are in"
  },
  {
    "path": "examples/ML+DL-Examples/Spark-DL/dl_inference/dataproc/setup/init_spark_dl.sh",
    "chars": 1928,
    "preview": "#!/bin/bash\n# Copyright (c) 2025, NVIDIA CORPORATION.\n\nset -euxo pipefail\n\nfunction get_metadata_attribute() {\n  local -"
  },
  {
    "path": "examples/ML+DL-Examples/Spark-DL/dl_inference/dataproc/setup/start_cluster.sh",
    "chars": 3946,
    "preview": "#!/bin/bash\n# Copyright (c) 2025, NVIDIA CORPORATION.\n\nset -eo pipefail\n\nTENSOR_PARALLEL=false\nif [[ $# -gt 0 && \"$1\" =="
  },
  {
    "path": "examples/ML+DL-Examples/Spark-DL/dl_inference/huggingface/conditional_generation_tf.ipynb",
    "chars": 57948,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"777fc40d\",\n   \"metadata\": {},\n   \"source\": [\n    \"<img src=\\\"htt"
  },
  {
    "path": "examples/ML+DL-Examples/Spark-DL/dl_inference/huggingface/conditional_generation_torch.ipynb",
    "chars": 58484,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"application/vnd.databricks.v1+cell\": {\n     \"cellMet"
  },
  {
    "path": "examples/ML+DL-Examples/Spark-DL/dl_inference/huggingface/deepseek-r1_torch.ipynb",
    "chars": 27109,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"<img src=\\\"http://developer.downloa"
  },
  {
    "path": "examples/ML+DL-Examples/Spark-DL/dl_inference/huggingface/gemma-7b_torch.ipynb",
    "chars": 25398,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"<img src=\\\"http://developer.downloa"
  },
  {
    "path": "examples/ML+DL-Examples/Spark-DL/dl_inference/huggingface/pipelines_tf.ipynb",
    "chars": 47534,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"9e9fe848\",\n   \"metadata\": {},\n   \"source\": [\n    \"<img src=\\\"htt"
  },
  {
    "path": "examples/ML+DL-Examples/Spark-DL/dl_inference/huggingface/pipelines_torch.ipynb",
    "chars": 37618,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"60f7ac5d-4a95-4170-a0ac-a7faac9d9ef4\",\n   \"metadata\": {},\n   \"so"
  },
  {
    "path": "examples/ML+DL-Examples/Spark-DL/dl_inference/huggingface/qwen-2.5-7b_torch.ipynb",
    "chars": 31211,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"<img src=\\\"http://developer.downloa"
  },
  {
    "path": "examples/ML+DL-Examples/Spark-DL/dl_inference/huggingface/sentence_transformers_torch.ipynb",
    "chars": 34942,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"777fc40d\",\n   \"metadata\": {},\n   \"source\": [\n    \"<img src=\\\"htt"
  },
  {
    "path": "examples/ML+DL-Examples/Spark-DL/dl_inference/pytorch/housing_regression_torch.ipynb",
    "chars": 59718,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"792d95f9\",\n   \"metadata\": {},\n   \"source\": [\n    \"<img src=\\\"htt"
  },
  {
    "path": "examples/ML+DL-Examples/Spark-DL/dl_inference/pytorch/image_classification_torch.ipynb",
    "chars": 109314,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"9e87c927\",\n   \"metadata\": {},\n   \"source\": [\n    \"<img src=\\\"htt"
  },
  {
    "path": "examples/ML+DL-Examples/Spark-DL/dl_inference/requirements.txt",
    "chars": 744,
    "preview": "# Copyright (c) 2025, NVIDIA CORPORATION.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may "
  },
  {
    "path": "examples/ML+DL-Examples/Spark-DL/dl_inference/server_utils.py",
    "chars": 19584,
    "preview": "#\n# Copyright (c) 2025, NVIDIA CORPORATION.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you ma"
  },
  {
    "path": "examples/ML+DL-Examples/Spark-DL/dl_inference/tensorflow/image_classification_tf.ipynb",
    "chars": 124500,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"52d55e3f\",\n   \"metadata\": {},\n   \"source\": [\n    \"<img src=\\\"htt"
  },
  {
    "path": "examples/ML+DL-Examples/Spark-DL/dl_inference/tensorflow/keras_preprocessing_tf.ipynb",
    "chars": 71048,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"7fcc021a\",\n   \"metadata\": {},\n   \"source\": [\n    \"<img src=\\\"htt"
  },
  {
    "path": "examples/ML+DL-Examples/Spark-DL/dl_inference/tensorflow/keras_resnet50_tf.ipynb",
    "chars": 47098,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"8e6810cc-5982-4293-bfbd-c91ef0aca204\",\n   \"metadata\": {},\n   \"so"
  },
  {
    "path": "examples/ML+DL-Examples/Spark-DL/dl_inference/tensorflow/text_classification_tf.ipynb",
    "chars": 145900,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"2cd2accf-5877-4136-a243-7a33a13ce2b4\",\n   \"metadata\": {},\n   \"so"
  },
  {
    "path": "examples/ML+DL-Examples/Spark-DL/dl_inference/tf_requirements.txt",
    "chars": 638,
    "preview": "# Copyright (c) 2024, NVIDIA CORPORATION.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may "
  },
  {
    "path": "examples/ML+DL-Examples/Spark-DL/dl_inference/torch_requirements.txt",
    "chars": 813,
    "preview": "# Copyright (c) 2025, NVIDIA CORPORATION.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may "
  },
  {
    "path": "examples/ML+DL-Examples/Spark-DL/dl_inference/vllm/qwen-2.5-14b-tensor-parallel_vllm.ipynb",
    "chars": 27874,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"<img src=\\\"https://developer.downlo"
  },
  {
    "path": "examples/ML+DL-Examples/Spark-DL/dl_inference/vllm/qwen-2.5-7b_vllm.ipynb",
    "chars": 28535,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"<img src=\\\"http://developer.downloa"
  },
  {
    "path": "examples/ML+DL-Examples/Spark-DL/dl_inference/vllm_requirements.txt",
    "chars": 640,
    "preview": "# Copyright (c) 2025, NVIDIA CORPORATION.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may "
  },
  {
    "path": "examples/ML+DL-Examples/Spark-Rapids-ML/pca/README.md",
    "chars": 1711,
    "preview": "# Spark-Rapids-ML PCA example\n\nThis is an example of the GPU accelerated PCA algorithm from the [Spark-Rapids-ML](https:"
  },
  {
    "path": "examples/ML+DL-Examples/Spark-Rapids-ML/pca/notebooks/pca.ipynb",
    "chars": 18160,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Principal Component Analysis (PC"
  },
  {
    "path": "examples/ML+DL-Examples/Spark-Rapids-ML/pca/start-spark-rapids.sh",
    "chars": 3052,
    "preview": "#!/bin/bash\n#\n# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.\n#\n# Licensed under the Apache License, Vers"
  },
  {
    "path": "examples/SQL+DF-Examples/customer-churn/README.md",
    "chars": 700,
    "preview": "# Customer Churn\n\nThis demo is derived from [data-science-blueprints](https://github.com/NVIDIA/data-science-blueprints)"
  },
  {
    "path": "examples/SQL+DF-Examples/customer-churn/notebooks/python/README.md",
    "chars": 600,
    "preview": "# telco-churn-augmentation\n\nThis demo shows a realistic ETL workflow based on synthetic normalized data.  It consists of"
  },
  {
    "path": "examples/SQL+DF-Examples/customer-churn/notebooks/python/augment.ipynb",
    "chars": 542378,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Customer churn augment\\n\",\n    \"\\"
  },
  {
    "path": "examples/SQL+DF-Examples/customer-churn/notebooks/python/churn/augment.py",
    "chars": 12539,
    "preview": "# Copyright (c) 2022, NVIDIA Corporation.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may "
  },
  {
    "path": "examples/SQL+DF-Examples/customer-churn/notebooks/python/churn/eda.py",
    "chars": 5360,
    "preview": "# Copyright (c) 2022, NVIDIA Corporation.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may "
  },
  {
    "path": "examples/SQL+DF-Examples/customer-churn/notebooks/python/churn/etl.py",
    "chars": 11947,
    "preview": "#!/usr/bin/env python\n# coding: utf-8\n\n# Copyright (c) 2022, NVIDIA Corporation.\n#\n# Licensed under the Apache License, "
  },
  {
    "path": "examples/SQL+DF-Examples/customer-churn/notebooks/python/etl.ipynb",
    "chars": 55004,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Transforming and joining raw data"
  },
  {
    "path": "examples/SQL+DF-Examples/demo/Spark_get_json_object.ipynb",
    "chars": 8157,
    "preview": "{\n  \"cells\": [\n    {\n      \"cell_type\": \"markdown\",\n      \"id\": \"Td_alkbOv3Aj\",\n      \"metadata\": {\n        \"id\": \"Td_al"
  },
  {
    "path": "examples/SQL+DF-Examples/demo/Spark_parquet_microkernels.ipynb",
    "chars": 16886,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"raw\",\n   \"id\": \"Td_alkbOv3Aj\",\n   \"metadata\": {\n    \"id\": \"Td_alkbOv3Aj\"\n   },\n   \"so"
  },
  {
    "path": "examples/SQL+DF-Examples/micro-benchmarks/README.md",
    "chars": 2505,
    "preview": "# Microbenchmark\n\nStandard industry benchmarks are a great way to measure performance over \na period of time but another"
  },
  {
    "path": "examples/SQL+DF-Examples/micro-benchmarks/notebooks/micro-benchmarks-cpu.ipynb",
    "chars": 21799,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"d89df9bf\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Microbenchma"
  },
  {
    "path": "examples/SQL+DF-Examples/micro-benchmarks/notebooks/micro-benchmarks-gpu.ipynb",
    "chars": 26667,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"62787244\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Microbenchma"
  },
  {
    "path": "examples/SQL+DF-Examples/retail-analytics/README.md",
    "chars": 563,
    "preview": "\n# Overview Retail Analytics \nThis repository contains two Jupyter notebooks:\n\nData Generation: This notebook generates "
  },
  {
    "path": "examples/SQL+DF-Examples/retail-analytics/notebooks/python/retail-analytic.ipynb",
    "chars": 11697,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": "
  },
  {
    "path": "examples/SQL+DF-Examples/retail-analytics/notebooks/python/retail-datagen.ipynb",
    "chars": 7525,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Generating and Writing Data to "
  },
  {
    "path": "examples/SQL+DF-Examples/tpcds/README.md",
    "chars": 1520,
    "preview": "# TPC-DS Scale Factor 10 (GiB) - CPU Spark vs GPU Spark\n\n[TPC-DS](https://www.tpc.org/tpcds/) is a decision support benc"
  },
  {
    "path": "examples/SQL+DF-Examples/tpcds/notebooks/TPCDS-SF10.ipynb",
    "chars": 13877,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"editable\": true,\n    \"id\": \"HtgYO0bXEBrN\",\n    \"slid"
  },
  {
    "path": "examples/UDF-Examples/RAPIDS-accelerated-UDFs/Dockerfile",
    "chars": 2467,
    "preview": "#\n# Copyright (c) 2021-2026, NVIDIA CORPORATION. All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0"
  },
  {
    "path": "examples/UDF-Examples/RAPIDS-accelerated-UDFs/README.md",
    "chars": 12912,
    "preview": "# RAPIDS Accelerated UDF Examples\n\nThis project contains sample implementations of RAPIDS accelerated user-defined funct"
  },
  {
    "path": "examples/UDF-Examples/RAPIDS-accelerated-UDFs/clone-cudf-repo.sh",
    "chars": 3737,
    "preview": "#!/bin/bash\n#\n# Copyright (c) 2026, NVIDIA CORPORATION.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License"
  },
  {
    "path": "examples/UDF-Examples/RAPIDS-accelerated-UDFs/conftest.py",
    "chars": 1987,
    "preview": "# Copyright (c) 2020-2022, NVIDIA CORPORATION.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
  },
  {
    "path": "examples/UDF-Examples/RAPIDS-accelerated-UDFs/extract-cudf-libs.sh",
    "chars": 9693,
    "preview": "#!/bin/bash\n#\n# Copyright (c) 2026, NVIDIA CORPORATION.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License"
  },
  {
    "path": "examples/UDF-Examples/RAPIDS-accelerated-UDFs/pom.xml",
    "chars": 34082,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!--\n  Copyright (c) 2020-2026, NVIDIA CORPORATION.\n\n  Licensed under the Apache "
  },
  {
    "path": "examples/UDF-Examples/RAPIDS-accelerated-UDFs/pytest.ini",
    "chars": 691,
    "preview": "; Copyright (c) 2020-2022, NVIDIA CORPORATION.\n;\n; Licensed under the Apache License, Version 2.0 (the \"License\");\n; you"
  },
  {
    "path": "examples/UDF-Examples/RAPIDS-accelerated-UDFs/run_pyspark_from_build.sh",
    "chars": 2573,
    "preview": "#!/bin/bash\n# Copyright (c) 2022-2025, NVIDIA CORPORATION.\n#\n# Licensed under the Apache License, Version 2.0 (the \"Lice"
  },
  {
    "path": "examples/UDF-Examples/RAPIDS-accelerated-UDFs/runtests.py",
    "chars": 871,
    "preview": "# Copyright (c) 2022, NVIDIA CORPORATION.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may "
  },
  {
    "path": "examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/cpp/CMakeLists.txt",
    "chars": 17976,
    "preview": "#=============================================================================\n# Copyright (c) 2021-2026, NVIDIA CORPORA"
  },
  {
    "path": "examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/cpp/benchmarks/CMakeLists.txt",
    "chars": 1700,
    "preview": "#=============================================================================\n# Copyright (c) 2021-2022, NVIDIA CORPORA"
  },
  {
    "path": "examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/cpp/benchmarks/cosine_similarity/cosine_similarity_benchmark.cpp",
    "chars": 3364,
    "preview": "/*\n * Copyright (c) 2021-2022, NVIDIA CORPORATION.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");"
  },
  {
    "path": "examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/cpp/benchmarks/fixture/benchmark_fixture.hpp",
    "chars": 2809,
    "preview": "/*\n * Copyright (c) 2021-2022, NVIDIA CORPORATION.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");"
  },
  {
    "path": "examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/cpp/benchmarks/synchronization/synchronization.cpp",
    "chars": 1951,
    "preview": "/*\n * Copyright (c) 2021-2022, NVIDIA CORPORATION.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");"
  },
  {
    "path": "examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/cpp/benchmarks/synchronization/synchronization.hpp",
    "chars": 3467,
    "preview": "/*\n * Copyright (c) 2021-2022, NVIDIA CORPORATION.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");"
  },
  {
    "path": "examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/cpp/src/CosineSimilarityJni.cpp",
    "chars": 3596,
    "preview": "/*\n * Copyright (c) 2021-2022, NVIDIA CORPORATION.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");"
  },
  {
    "path": "examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/cpp/src/StringWordCountJni.cpp",
    "chars": 2718,
    "preview": "/*\n * Copyright (c) 2021-2022, NVIDIA CORPORATION.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");"
  },
  {
    "path": "examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/cpp/src/cosine_similarity.cu",
    "chars": 6624,
    "preview": "/*\n * Copyright (c) 2021-2022, NVIDIA CORPORATION.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");"
  },
  {
    "path": "examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/cpp/src/cosine_similarity.hpp",
    "chars": 1406,
    "preview": "/*\n * Copyright (c) 2021-2022, NVIDIA CORPORATION.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");"
  },
  {
    "path": "examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/cpp/src/string_word_count.cu",
    "chars": 3151,
    "preview": "/*\n * Copyright (c) 2021-2022, NVIDIA CORPORATION.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");"
  },
  {
    "path": "examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/cpp/src/string_word_count.hpp",
    "chars": 997,
    "preview": "/*\n * Copyright (c) 2021-2022, NVIDIA CORPORATION.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");"
  },
  {
    "path": "examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/java/com/nvidia/spark/rapids/udf/hive/DecimalFraction.java",
    "chars": 4238,
    "preview": "/*\n * Copyright (c) 2021-2022, NVIDIA CORPORATION.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");"
  },
  {
    "path": "examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/java/com/nvidia/spark/rapids/udf/hive/StringWordCount.java",
    "chars": 3015,
    "preview": "/*\n * Copyright (c) 2021-2022, NVIDIA CORPORATION.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");"
  },
  {
    "path": "examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/java/com/nvidia/spark/rapids/udf/hive/URLDecode.java",
    "chars": 2892,
    "preview": "/*\n * Copyright (c) 2020-2022, NVIDIA CORPORATION.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");"
  },
  {
    "path": "examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/java/com/nvidia/spark/rapids/udf/hive/URLEncode.java",
    "chars": 4488,
    "preview": "/*\n * Copyright (c) 2020-2022, NVIDIA CORPORATION.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");"
  },
  {
    "path": "examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/java/com/nvidia/spark/rapids/udf/java/CosineSimilarity.java",
    "chars": 2970,
    "preview": "/*\n * Copyright (c) 2021-2022, NVIDIA CORPORATION.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");"
  },
  {
    "path": "examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/java/com/nvidia/spark/rapids/udf/java/DecimalFraction.java",
    "chars": 2051,
    "preview": "/*\n * Copyright (c) 2021-2022, NVIDIA CORPORATION.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");"
  },
  {
    "path": "examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/java/com/nvidia/spark/rapids/udf/java/NativeUDFExamplesLoader.java",
    "chars": 1202,
    "preview": "/*\n * Copyright (c) 2021-2022, NVIDIA CORPORATION.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");"
  },
  {
    "path": "examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/java/com/nvidia/spark/rapids/udf/java/URLDecode.java",
    "chars": 2902,
    "preview": "/*\n * Copyright (c) 2021-2022, NVIDIA CORPORATION.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");"
  },
  {
    "path": "examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/java/com/nvidia/spark/rapids/udf/java/URLEncode.java",
    "chars": 2481,
    "preview": "/*\n * Copyright (c) 2021-2022, NVIDIA CORPORATION.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");"
  },
  {
    "path": "examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/python/asserts.py",
    "chars": 21982,
    "preview": "# Copyright (c) 2020-2022, NVIDIA CORPORATION.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
  },
  {
    "path": "examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/python/conftest.py",
    "chars": 10784,
    "preview": "# Copyright (c) 2020-2022, NVIDIA CORPORATION.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
  },
  {
    "path": "examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/python/data_gen.py",
    "chars": 42444,
    "preview": "# Copyright (c) 2020-2022, NVIDIA CORPORATION.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
  },
  {
    "path": "examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/python/rapids_udf_test.py",
    "chars": 5775,
    "preview": "# Copyright (c) 2020-2022, NVIDIA CORPORATION.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
  },
  {
    "path": "examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/python/spark_init_internal.py",
    "chars": 4353,
    "preview": "# Copyright (c) 2020-2021, NVIDIA CORPORATION.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
  },
  {
    "path": "examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/python/spark_session.py",
    "chars": 4530,
    "preview": "# Copyright (c) 2020-2022, NVIDIA CORPORATION.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you"
  },
  {
    "path": "examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/scala/com/nvidia/spark/rapids/udf/scala/URLDecode.scala",
    "chars": 2538,
    "preview": "/*\n * Copyright (c) 2021-2022, NVIDIA CORPORATION.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");"
  },
  {
    "path": "examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/scala/com/nvidia/spark/rapids/udf/scala/URLEncode.scala",
    "chars": 2002,
    "preview": "/*\n * Copyright (c) 2021-2022, NVIDIA CORPORATION.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");"
  },
  {
    "path": "examples/XGBoost-Examples/.gitignore",
    "chars": 12,
    "preview": "samples.zip\n"
  },
  {
    "path": "examples/XGBoost-Examples/README.md",
    "chars": 6699,
    "preview": "# Spark XGBoost Examples\n\nSpark XGBoost examples here showcase the need for ETL+Training pipeline GPU acceleration.\nThe "
  },
  {
    "path": "examples/XGBoost-Examples/agaricus/.gitignore",
    "chars": 19,
    "preview": ".idea\ntarget\n*.iml\n"
  },
  {
    "path": "examples/XGBoost-Examples/agaricus/notebooks/python/agaricus-gpu.ipynb",
    "chars": 44577,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Introduction to XGBoost Spark wit"
  },
  {
    "path": "examples/XGBoost-Examples/agaricus/notebooks/scala/agaricus-gpu.ipynb",
    "chars": 18610,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Introduction to XGBoost Spark3.0 "
  },
  {
    "path": "examples/XGBoost-Examples/agaricus/pom.xml",
    "chars": 1725,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!--\n  ~ Copyright (c) 2019-2024, NVIDIA CORPORATION. All rights reserved.\n  ~\n  "
  },
  {
    "path": "examples/XGBoost-Examples/agaricus/python/com/__init__.py",
    "chars": 589,
    "preview": "# Copyright (c) 2024, NVIDIA CORPORATION.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may "
  },
  {
    "path": "examples/XGBoost-Examples/agaricus/python/com/nvidia/__init__.py",
    "chars": 589,
    "preview": "# Copyright (c) 2024, NVIDIA CORPORATION.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may "
  },
  {
    "path": "examples/XGBoost-Examples/agaricus/python/com/nvidia/spark/__init__.py",
    "chars": 589,
    "preview": "# Copyright (c) 2024, NVIDIA CORPORATION.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may "
  },
  {
    "path": "examples/XGBoost-Examples/agaricus/python/com/nvidia/spark/examples/__init__.py",
    "chars": 589,
    "preview": "# Copyright (c) 2024, NVIDIA CORPORATION.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may "
  },
  {
    "path": "examples/XGBoost-Examples/agaricus/python/com/nvidia/spark/examples/agaricus/__init__.py",
    "chars": 589,
    "preview": "# Copyright (c) 2024, NVIDIA CORPORATION.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may "
  },
  {
    "path": "examples/XGBoost-Examples/agaricus/python/com/nvidia/spark/examples/agaricus/main.py",
    "chars": 2685,
    "preview": "#\n# Copyright (c) 2019-2022, NVIDIA CORPORATION. All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0"
  },
  {
    "path": "examples/XGBoost-Examples/agaricus/scala/src/com/nvidia/spark/examples/agaricus/Main.scala",
    "chars": 4104,
    "preview": "/*\n * Copyright (c) 2019-2022, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version"
  },
  {
    "path": "examples/XGBoost-Examples/aggregator/.gitignore",
    "chars": 25,
    "preview": ".idea\ntarget\n*.iml\n*.xml\n"
  },
  {
    "path": "examples/XGBoost-Examples/app-parameters/supported_xgboost_parameters_python.md",
    "chars": 3604,
    "preview": "Supported Parameters\n============================\n\nThis is a description of all the parameters available when you are ru"
  },
  {
    "path": "examples/XGBoost-Examples/app-parameters/supported_xgboost_parameters_scala.md",
    "chars": 2960,
    "preview": "Supported Parameters\n============================\n\nThis is a description of all the parameters available when you are ru"
  },
  {
    "path": "examples/XGBoost-Examples/assembly/assembly-no-scala.xml",
    "chars": 1357,
    "preview": "<!--\n  ~ Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.\n  ~\n  ~ Licensed under the Apache License, Version"
  },
  {
    "path": "examples/XGBoost-Examples/main.py",
    "chars": 665,
    "preview": "#\n# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the"
  },
  {
    "path": "examples/XGBoost-Examples/mortgage/.gitignore",
    "chars": 19,
    "preview": ".idea\ntarget\n*.iml\n"
  },
  {
    "path": "examples/XGBoost-Examples/mortgage/notebooks/python/MortgageETL+XGBoost.ipynb",
    "chars": 46350,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Dataset\\n\",\n    \"\\n\",\n    \"Datase"
  },
  {
    "path": "examples/XGBoost-Examples/mortgage/notebooks/python/MortgageETL.ipynb",
    "chars": 284830,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Prerequirement\\n\",\n    \"### 1. D"
  },
  {
    "path": "examples/XGBoost-Examples/mortgage/notebooks/python/cv-mortgage-gpu.ipynb",
    "chars": 60078,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Introduction to XGBoost-Spark Cro"
  },
  {
    "path": "examples/XGBoost-Examples/mortgage/notebooks/python/mortgage-gpu.ipynb",
    "chars": 32122,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Introduction to XGBoost Spark wit"
  },
  {
    "path": "examples/XGBoost-Examples/mortgage/notebooks/scala/mortgage-ETL.ipynb",
    "chars": 45829,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"e82e9fb4\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Introduction"
  },
  {
    "path": "examples/XGBoost-Examples/mortgage/notebooks/scala/mortgage-gpu.ipynb",
    "chars": 25664,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Introduction to XGBoost Spark wit"
  },
  {
    "path": "examples/XGBoost-Examples/mortgage/notebooks/scala/mortgage_gpu_crossvalidation.ipynb",
    "chars": 15047,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Mortgage CrossValidation with GPU"
  },
  {
    "path": "examples/XGBoost-Examples/mortgage/pom.xml",
    "chars": 1724,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!--\n  ~ Copyright (c) 2019-2024, NVIDIA CORPORATION. All rights reserved.\n  ~\n  "
  },
  {
    "path": "examples/XGBoost-Examples/mortgage/python/com/__init__.py",
    "chars": 589,
    "preview": "# Copyright (c) 2024, NVIDIA CORPORATION.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may "
  },
  {
    "path": "examples/XGBoost-Examples/mortgage/python/com/nvidia/__init__.py",
    "chars": 589,
    "preview": "# Copyright (c) 2024, NVIDIA CORPORATION.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may "
  },
  {
    "path": "examples/XGBoost-Examples/mortgage/python/com/nvidia/spark/__init__.py",
    "chars": 589,
    "preview": "# Copyright (c) 2024, NVIDIA CORPORATION.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may "
  },
  {
    "path": "examples/XGBoost-Examples/mortgage/python/com/nvidia/spark/examples/__init__.py",
    "chars": 589,
    "preview": "# Copyright (c) 2024, NVIDIA CORPORATION.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may "
  },
  {
    "path": "examples/XGBoost-Examples/mortgage/python/com/nvidia/spark/examples/mortgage/__init__.py",
    "chars": 589,
    "preview": "# Copyright (c) 2024, NVIDIA CORPORATION.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may "
  },
  {
    "path": "examples/XGBoost-Examples/mortgage/python/com/nvidia/spark/examples/mortgage/consts.py",
    "chars": 13274,
    "preview": "#\n# Copyright (c) 2019-2022, NVIDIA CORPORATION. All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0"
  },
  {
    "path": "examples/XGBoost-Examples/mortgage/python/com/nvidia/spark/examples/mortgage/cross_validator_main.py",
    "chars": 3255,
    "preview": "#\n# Copyright (c) 2019-2022, NVIDIA CORPORATION. All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0"
  },
  {
    "path": "examples/XGBoost-Examples/mortgage/python/com/nvidia/spark/examples/mortgage/etl.py",
    "chars": 9305,
    "preview": "#\n# Copyright (c) 2019-2022, NVIDIA CORPORATION. All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0"
  },
  {
    "path": "examples/XGBoost-Examples/mortgage/python/com/nvidia/spark/examples/mortgage/etl_main.py",
    "chars": 1070,
    "preview": "#\n# Copyright (c) 2019-2021, NVIDIA CORPORATION. All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0"
  },
  {
    "path": "examples/XGBoost-Examples/mortgage/python/com/nvidia/spark/examples/mortgage/main.py",
    "chars": 2456,
    "preview": "#\n# Copyright (c) 2019-2022, NVIDIA CORPORATION. All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0"
  },
  {
    "path": "examples/XGBoost-Examples/mortgage/scala/src/com/nvidia/spark/examples/mortgage/CrossValidationMain.scala",
    "chars": 4521,
    "preview": "/*\n * Copyright (c) 2019-2022, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version"
  },
  {
    "path": "examples/XGBoost-Examples/mortgage/scala/src/com/nvidia/spark/examples/mortgage/ETLMain.scala",
    "chars": 3491,
    "preview": "/*\n * Copyright (c) 2019-2021, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version"
  },
  {
    "path": "examples/XGBoost-Examples/mortgage/scala/src/com/nvidia/spark/examples/mortgage/Main.scala",
    "chars": 3936,
    "preview": "/*\n * Copyright (c) 2019-2022, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version"
  },
  {
    "path": "examples/XGBoost-Examples/mortgage/scala/src/com/nvidia/spark/examples/mortgage/Mortgage.scala",
    "chars": 2212,
    "preview": "/*\n * Copyright (c) 2019-2022, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version"
  },
  {
    "path": "examples/XGBoost-Examples/mortgage/scala/src/com/nvidia/spark/examples/mortgage/XGBoostETL.scala",
    "chars": 25441,
    "preview": "/*\n * Copyright (c) 2019-2021, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version"
  },
  {
    "path": "examples/XGBoost-Examples/pack_pyspark_example.sh",
    "chars": 916,
    "preview": "#!/bin/bash\n# Copyright (c) 2024-2025, NVIDIA CORPORATION.\n#\n# Licensed under the Apache License, Version 2.0 (the \"Lice"
  },
  {
    "path": "examples/XGBoost-Examples/pom.xml",
    "chars": 5742,
    "preview": "<?xml version='1.0' encoding='UTF-8'?>\n<!--\n  ~ Copyright (c) 2019-2025, NVIDIA CORPORATION. All rights reserved.\n  ~\n  "
  },
  {
    "path": "examples/XGBoost-Examples/taxi/.gitignore",
    "chars": 19,
    "preview": ".idea\ntarget\n*.iml\n"
  },
  {
    "path": "examples/XGBoost-Examples/taxi/notebooks/python/cv-taxi-gpu.ipynb",
    "chars": 23938,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Introduction to XGBoost-Spark Cro"
  },
  {
    "path": "examples/XGBoost-Examples/taxi/notebooks/python/taxi-ETL.ipynb",
    "chars": 10174,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"71bf747a\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Introduction"
  },
  {
    "path": "examples/XGBoost-Examples/taxi/notebooks/python/taxi-gpu.ipynb",
    "chars": 17153,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Introduction to XGBoost Spark3.1 "
  },
  {
    "path": "examples/XGBoost-Examples/taxi/notebooks/scala/taxi-ETL.ipynb",
    "chars": 16475,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"e0336840\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Introduction"
  },
  {
    "path": "examples/XGBoost-Examples/taxi/notebooks/scala/taxi-gpu.ipynb",
    "chars": 17924,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Introduction to XGBoost Spark wit"
  },
  {
    "path": "examples/XGBoost-Examples/taxi/notebooks/scala/taxi_gpu_crossvalidation.ipynb",
    "chars": 16018,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Taxi CrossValidation with GPU acc"
  },
  {
    "path": "examples/XGBoost-Examples/taxi/pom.xml",
    "chars": 1720,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!--\n  ~ Copyright (c) 2019-2024, NVIDIA CORPORATION. All rights reserved.\n  ~\n  "
  },
  {
    "path": "examples/XGBoost-Examples/taxi/python/com/__init__.py",
    "chars": 589,
    "preview": "# Copyright (c) 2024, NVIDIA CORPORATION.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may "
  },
  {
    "path": "examples/XGBoost-Examples/taxi/python/com/nvidia/__init__.py",
    "chars": 589,
    "preview": "# Copyright (c) 2024, NVIDIA CORPORATION.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may "
  },
  {
    "path": "examples/XGBoost-Examples/taxi/python/com/nvidia/spark/__init__.py",
    "chars": 589,
    "preview": "# Copyright (c) 2024, NVIDIA CORPORATION.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may "
  },
  {
    "path": "examples/XGBoost-Examples/taxi/python/com/nvidia/spark/examples/__init__.py",
    "chars": 589,
    "preview": "# Copyright (c) 2024, NVIDIA CORPORATION.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may "
  },
  {
    "path": "examples/XGBoost-Examples/taxi/python/com/nvidia/spark/examples/taxi/__init__.py",
    "chars": 589,
    "preview": "# Copyright (c) 2024, NVIDIA CORPORATION.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may "
  },
  {
    "path": "examples/XGBoost-Examples/taxi/python/com/nvidia/spark/examples/taxi/consts.py",
    "chars": 2298,
    "preview": "#\n# Copyright (c) 2019-2022, NVIDIA CORPORATION. All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0"
  },
  {
    "path": "examples/XGBoost-Examples/taxi/python/com/nvidia/spark/examples/taxi/cross_validator_main.py",
    "chars": 3146,
    "preview": "#\n# Copyright (c) 2019-2021, NVIDIA CORPORATION. All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0"
  },
  {
    "path": "examples/XGBoost-Examples/taxi/python/com/nvidia/spark/examples/taxi/etl_main.py",
    "chars": 1669,
    "preview": "#\n# Copyright (c) 2019-2022, NVIDIA CORPORATION. All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0"
  },
  {
    "path": "examples/XGBoost-Examples/taxi/python/com/nvidia/spark/examples/taxi/main.py",
    "chars": 2511,
    "preview": "#\n# Copyright (c) 2019-2022, NVIDIA CORPORATION. All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0"
  },
  {
    "path": "examples/XGBoost-Examples/taxi/python/com/nvidia/spark/examples/taxi/pre_process.py",
    "chars": 3054,
    "preview": "#\n# Copyright (c) 2019-2021, NVIDIA CORPORATION. All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0"
  },
  {
    "path": "examples/XGBoost-Examples/taxi/scala/src/com/nvidia/spark/examples/taxi/CrossValidationMain.scala",
    "chars": 4378,
    "preview": "/*\n * Copyright (c) 2019-2022, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version"
  },
  {
    "path": "examples/XGBoost-Examples/taxi/scala/src/com/nvidia/spark/examples/taxi/ETLMain.scala",
    "chars": 3340,
    "preview": "/*\n * Copyright (c) 2019-2021, NVIDIA CORPORATION. All rights reserved.\n *\n * Licensed under the Apache License, Version"
  }
]

// ... and 77 more files (download for full content)
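The index entries above follow a simple `{path, chars, preview}` schema, one object per extracted file. A minimal sketch of consuming such an index with only the Python standard library (the two entries embedded below are sample values taken from the listing; the grouping key is an illustrative choice, not part of the extraction format):

```python
import json
from collections import defaultdict

# Two sample entries copied from the index above; a real index would be
# loaded from the downloaded .txt/.json file instead of an inline string.
index = json.loads("""
[
  {"path": "examples/XGBoost-Examples/taxi/pom.xml",
   "chars": 1720,
   "preview": "<?xml version=\\"1.0\\" encoding=\\"UTF-8\\"?>"},
  {"path": "examples/XGBoost-Examples/taxi/python/com/nvidia/spark/examples/taxi/main.py",
   "chars": 2511,
   "preview": "#"}
]
""")

# Group files by their first two path components and sum the character
# counts, giving a rough per-example size breakdown.
sizes = defaultdict(int)
for entry in index:
    top = "/".join(entry["path"].split("/")[:2])
    sizes[top] += entry["chars"]

for top, total in sorted(sizes.items()):
    print(f"{top}: {total} chars")
```

Because `chars` is the raw character count of each file, summing it per directory approximates the relative weight of each example without re-reading the repository.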

About this extraction

This page contains the full source code of the NVIDIA/spark-rapids-examples GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 277 files (3.5 MB), approximately 925.1k tokens, and a symbol index with 420 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.
