Repository: G-Research/spark-extension
Branch: master
Commit: 65c3dda4a96b
Files: 128
Total size: 857.8 KB
Directory structure:
gitextract_5zjc6xfa/
├── .github/
│ ├── actions/
│ │ ├── build-whl/
│ │ │ └── action.yml
│ │ ├── check-compat/
│ │ │ └── action.yml
│ │ ├── prime-caches/
│ │ │ └── action.yml
│ │ ├── test-jvm/
│ │ │ └── action.yml
│ │ ├── test-python/
│ │ │ └── action.yml
│ │ └── test-release/
│ │ └── action.yml
│ ├── dependabot.yml
│ ├── show-spark-versions.sh
│ └── workflows/
│ ├── build-jvm.yml
│ ├── build-python.yml
│ ├── build-snapshots.yml
│ ├── check.yml
│ ├── ci.yml
│ ├── clear-caches.yaml
│ ├── prepare-release.yml
│ ├── prime-caches.yml
│ ├── publish-release.yml
│ ├── publish-snapshot.yml
│ ├── test-jvm.yml
│ ├── test-python.yml
│ ├── test-release.yml
│ ├── test-results.yml
│ └── test-snapshots.yml
├── .gitignore
├── .scalafmt.conf
├── CHANGELOG.md
├── CONDITIONAL.md
├── DIFF.md
├── GROUPS.md
├── HISTOGRAM.md
├── LICENSE
├── MAINTAINERS.md
├── PARQUET.md
├── PARTITIONING.md
├── PYSPARK-DEPS.md
├── README.md
├── RELEASE.md
├── ROW_NUMBER.md
├── SECURITY.md
├── build-whl.sh
├── bump-version.sh
├── examples/
│ └── python-deps/
│ ├── Dockerfile
│ ├── docker-compose.yml
│ └── example.py
├── pom.xml
├── python/
│ ├── README.md
│ ├── gresearch/
│ │ ├── __init__.py
│ │ └── spark/
│ │ ├── __init__.py
│ │ ├── diff/
│ │ │ ├── __init__.py
│ │ │ └── comparator/
│ │ │ └── __init__.py
│ │ └── parquet/
│ │ └── __init__.py
│ ├── pyproject.toml
│ ├── pyspark/
│ │ └── jars/
│ │ └── .gitignore
│ ├── setup.py
│ └── test/
│ ├── __init__.py
│ ├── spark_common.py
│ ├── test_diff.py
│ ├── test_histogram.py
│ ├── test_job_description.py
│ ├── test_jvm.py
│ ├── test_package.py
│ ├── test_parquet.py
│ └── test_row_number.py
├── release.sh
├── set-version.sh
├── src/
│ ├── main/
│ │ ├── scala/
│ │ │ └── uk/
│ │ │ └── co/
│ │ │ └── gresearch/
│ │ │ ├── package.scala
│ │ │ └── spark/
│ │ │ ├── BuildVersion.scala
│ │ │ ├── Histogram.scala
│ │ │ ├── RowNumbers.scala
│ │ │ ├── SparkVersion.scala
│ │ │ ├── UnpersistHandle.scala
│ │ │ ├── diff/
│ │ │ │ ├── App.scala
│ │ │ │ ├── Diff.scala
│ │ │ │ ├── DiffComparators.scala
│ │ │ │ ├── DiffOptions.scala
│ │ │ │ ├── comparator/
│ │ │ │ │ ├── DefaultDiffComparator.scala
│ │ │ │ │ ├── DiffComparator.scala
│ │ │ │ │ ├── DurationDiffComparator.scala
│ │ │ │ │ ├── EpsilonDiffComparator.scala
│ │ │ │ │ ├── EquivDiffComparator.scala
│ │ │ │ │ ├── MapDiffComparator.scala
│ │ │ │ │ ├── NullSafeEqualDiffComparator.scala
│ │ │ │ │ ├── TypedDiffComparator.scala
│ │ │ │ │ └── WhitespaceDiffComparator.scala
│ │ │ │ └── package.scala
│ │ │ ├── group/
│ │ │ │ └── package.scala
│ │ │ ├── package.scala
│ │ │ └── parquet/
│ │ │ ├── ParquetMetaDataUtil.scala
│ │ │ └── package.scala
│ │ ├── scala-spark-3.2/
│ │ │ └── uk/
│ │ │ └── co/
│ │ │ └── gresearch/
│ │ │ └── spark/
│ │ │ └── parquet/
│ │ │ └── SplitFile.scala
│ │ ├── scala-spark-3.3/
│ │ │ └── uk/
│ │ │ └── co/
│ │ │ └── gresearch/
│ │ │ └── spark/
│ │ │ └── parquet/
│ │ │ └── SplitFile.scala
│ │ ├── scala-spark-3.5/
│ │ │ ├── org/
│ │ │ │ └── apache/
│ │ │ │ └── spark/
│ │ │ │ └── sql/
│ │ │ │ └── extension/
│ │ │ │ └── package.scala
│ │ │ └── uk/
│ │ │ └── co/
│ │ │ └── gresearch/
│ │ │ └── spark/
│ │ │ └── Backticks.scala
│ │ └── scala-spark-4.0/
│ │ ├── org/
│ │ │ └── apache/
│ │ │ └── spark/
│ │ │ └── sql/
│ │ │ └── extension/
│ │ │ └── package.scala
│ │ └── uk/
│ │ └── co/
│ │ └── gresearch/
│ │ └── spark/
│ │ ├── Backticks.scala
│ │ └── parquet/
│ │ └── SplitFile.scala
│ └── test/
│ ├── files/
│ │ ├── encrypted1.parquet
│ │ ├── encrypted2.parquet
│ │ ├── nested.parquet
│ │ └── test.parquet/
│ │ ├── file1.parquet
│ │ └── file2.parquet
│ ├── java/
│ │ └── uk/
│ │ └── co/
│ │ └── gresearch/
│ │ └── test/
│ │ ├── SparkJavaTests.java
│ │ └── diff/
│ │ ├── DiffJavaTests.java
│ │ ├── JavaValue.java
│ │ └── JavaValueAs.java
│ ├── resources/
│ │ ├── log4j.properties
│ │ └── log4j2.properties
│ ├── scala/
│ │ └── uk/
│ │ └── co/
│ │ └── gresearch/
│ │ ├── spark/
│ │ │ ├── GroupBySuite.scala
│ │ │ ├── HistogramSuite.scala
│ │ │ ├── SparkSuite.scala
│ │ │ ├── SparkTestSession.scala
│ │ │ ├── WritePartitionedSuite.scala
│ │ │ ├── diff/
│ │ │ │ ├── AppSuite.scala
│ │ │ │ ├── DiffComparatorSuite.scala
│ │ │ │ ├── DiffOptionsSuite.scala
│ │ │ │ ├── DiffSuite.scala
│ │ │ │ └── examples/
│ │ │ │ └── Examples.scala
│ │ │ ├── group/
│ │ │ │ └── GroupSuite.scala
│ │ │ ├── parquet/
│ │ │ │ └── ParquetSuite.scala
│ │ │ └── test/
│ │ │ └── package.scala
│ │ └── test/
│ │ ├── ClasspathSuite.scala
│ │ ├── Spec.scala
│ │ └── Suite.scala
│ ├── scala-spark-3/
│ │ └── uk/
│ │ └── co/
│ │ └── gresearch/
│ │ └── spark/
│ │ └── SparkSuiteHelper.scala
│ └── scala-spark-4/
│ └── uk/
│ └── co/
│ └── gresearch/
│ └── spark/
│ └── SparkSuiteHelper.scala
├── test-release.py
├── test-release.scala
└── test-release.sh
================================================
FILE CONTENTS
================================================
================================================
FILE: .github/actions/build-whl/action.yml
================================================
name: 'Build Whl'
author: 'EnricoMi'
description: 'A GitHub Action that builds pyspark-extension package'
inputs:
spark-version:
description: Spark version, e.g. 3.4.0, 3.4.0-SNAPSHOT, or 4.0.0-preview1
required: true
scala-version:
description: Scala version, e.g. 2.12.15
required: true
spark-compat-version:
description: Spark compatibility version, e.g. 3.4
required: true
scala-compat-version:
description: Scala compatibility version, e.g. 2.12
required: true
java-compat-version:
description: Java compatibility version, e.g. 8
required: true
python-version:
description: Python version, e.g. 3.8
required: true
runs:
using: 'composite'
steps:
- name: Fetch Binaries Artifact
uses: actions/download-artifact@v4
with:
name: Binaries-${{ inputs.spark-compat-version }}-${{ inputs.scala-compat-version }}
path: .
- name: Set versions in pom.xml
run: |
./set-version.sh ${{ inputs.spark-version }} ${{ inputs.scala-version }}
git diff
shell: bash
- name: Make this work with PySpark preview versions
if: contains(inputs.spark-version, 'preview')
run: |
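# append .dev1 to the pyspark requirement so pip accepts pre-release builds, and replace the {spark_compat_version}.0 placeholder with the exact preview version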
sed -i -e 's/f"\(pyspark~=.*\)"/f"\1.dev1"/' -e 's/f"\({spark_compat_version}.0\)"/"${{ inputs.spark-version }}"/g' python/setup.py
git diff python/setup.py
shell: bash
- name: Restore Maven packages cache
if: github.event_name != 'schedule'
uses: actions/cache/restore@v4
with:
path: ~/.m2/repository
key: ${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-${{ hashFiles('pom.xml') }}
restore-keys: |
${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-${{ hashFiles('pom.xml') }}
${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-
- name: Setup JDK ${{ inputs.java-compat-version }}
uses: actions/setup-java@v4
with:
java-version: ${{ inputs.java-compat-version }}
distribution: 'zulu'
- name: Fetch Release Test Dependencies
run: |
# Fetch Release Test Dependencies
echo "::group::mvn dependency:get"
mvn dependency:get -Dtransitive=false -Dartifact=org.apache.parquet:parquet-hadoop:1.16.0:jar:tests
echo "::endgroup::"
shell: bash
- name: Setup Python
uses: actions/setup-python@v5
with:
python-version: ${{ inputs.python-version }}
- name: Install Python dependencies
run: |
# Install Python dependencies
echo "::group::mvn compile"
python -m pip install --upgrade pip build twine
echo "::endgroup::"
shell: bash
- name: Build whl
run: |
# Build whl
echo "::group::build-whl.sh"
./build-whl.sh
echo "::endgroup::"
shell: bash
- name: Test whl
run: |
# Test whl
echo "::group::test-release.py"
twine check python/dist/*
# .dev1 allows this to work with preview versions
pip install python/dist/*.whl "pyspark~=${{ inputs.spark-compat-version }}.0.dev1"
python test-release.py
echo "::endgroup::"
shell: bash
- name: Upload whl
uses: actions/upload-artifact@v4
with:
name: Whl (Spark ${{ inputs.spark-compat-version }} Scala ${{ inputs.scala-compat-version }})
path: |
python/dist/*.whl
- name: Build whl with mvn
env:
JDK_JAVA_OPTIONS: --add-exports java.base/sun.nio.ch=ALL-UNNAMED --add-exports java.base/sun.util.calendar=ALL-UNNAMED
run: |
# Build whl with mvn
rm -rf target python/dist python/pyspark_extension.egg-info pyspark/jars/*.jar
echo "::group::build-whl.sh"
./build-whl.sh
echo "::endgroup::"
shell: bash
branding:
icon: 'check-circle'
color: 'green'
================================================
FILE: .github/actions/check-compat/action.yml
================================================
name: 'Check'
author: 'EnricoMi'
description: 'A GitHub Action that checks compatibility of spark-extension'
inputs:
spark-version:
description: Spark version, e.g. 3.4.0 or 3.4.0-SNAPSHOT
required: true
scala-version:
description: Scala version, e.g. 2.12.15
required: true
spark-compat-version:
description: Spark compatibility version, e.g. 3.4
required: true
scala-compat-version:
description: Scala compatibility version, e.g. 2.12
required: true
package-version:
description: Spark-Extension version to check against
required: true
runs:
using: 'composite'
steps:
- name: Fetch Binaries Artifact
uses: actions/download-artifact@v4
with:
name: Binaries-${{ inputs.spark-compat-version }}-${{ inputs.scala-compat-version }}
path: .
- name: Set versions in pom.xml
run: |
./set-version.sh ${{ inputs.spark-version }} ${{ inputs.scala-version }}
git diff
shell: bash
- name: Restore Maven packages cache
if: github.event_name != 'schedule'
uses: actions/cache/restore@v4
with:
path: ~/.m2/repository
key: ${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-${{ hashFiles('pom.xml') }}
restore-keys: |
${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-${{ hashFiles('pom.xml') }}
${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-
- name: Setup JDK 1.8
uses: actions/setup-java@v4
with:
java-version: '8'
distribution: 'zulu'
- name: Install Checker
run: |
# Install Checker
echo "::group::apt update install"
sudo apt update
sudo apt install japi-compliance-checker
echo "::endgroup::"
shell: bash
- name: Release exists
id: exists
continue-on-error: true
run: |
# Release exists
curl --head --fail https://repo1.maven.org/maven2/uk/co/gresearch/spark/spark-extension_${{ inputs.scala-compat-version }}/${{ inputs.package-version }}-${{ inputs.spark-compat-version }}/spark-extension_${{ inputs.scala-compat-version }}-${{ inputs.package-version }}-${{ inputs.spark-compat-version }}.jar
shell: bash
- name: Fetch package
if: steps.exists.outcome == 'success'
run: |
# Fetch package
echo "::group::mvn dependency:get"
mvn dependency:get -Dtransitive=false -DremoteRepositories -Dartifact=uk.co.gresearch.spark:spark-extension_${{ inputs.scala-compat-version }}:${{ inputs.package-version }}-${{ inputs.spark-compat-version }}
echo "::endgroup::"
shell: bash
- name: Check
if: steps.exists.outcome == 'success'
continue-on-error: ${{ github.ref == 'refs/heads/master' }}
run: |
# Check
echo "::group::japi-compliance-checker"
ls -lah ~/.m2/repository/uk/co/gresearch/spark/spark-extension_${{ inputs.scala-compat-version }}/${{ inputs.package-version }}-${{ inputs.spark-compat-version }}/spark-extension_${{ inputs.scala-compat-version }}-${{ inputs.package-version }}-${{ inputs.spark-compat-version }}.jar target/spark-extension*.jar
japi-compliance-checker ~/.m2/repository/uk/co/gresearch/spark/spark-extension_${{ inputs.scala-compat-version }}/${{ inputs.package-version }}-${{ inputs.spark-compat-version }}/spark-extension_${{ inputs.scala-compat-version }}-${{ inputs.package-version }}-${{ inputs.spark-compat-version }}.jar target/spark-extension*.jar
echo "::endgroup::"
shell: bash
- name: Upload Report
uses: actions/upload-artifact@v4
if: always() && steps.exists.outcome == 'success'
with:
name: Compat-Report-${{ inputs.spark-compat-version }}
path: compat_reports/spark-extension/*
branding:
icon: 'check-circle'
color: 'green'
================================================
FILE: .github/actions/prime-caches/action.yml
================================================
name: 'Prime caches'
author: 'EnricoMi'
description: 'A GitHub Action that primes caches'
inputs:
spark-version:
description: Spark version, e.g. 3.4.0 or 3.4.0-SNAPSHOT
required: true
scala-version:
description: Scala version, e.g. 2.12.15
required: true
spark-compat-version:
description: Spark compatibility version, e.g. 3.4
required: true
scala-compat-version:
description: Scala compatibility version, e.g. 2.12
required: true
java-compat-version:
description: Java compatibility version, e.g. 8
required: true
hadoop-version:
description: Hadoop version, e.g. 2.7 or 2
required: true
runs:
using: 'composite'
steps:
- name: Set versions in pom.xml
run: |
./set-version.sh ${{ inputs.spark-version }} ${{ inputs.scala-version }}
git diff
shell: bash
- name: Check Maven packages cache
id: mvn-build-cache
uses: actions/cache/restore@v4
with:
lookup-only: true
path: ~/.m2/repository
key: ${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-${{ hashFiles('pom.xml') }}
- name: Check Spark Binaries cache
id: spark-binaries-cache
uses: actions/cache/restore@v4
with:
lookup-only: true
path: ~/spark
key: ${{ runner.os }}-spark-binaries-${{ inputs.spark-version }}-${{ inputs.scala-compat-version }}
- name: Prepare priming caches
id: setup
run: |
# Prepare priming caches
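# re-prime the Maven cache for SNAPSHOT builds or when no cache entry matched this pom.xml; likewise for the Spark binaries cache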
if [[ "${{ inputs.spark-version }}" == *"-SNAPSHOT" ]] || [[ -z "${{ steps.mvn-build-cache.outputs.cache-hit }}" ]]; then
echo "prime-mvn-cache=true" >> "$GITHUB_ENV"
echo "prime-some-cache=true" >> "$GITHUB_ENV"
fi;
if [[ "${{ inputs.spark-version }}" == *"-SNAPSHOT" ]] || [[ -z "${{ steps.spark-binaries-cache.outputs.cache-hit }}" ]]; then
echo "prime-spark-cache=true" >> "$GITHUB_ENV"
echo "prime-some-cache=true" >> "$GITHUB_ENV"
fi;
shell: bash
- name: Setup JDK ${{ inputs.java-compat-version }}
if: env.prime-some-cache
uses: actions/setup-java@v4
with:
java-version: ${{ inputs.java-compat-version }}
distribution: 'zulu'
- name: Build
if: env.prime-mvn-cache
env:
JDK_JAVA_OPTIONS: --add-exports java.base/sun.nio.ch=ALL-UNNAMED --add-exports java.base/sun.util.calendar=ALL-UNNAMED
run: |
# Build
echo "::group::mvn dependency:go-offline"
mvn --batch-mode dependency:go-offline
echo "::endgroup::"
shell: bash
- name: Save Maven packages cache
if: env.prime-mvn-cache
uses: actions/cache/save@v4
with:
path: ~/.m2/repository
key: ${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-${{ hashFiles('pom.xml') }}-${{ github.run_id }}
- name: Setup Spark Binaries
if: env.prime-spark-cache && ! contains(inputs.spark-version, '-SNAPSHOT')
env:
SPARK_PACKAGE: spark-${{ inputs.spark-version }}/spark-${{ inputs.spark-version }}-bin-hadoop${{ inputs.hadoop-version }}${{ startsWith(inputs.spark-version, '3.') && inputs.scala-compat-version == '2.13' && '-scala2.13' || '' }}.tgz
run: |
wget --progress=dot:giga "https://www.apache.org/dyn/closer.lua/spark/${SPARK_PACKAGE}?action=download" -O - | tar -xzC "${{ runner.temp }}"
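# the tarball extracts into a directory named after the archive (without .tgz); move it to ~/spark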
archive=$(basename "${SPARK_PACKAGE}") bash -c "mv -v "${{ runner.temp }}/\${archive/%.tgz/}" ~/spark"
shell: bash
- name: Save Spark Binaries cache
if: env.prime-spark-cache && ! contains(inputs.spark-version, '-SNAPSHOT')
uses: actions/cache/save@v4
with:
path: ~/spark
key: ${{ runner.os }}-spark-binaries-${{ inputs.spark-version }}-${{ inputs.scala-compat-version }}-${{ github.run_id }}
branding:
icon: 'check-circle'
color: 'green'
================================================
FILE: .github/actions/test-jvm/action.yml
================================================
name: 'Test JVM'
author: 'EnricoMi'
description: 'A GitHub Action that tests JVM spark-extension'
inputs:
spark-version:
description: Spark version, e.g. 3.4.0, 3.4.0-SNAPSHOT or 4.0.0-preview1
required: true
spark-compat-version:
description: Spark compatibility version, e.g. 3.4
required: true
spark-archive-url:
description: The URL to download the Spark binary distribution
required: false
scala-version:
description: Scala version, e.g. 2.12.15
required: true
scala-compat-version:
description: Scala compatibility version, e.g. 2.12
required: true
hadoop-version:
description: Hadoop version, e.g. 2.7 or 2
required: true
java-compat-version:
description: Java compatibility version, e.g. 8
required: true
runs:
using: 'composite'
steps:
- name: Fetch Binaries Artifact
uses: actions/download-artifact@v4
with:
name: Binaries-${{ inputs.spark-compat-version }}-${{ inputs.scala-compat-version }}
path: .
- name: Set versions in pom.xml
run: |
./set-version.sh ${{ inputs.spark-version }} ${{ inputs.scala-version }}
git diff
shell: bash
- name: Restore Spark Binaries cache
if: github.event_name != 'schedule' && ! contains(inputs.spark-version, '-SNAPSHOT')
uses: actions/cache/restore@v4
with:
path: ~/spark
key: ${{ runner.os }}-spark-binaries-${{ inputs.spark-version }}-${{ inputs.scala-compat-version }}
restore-keys: |
${{ runner.os }}-spark-binaries-${{ inputs.spark-version }}-${{ inputs.scala-compat-version }}
- name: Setup Spark Binaries
if: ( ! contains(inputs.spark-version, '-SNAPSHOT') )
env:
SPARK_PACKAGE: spark-${{ inputs.spark-version }}/spark-${{ inputs.spark-version }}-bin-hadoop${{ inputs.hadoop-version }}${{ startsWith(inputs.spark-version, '3.') && inputs.scala-compat-version == '2.13' && '-scala2.13' || '' }}.tgz
run: |
# Setup Spark Binaries
if [[ ! -e ~/spark ]]
then
url="${{ inputs.spark-archive-url }}"
wget --progress=dot:giga "${url:-https://www.apache.org/dyn/closer.lua/spark/${SPARK_PACKAGE}?action=download}" -O - | tar -xzC "${{ runner.temp }}"
archive=$(basename "${SPARK_PACKAGE}") bash -c "mv -v "${{ runner.temp }}/\${archive/%.tgz/}" ~/spark"
fi
echo "SPARK_HOME=$(cd ~/spark; pwd)" >> $GITHUB_ENV
shell: bash
- name: Restore Maven packages cache
if: github.event_name != 'schedule'
uses: actions/cache/restore@v4
with:
path: ~/.m2/repository
key: ${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-${{ hashFiles('pom.xml') }}
restore-keys: |
${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-${{ hashFiles('pom.xml') }}
${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-
- name: Setup JDK ${{ inputs.java-compat-version }}
uses: actions/setup-java@v4
with:
java-version: ${{ inputs.java-compat-version }}
distribution: 'zulu'
- name: Scala and Java Tests
env:
JDK_JAVA_OPTIONS: --add-exports java.base/sun.nio.ch=ALL-UNNAMED --add-exports java.base/sun.util.calendar=ALL-UNNAMED
run: |
# Scala and Java Tests
echo "::group::mvn test"
mvn --batch-mode --update-snapshots -Dspotless.check.skip test integration-test
echo "::endgroup::"
shell: bash
- name: Upload Test Results
if: always()
uses: actions/upload-artifact@v4
with:
name: JVM Test Results (Spark ${{ inputs.spark-version }} Scala ${{ inputs.scala-version }})
path: |
target/surefire-*reports/*.xml
branding:
icon: 'check-circle'
color: 'green'
================================================
FILE: .github/actions/test-python/action.yml
================================================
name: 'Test Python'
author: 'EnricoMi'
description: 'A GitHub Action that tests Python spark-extension'
# pyspark is not available for snapshots or scala other than 2.12
# we would have to compile spark from sources for this, not worth it
# so this action only works with scala 2.12 and non-snapshot spark versions
inputs:
spark-version:
description: Spark version, e.g. 3.4.0 or 4.0.0-preview1
required: true
scala-version:
description: Scala version, e.g. 2.12.15
required: true
spark-compat-version:
description: Spark compatibility version, e.g. 3.4
required: true
spark-archive-url:
description: The URL to download the Spark binary distribution
required: false
spark-package-repo:
description: The URL of an alternate maven repository to fetch Spark packages
required: false
scala-compat-version:
description: Scala compatibility version, e.g. 2.12
required: true
java-compat-version:
description: Java compatibility version, e.g. 8
required: true
hadoop-version:
description: Hadoop version, e.g. 2.7 or 2
required: true
python-version:
description: Python version, e.g. 3.8
required: true
runs:
using: 'composite'
steps:
- name: Fetch Binaries Artifact
uses: actions/download-artifact@v4
with:
name: Binaries-${{ inputs.spark-compat-version }}-${{ inputs.scala-compat-version }}
path: .
- name: Set versions in pom.xml
run: |
./set-version.sh ${{ inputs.spark-version }} ${{ inputs.scala-version }}
git diff
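# read the project version from the first <version> element in pom.xml (strip tags and surrounding whitespace)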
SPARK_EXTENSION_VERSION=$(grep --max-count=1 "<version>.*</version>" pom.xml | sed -E -e "s/\s*<[^>]+>//g")
echo "SPARK_EXTENSION_VERSION=$SPARK_EXTENSION_VERSION" | tee -a "$GITHUB_ENV"
shell: bash
- name: Make this work with PySpark preview versions
if: contains(inputs.spark-version, 'preview')
run: |
sed -i -e 's/\({spark_compat_version}.0\)"/\1.dev1"/' python/setup.py
git diff python/setup.py
shell: bash
- name: Restore Spark Binaries cache
if: github.event_name != 'schedule' && ( startsWith(inputs.spark-version, '3.') && inputs.scala-compat-version == '2.12' || startsWith(inputs.spark-version, '4.') ) && ! contains(inputs.spark-version, '-SNAPSHOT')
uses: actions/cache/restore@v4
with:
path: ~/spark
key: ${{ runner.os }}-spark-binaries-${{ inputs.spark-version }}-${{ inputs.scala-compat-version }}
restore-keys: |
${{ runner.os }}-spark-binaries-${{ inputs.spark-version }}-${{ inputs.scala-compat-version }}
- name: Setup Spark Binaries
if: ( startsWith(inputs.spark-version, '3.') && inputs.scala-compat-version == '2.12' || startsWith(inputs.spark-version, '4.') ) && ! contains(inputs.spark-version, '-SNAPSHOT')
env:
SPARK_PACKAGE: spark-${{ inputs.spark-version }}/spark-${{ inputs.spark-version }}-bin-hadoop${{ inputs.hadoop-version }}${{ startsWith(inputs.spark-version, '3.') && inputs.scala-compat-version == '2.13' && '-scala2.13' || '' }}.tgz
run: |
# Setup Spark Binaries
if [[ ! -e ~/spark ]]
then
url="${{ inputs.spark-archive-url }}"
wget --progress=dot:giga "${url:-https://www.apache.org/dyn/closer.lua/spark/${SPARK_PACKAGE}?action=download}" -O - | tar -xzC "${{ runner.temp }}"
archive=$(basename "${SPARK_PACKAGE}") bash -c "mv -v "${{ runner.temp }}/\${archive/%.tgz/}" ~/spark"
fi
echo "SPARK_BIN_HOME=$(cd ~/spark; pwd)" >> $GITHUB_ENV
shell: bash
- name: Restore Maven packages cache
if: github.event_name != 'schedule'
uses: actions/cache/restore@v4
with:
path: ~/.m2/repository
key: ${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-${{ hashFiles('pom.xml') }}
restore-keys: |
${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-${{ hashFiles('pom.xml') }}
${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-
- name: Setup JDK ${{ inputs.java-compat-version }}
uses: actions/setup-java@v4
with:
java-version: ${{ inputs.java-compat-version }}
distribution: 'zulu'
- name: Setup Python
uses: actions/setup-python@v5
with:
python-version: ${{ inputs.python-version }}
- name: Install Python dependencies
run: |
# Install Python dependencies
echo "::group::pip install"
python -m venv .pytest-venv
.pytest-venv/bin/python -m pip install --upgrade pip
.pytest-venv/bin/pip install pypandoc
.pytest-venv/bin/pip install -e python/[test]
echo "::endgroup::"
PYSPARK_HOME=$(.pytest-venv/bin/python -c "import os; import pyspark; print(os.path.dirname(pyspark.__file__))")
PYSPARK_BIN_HOME="$(cd ".pytest-venv/"; pwd)"
PYSPARK_PYTHON="$PYSPARK_BIN_HOME/bin/python"
echo "PYSPARK_HOME=$PYSPARK_HOME" | tee -a "$GITHUB_ENV"
echo "PYSPARK_BIN_HOME=$PYSPARK_BIN_HOME" | tee -a "$GITHUB_ENV"
echo "PYSPARK_PYTHON=$PYSPARK_PYTHON" | tee -a "$GITHUB_ENV"
shell: bash
- name: Prepare Poetry tests
run: |
# Prepare Poetry tests
echo "::group::Prepare poetry tests"
# install poetry in venv
python -m venv .poetry-venv
.poetry-venv/bin/python -m pip install poetry
# env var needed by poetry tests
echo "POETRY_PYTHON=$PWD/.poetry-venv/bin/python" | tee -a "$GITHUB_ENV"
# clone example poetry project
git clone https://github.com/Textualize/rich.git .rich
cd .rich
git reset --hard 20024635c06c22879fd2fd1e380ec4cccd9935dd
# env var needed by poetry tests
echo "RICH_SOURCES=$PWD" | tee -a "$GITHUB_ENV"
echo "::endgroup::"
shell: bash
- name: Python Unit Tests
env:
SPARK_HOME: ${{ env.PYSPARK_HOME }}
PYTHONPATH: python/test
run: |
.pytest-venv/bin/python -m pytest python/test --junit-xml test-results/pytest-$(date +%s.%N)-$RANDOM.xml
shell: bash
- name: Install Spark Extension
run: |
# Install Spark Extension
echo "::group::mvn install"
mvn --batch-mode --update-snapshots install -Dspotless.check.skip -DskipTests -Dmaven.test.skip=true -Dgpg.skip
echo "::endgroup::"
shell: bash
- name: Start Spark Connect
id: spark-connect
if: ( contains('3.4,3.5', inputs.spark-compat-version) && inputs.scala-compat-version == '2.12' || startsWith(inputs.spark-version, '4.') ) && ! contains(inputs.spark-version, '-SNAPSHOT')
env:
SPARK_HOME: ${{ env.SPARK_BIN_HOME }}
CONNECT_GRPC_BINDING_ADDRESS: 127.0.0.1
CONNECT_GRPC_BINDING_PORT: 15002
run: |
# Start Spark Connect
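# try up to 10 times to start the Spark Connect server; it is considered up once port 15002 is listening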
for attempt in {1..10}; do
$SPARK_HOME/sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_${{ inputs.scala-compat-version }}:${{ inputs.spark-version }} --repositories "${{ inputs.spark-package-repo }}"
sleep 10
for log in $SPARK_HOME/logs/spark-*-org.apache.spark.sql.connect.service.SparkConnectServer-*.out; do
echo "::group::Spark Connect server log: $log"
eoc="EOC-$RANDOM"
echo "::stop-commands::$eoc"
cat "$log" || true
echo "::$eoc::"
echo "::endgroup::"
done
if netstat -an | grep 15002; then
break;
fi
echo "::warning title=Starting Spark Connect server failed::Attempt #$attempt to start Spark Connect server failed"
$SPARK_HOME/sbin/stop-connect-server.sh --packages org.apache.spark:spark-connect_${{ inputs.scala-compat-version }}:${{ inputs.spark-version }}
sleep 5
done
if ! netstat -an | grep 15002; then
echo "::error title=Starting Spark Connect server failed::All attempts to start Spark Connect server failed"
exit 1
fi
shell: bash
- name: Python Unit Tests (Spark Connect)
if: steps.spark-connect.outcome == 'success'
env:
SPARK_HOME: ${{ env.PYSPARK_HOME }}
PYTHONPATH: python/test
TEST_SPARK_CONNECT_SERVER: sc://127.0.0.1:15002
run: |
# Python Unit Tests (Spark Connect)
echo "::group::pip install"
# .dev1 allows this to work with preview versions
.pytest-venv/bin/pip install "pyspark[connect]~=${{ inputs.spark-compat-version }}.0.dev1"
echo "::endgroup::"
.pytest-venv/bin/python -m pytest python/test --junit-xml test-results-connect/pytest-$(date +%s.%N)-$RANDOM.xml
shell: bash
- name: Stop Spark Connect
if: always() && steps.spark-connect.outcome == 'success'
env:
SPARK_HOME: ${{ env.SPARK_BIN_HOME }}
run: |
# Stop Spark Connect
$SPARK_HOME/sbin/stop-connect-server.sh
for log in $SPARK_HOME/logs/spark-*-org.apache.spark.sql.connect.service.SparkConnectServer-*.out; do
echo "::group::Spark Connect server log: $log"
eoc="EOC-$RANDOM"
echo "::stop-commands::$eoc"
cat "$log" || true
echo "::$eoc::"
echo "::endgroup::"
done
shell: bash
- name: Upload Test Results
if: always()
uses: actions/upload-artifact@v4
with:
name: Python Test Results (Spark ${{ inputs.spark-version }} Scala ${{ inputs.scala-version }} Python ${{ inputs.python-version }})
path: |
test-results/*.xml
test-results-connect/*.xml
branding:
icon: 'check-circle'
color: 'green'
================================================
FILE: .github/actions/test-release/action.yml
================================================
name: 'Test Release'
author: 'EnricoMi'
description: 'A GitHub Action that tests spark-extension release'
# pyspark is not available for snapshots or scala other than 2.12
# we would have to compile spark from sources for this, not worth it
# so this action only works with scala 2.12 and non-snapshot spark versions
inputs:
spark-version:
description: Spark version, e.g. 3.4.0 or 4.0.0-preview1
required: true
scala-version:
description: Scala version, e.g. 2.12.15
required: true
spark-compat-version:
description: Spark compatibility version, e.g. 3.4
required: true
spark-archive-url:
description: The URL to download the Spark binary distribution
required: false
scala-compat-version:
description: Scala compatibility version, e.g. 2.12
required: true
java-compat-version:
description: Java compatibility version, e.g. 8
required: true
hadoop-version:
description: Hadoop version, e.g. 2.7 or 2
required: true
python-version:
description: Python version, e.g. 3.8
default: ''
required: false
runs:
using: 'composite'
steps:
- name: Fetch Binaries Artifact
uses: actions/download-artifact@v4
with:
name: Binaries-${{ inputs.spark-compat-version }}-${{ inputs.scala-compat-version }}
path: .
- name: Set versions in pom.xml
run: |
./set-version.sh ${{ inputs.spark-version }} ${{ inputs.scala-version }}
git diff
SPARK_EXTENSION_VERSION=$(grep --max-count=1 "<version>.*</version>" pom.xml | sed -E -e "s/\s*<[^>]+>//g")
echo "SPARK_EXTENSION_VERSION=$SPARK_EXTENSION_VERSION" | tee -a "$GITHUB_ENV"
shell: bash
- name: Restore Spark Binaries cache
if: github.event_name != 'schedule'
uses: actions/cache/restore@v4
with:
path: ~/spark
key: ${{ runner.os }}-spark-binaries-${{ inputs.spark-version }}-${{ inputs.scala-compat-version }}
restore-keys: |
${{ runner.os }}-spark-binaries-${{ inputs.spark-version }}-${{ inputs.scala-compat-version }}
- name: Setup Spark Binaries
env:
SPARK_PACKAGE: spark-${{ inputs.spark-version }}/spark-${{ inputs.spark-version }}-bin-hadoop${{ inputs.hadoop-version }}${{ startsWith(inputs.spark-version, '3.') && inputs.scala-compat-version == '2.13' && '-scala2.13' || '' }}.tgz
run: |
# Setup Spark Binaries
if [[ ! -e ~/spark ]]
then
url="${{ inputs.spark-archive-url }}"
wget --progress=dot:giga "${url:-https://www.apache.org/dyn/closer.lua/spark/${SPARK_PACKAGE}?action=download}" -O - | tar -xzC "${{ runner.temp }}"
archive=$(basename "${SPARK_PACKAGE}") bash -c "mv -v "${{ runner.temp }}/\${archive/%.tgz/}" ~/spark"
fi
echo "SPARK_BIN_HOME=$(cd ~/spark; pwd)" >> $GITHUB_ENV
shell: bash
- name: Restore Maven packages cache
if: github.event_name != 'schedule'
uses: actions/cache/restore@v4
with:
path: ~/.m2/repository
key: ${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-${{ hashFiles('pom.xml') }}
restore-keys: |
${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-${{ hashFiles('pom.xml') }}
${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-
- name: Setup JDK ${{ inputs.java-compat-version }}
uses: actions/setup-java@v4
with:
java-version: ${{ inputs.java-compat-version }}
distribution: 'zulu'
- name: Diff App test
env:
SPARK_HOME: ${{ env.SPARK_BIN_HOME }}
run: |
# Diff App test
echo "::group::spark-submit"
$SPARK_HOME/bin/spark-submit --packages com.github.scopt:scopt_${{ inputs.scala-compat-version }}:4.1.0 target/spark-extension_*.jar --format parquet --id id src/test/files/test.parquet/file1.parquet src/test/files/test.parquet/file2.parquet diff.parquet
echo
echo "::endgroup::"
echo "::group::spark-shell"
$SPARK_HOME/bin/spark-shell <<< 'val df = spark.read.parquet("diff.parquet").orderBy($"id").groupBy($"diff").count; df.show; if (df.count != 2) sys.exit(1)'
echo
echo "::endgroup::"
shell: bash
- name: Install Spark Extension
run: |
# Install Spark Extension
echo "::group::mvn install"
mvn --batch-mode --update-snapshots install -Dspotless.check.skip -DskipTests -Dmaven.test.skip=true -Dgpg.skip
echo "::endgroup::"
shell: bash
- name: Fetch Release Test Dependencies
run: |
# Fetch Release Test Dependencies
echo "::group::mvn dependency:get"
mvn dependency:get -Dtransitive=false -Dartifact=org.apache.parquet:parquet-hadoop:1.16.0:jar:tests
echo "::endgroup::"
shell: bash
- name: Scala Release Test
env:
SPARK_HOME: ${{ env.SPARK_BIN_HOME }}
run: |
# Scala Release Test
echo "::group::spark-shell"
$SPARK_BIN_HOME/bin/spark-shell --packages uk.co.gresearch.spark:spark-extension_${{ inputs.scala-compat-version }}:$SPARK_EXTENSION_VERSION --jars ~/.m2/repository/org/apache/parquet/parquet-hadoop/1.16.0/parquet-hadoop-1.16.0-tests.jar < test-release.scala
echo
echo "::endgroup::"
shell: bash
- name: Setup Python
uses: actions/setup-python@v5
if: inputs.python-version != ''
with:
python-version: ${{ inputs.python-version }}
- name: Python Release Test
if: inputs.python-version != ''
env:
SPARK_HOME: ${{ env.SPARK_BIN_HOME }}
run: |
# Python Release Test
echo "::group::spark-submit"
$SPARK_BIN_HOME/bin/spark-submit --packages uk.co.gresearch.spark:spark-extension_${{ inputs.scala-compat-version }}:$SPARK_EXTENSION_VERSION test-release.py
echo
echo "::endgroup::"
shell: bash
- name: Fetch Whl Artifact
if: inputs.python-version != ''
uses: actions/download-artifact@v4
with:
name: Whl (Spark ${{ inputs.spark-compat-version }} Scala ${{ inputs.scala-compat-version }})
path: .
- name: Install Python dependencies
if: inputs.python-version != ''
run: |
# Install Python dependencies
echo "::group::pip install"
python -m venv .pytest-venv
.pytest-venv/bin/python -m pip install --upgrade pip
.pytest-venv/bin/pip install pypandoc
.pytest-venv/bin/pip install $(ls pyspark_extension-*.whl)[test]
echo "::endgroup::"
PYSPARK_HOME=$(.pytest-venv/bin/python -c "import os; import pyspark; print(os.path.dirname(pyspark.__file__))")
PYSPARK_BIN_HOME="$(cd ".pytest-venv/"; pwd)"
PYSPARK_PYTHON="$PYSPARK_BIN_HOME/bin/python"
echo "PYSPARK_HOME=$PYSPARK_HOME" | tee -a "$GITHUB_ENV"
echo "PYSPARK_BIN_HOME=$PYSPARK_BIN_HOME" | tee -a "$GITHUB_ENV"
echo "PYSPARK_PYTHON=$PYSPARK_PYTHON" | tee -a "$GITHUB_ENV"
shell: bash
- name: PySpark Release Test
if: inputs.python-version != ''
run: |
.pytest-venv/bin/python3 test-release.py
shell: bash
- name: Python Integration Tests
if: inputs.python-version != ''
env:
SPARK_HOME: ${{ env.PYSPARK_HOME }}
PYTHONPATH: python:python/test
run: |
# Python Integration Tests
source .pytest-venv/bin/activate
find python/test -name 'test*.py' > tests
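# run each test module via spark-submit; record failures but keep running, fail the step at the end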
while read test
do
echo "::group::spark-submit $test"
if ! $PYSPARK_BIN_HOME/bin/spark-submit --master "local[2]" --packages uk.co.gresearch.spark:spark-extension_${{ inputs.scala-compat-version }}:$SPARK_EXTENSION_VERSION "$test" test-results-submit
then
state="fail"
fi
echo
echo "::endgroup::"
done < tests
if [[ "$state" == "fail" ]]; then exit 1; fi
shell: bash
- name: Upload Test Results
if: always() && inputs.python-version != ''
uses: actions/upload-artifact@v4
with:
name: Python Release Test Results (Spark ${{ inputs.spark-version }} Scala ${{ inputs.scala-version }} Python ${{ inputs.python-version }})
path: |
test-results-submit/*.xml
branding:
icon: 'check-circle'
color: 'green'
================================================
FILE: .github/dependabot.yml
================================================
version: 2
updates:
- package-ecosystem: "github-actions"
directory: "/"
schedule:
interval: "monthly"
- package-ecosystem: "maven"
directory: "/"
schedule:
interval: "daily"
================================================
FILE: .github/show-spark-versions.sh
================================================
#!/bin/bash
base=$(cd "$(dirname "$0")"; pwd)
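# derive full spark-version values (compat plus patch version, with -SNAPSHOT where flagged) from the matrix in prime-caches.yml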
grep -- "-version" "$base"/workflows/prime-caches.yml | sed -e "s/ -//g" -e "s/ //g" -e "s/'//g" | grep -v -e "matrix" -e "]" | while read line
do
IFS=":" read var compat_version <<< "$line"
if [[ "$var" == "spark-compat-version" ]]
then
while read line
do
IFS=":" read var patch_version <<< "$line"
if [[ "$var" == "spark-patch-version" ]]
then
echo -n "spark-version: $compat_version.$patch_version"
read line
if [[ "$line" == "spark-snapshot-version:true" ]]
then
echo "-SNAPSHOT"
else
echo
fi
break
fi
done
fi
done > "$base"/workflows/prime-caches.yml.tmp
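# combine the derived versions with spark-version entries from all workflow files and print the unique, sorted set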
grep spark-version "$base"/workflows/*.yml "$base"/workflows/prime-caches.yml.tmp | cut -d : -f 2- | sed -e "s/^[ -]*//" -e "s/'//g" -e 's/{"params": {"//g' -e 's/params: {//g' -e 's/"//g' -e "s/,.*//" | grep "^spark-version" | grep -v "matrix" | sort | uniq
================================================
FILE: .github/workflows/build-jvm.yml
================================================
name: Build JVM
on:
workflow_call:
jobs:
build:
name: Build (Spark ${{ matrix.spark-version }} Scala ${{ matrix.scala-version }})
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
include:
- spark-version: '3.2.4'
spark-compat-version: '3.2'
scala-compat-version: '2.12'
scala-version: '2.12.15'
java-compat-version: '8'
hadoop-version: '2.7'
- spark-version: '3.3.4'
spark-compat-version: '3.3'
scala-compat-version: '2.12'
scala-version: '2.12.15'
java-compat-version: '8'
hadoop-version: '3'
- spark-version: '3.4.4'
spark-compat-version: '3.4'
scala-compat-version: '2.12'
scala-version: '2.12.17'
java-compat-version: '8'
hadoop-version: '3'
- spark-version: '3.5.8'
spark-compat-version: '3.5'
scala-compat-version: '2.12'
scala-version: '2.12.18'
java-compat-version: '8'
hadoop-version: '3'
- spark-version: '3.2.4'
spark-compat-version: '3.2'
scala-compat-version: '2.13'
scala-version: '2.13.5'
java-compat-version: '8'
hadoop-version: '3.2'
- spark-version: '3.3.4'
spark-compat-version: '3.3'
scala-compat-version: '2.13'
scala-version: '2.13.8'
java-compat-version: '8'
hadoop-version: '3'
- spark-version: '3.4.4'
spark-compat-version: '3.4'
scala-compat-version: '2.13'
scala-version: '2.13.8'
java-compat-version: '8'
hadoop-version: '3'
- spark-version: '3.5.8'
spark-compat-version: '3.5'
scala-compat-version: '2.13'
scala-version: '2.13.8'
java-compat-version: '8'
hadoop-version: '3'
- spark-version: '4.0.2'
spark-compat-version: '4.0'
scala-compat-version: '2.13'
scala-version: '2.13.16'
java-compat-version: '17'
hadoop-version: '3'
- spark-version: '4.1.1'
spark-compat-version: '4.1'
scala-compat-version: '2.13'
scala-version: '2.13.17'
java-compat-version: '17'
hadoop-version: '3'
- spark-version: '4.2.0-preview3'
spark-compat-version: '4.2'
scala-compat-version: '2.13'
scala-version: '2.13.18'
java-compat-version: '17'
hadoop-version: '3'
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Build
uses: ./.github/actions/build
with:
spark-version: ${{ matrix.spark-version }}
scala-version: ${{ matrix.scala-version }}
spark-compat-version: ${{ matrix.spark-compat-version }}
scala-compat-version: ${{ matrix.scala-compat-version }}
java-compat-version: ${{ matrix.java-compat-version }}
hadoop-version: ${{ matrix.hadoop-version }}
================================================
FILE: .github/workflows/build-python.yml
================================================
name: Build Python
on:
workflow_call:
jobs:
# pyspark<4 is not available for snapshots or scala other than 2.12
whl:
name: Build whl (Spark ${{ matrix.spark-version }} Scala ${{ matrix.scala-version }})
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
include:
- spark-compat-version: '3.2'
spark-version: '3.2.4'
scala-compat-version: '2.12'
scala-version: '2.12.15'
java-compat-version: '8'
python-version: '3.9'
- spark-compat-version: '3.3'
spark-version: '3.3.4'
scala-compat-version: '2.12'
scala-version: '2.12.15'
java-compat-version: '8'
python-version: '3.9'
- spark-compat-version: '3.4'
spark-version: '3.4.4'
scala-compat-version: '2.12'
scala-version: '2.12.17'
java-compat-version: '8'
python-version: '3.9'
- spark-compat-version: '3.5'
spark-version: '3.5.8'
scala-compat-version: '2.12'
scala-version: '2.12.18'
java-compat-version: '8'
python-version: '3.9'
- spark-compat-version: '4.0'
spark-version: '4.0.2'
scala-compat-version: '2.13'
scala-version: '2.13.16'
java-compat-version: '17'
python-version: '3.9'
- spark-version: '4.1.1'
spark-compat-version: '4.1'
scala-compat-version: '2.13'
scala-version: '2.13.17'
java-compat-version: '17'
hadoop-version: '3'
python-version: '3.10'
- spark-version: '4.2.0-preview3'
spark-compat-version: '4.2'
scala-compat-version: '2.13'
scala-version: '2.13.18'
java-compat-version: '17'
hadoop-version: '3'
python-version: '3.10'
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Build
uses: ./.github/actions/build-whl
with:
spark-version: ${{ matrix.spark-version }}
scala-version: ${{ matrix.scala-version }}
spark-compat-version: ${{ matrix.spark-compat-version }}
scala-compat-version: ${{ matrix.scala-compat-version }}
java-compat-version: ${{ matrix.java-compat-version }}
python-version: ${{ matrix.python-version }}
================================================
FILE: .github/workflows/build-snapshots.yml
================================================
name: Build Snapshots
on:
workflow_call:
jobs:
build:
name: Build (Spark ${{ matrix.spark-version }} Scala ${{ matrix.scala-version }})
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
include:
- spark-compat-version: '3.2'
spark-version: '3.2.5-SNAPSHOT'
scala-compat-version: '2.12'
scala-version: '2.12.15'
java-compat-version: '8'
- spark-compat-version: '3.3'
spark-version: '3.3.5-SNAPSHOT'
scala-compat-version: '2.12'
scala-version: '2.12.15'
java-compat-version: '8'
- spark-compat-version: '3.4'
spark-version: '3.4.5-SNAPSHOT'
scala-compat-version: '2.12'
scala-version: '2.12.17'
java-compat-version: '8'
- spark-compat-version: '3.5'
spark-version: '3.5.9-SNAPSHOT'
scala-compat-version: '2.12'
scala-version: '2.12.18'
java-compat-version: '8'
- spark-compat-version: '3.2'
spark-version: '3.2.5-SNAPSHOT'
scala-compat-version: '2.13'
scala-version: '2.13.5'
java-compat-version: '8'
- spark-compat-version: '3.3'
spark-version: '3.3.5-SNAPSHOT'
scala-compat-version: '2.13'
scala-version: '2.13.8'
java-compat-version: '8'
- spark-compat-version: '3.4'
spark-version: '3.4.5-SNAPSHOT'
scala-compat-version: '2.13'
scala-version: '2.13.8'
java-compat-version: '8'
- spark-compat-version: '3.5'
spark-version: '3.5.9-SNAPSHOT'
scala-compat-version: '2.13'
scala-version: '2.13.8'
java-compat-version: '8'
- spark-compat-version: '4.0'
spark-version: '4.0.3-SNAPSHOT'
scala-compat-version: '2.13'
scala-version: '2.13.16'
java-compat-version: '17'
- spark-compat-version: '4.1'
spark-version: '4.1.2-SNAPSHOT'
scala-compat-version: '2.13'
scala-version: '2.13.17'
java-compat-version: '17'
- spark-compat-version: '4.2'
spark-version: '4.2.0-SNAPSHOT'
scala-compat-version: '2.13'
scala-version: '2.13.18'
java-compat-version: '17'
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Build
uses: ./.github/actions/build
with:
spark-version: ${{ matrix.spark-version }}
scala-version: ${{ matrix.scala-version }}
spark-compat-version: ${{ matrix.spark-compat-version }}-SNAPSHOT
scala-compat-version: ${{ matrix.scala-compat-version }}
java-compat-version: ${{ matrix.java-compat-version }}
================================================
FILE: .github/workflows/check.yml
================================================
name: Check
on:
workflow_call:
jobs:
lint:
name: Scala lint
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Setup JDK 11
uses: actions/setup-java@v4
with:
java-version: '11'
distribution: 'zulu'
- name: Check
id: check
run: |
mvn --batch-mode --update-snapshots spotless:check
shell: bash
- name: Changes
if: failure() && steps.check.outcome == 'failure'
run: |
mvn --batch-mode --update-snapshots spotless:apply
git diff
shell: bash
config:
name: Configure compat
runs-on: ubuntu-latest
outputs:
major-version: ${{ steps.versions.outputs.major-version }}
release-version: ${{ steps.versions.outputs.release-version }}
release-major-version: ${{ steps.versions.outputs.release-major-version }}
steps:
- name: Checkout
uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Get versions
id: versions
run: |
version=$(grep -m1 version pom.xml | sed -e "s/<[^>]*>//g" -e "s/ //g")
echo "version: $version"
echo "major-version: ${version/.*/}"
echo "version=$version" >> "$GITHUB_OUTPUT"
echo "major-version=${version/.*/}" >> "$GITHUB_OUTPUT"
release_version=$(git tag | grep "^v" | sort --version-sort | tail -n1 | sed "s/^v//")
echo "release-version: $release_version"
echo "release-major-version: ${release_version/.*/}"
echo "release-version=$release_version" >> "$GITHUB_OUTPUT"
echo "release-major-version=${release_version/.*/}" >> "$GITHUB_OUTPUT"
shell: bash
compat:
name: Compat (Spark ${{ matrix.spark-compat-version }} Scala ${{ matrix.scala-compat-version }})
needs: config
runs-on: ubuntu-latest
if: needs.config.outputs.major-version == needs.config.outputs.release-major-version
strategy:
fail-fast: false
matrix:
include:
- spark-compat-version: '3.2'
spark-version: '3.2.4'
scala-compat-version: '2.12'
scala-version: '2.12.15'
- spark-compat-version: '3.3'
spark-version: '3.3.4'
scala-compat-version: '2.12'
scala-version: '2.12.15'
- spark-compat-version: '3.4'
scala-compat-version: '2.12'
scala-version: '2.12.17'
spark-version: '3.4.4'
- spark-compat-version: '3.5'
scala-compat-version: '2.12'
scala-version: '2.12.18'
spark-version: '3.5.8'
- spark-compat-version: '4.0'
scala-compat-version: '2.13'
scala-version: '2.13.16'
spark-version: '4.0.2'
- spark-compat-version: '4.1'
scala-compat-version: '2.13'
scala-version: '2.13.17'
spark-version: '4.1.1'
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Check
uses: ./.github/actions/check-compat
with:
spark-version: ${{ matrix.spark-version }}
scala-version: ${{ matrix.scala-version }}
spark-compat-version: ${{ matrix.spark-compat-version }}
scala-compat-version: ${{ matrix.scala-compat-version }}
package-version: ${{ needs.config.outputs.release-version }}
================================================
FILE: .github/workflows/ci.yml
================================================
name: CI
on:
schedule:
- cron: '0 8 */10 * *'
push:
branches:
- 'master'
tags:
- '*'
merge_group:
pull_request:
workflow_dispatch:
jobs:
event_file:
name: "Event File"
runs-on: ubuntu-latest
steps:
- name: Upload
uses: actions/upload-artifact@v4
with:
name: Event File
path: ${{ github.event_path }}
build-jvm:
name: "Build JVM"
uses: "./.github/workflows/build-jvm.yml"
build-snapshots:
name: "Build Snapshots"
uses: "./.github/workflows/build-snapshots.yml"
build-python:
name: "Build Python"
needs: build-jvm
uses: "./.github/workflows/build-python.yml"
test-jvm:
name: "Test JVM"
needs: build-jvm
uses: "./.github/workflows/test-jvm.yml"
test-python:
name: "Test Python"
needs: build-jvm
uses: "./.github/workflows/test-python.yml"
test-snapshots-jvm:
name: "Test Snapshots"
needs: build-snapshots
uses: "./.github/workflows/test-snapshots.yml"
test-release:
name: "Test Release"
needs: build-jvm
uses: "./.github/workflows/test-release.yml"
check:
name: "Check"
needs: build-jvm
uses: "./.github/workflows/check.yml"
# A single job that succeeds if all jobs listed under 'needs' succeed.
# This allows configuring a single job as a required check.
# The 'needed' jobs can then be changed through pull requests.
test_success:
name: "Test success"
if: always()
runs-on: ubuntu-latest
# the if clauses below have to reflect the number of jobs listed here
needs: [build-jvm, build-python, test-jvm, test-python, test-release]
env:
RESULTS: ${{ join(needs.*.result, ',') }}
steps:
- name: "Success"
# we expect all required jobs to have success result
if: env.RESULTS == 'success,success,success,success,success'
run: true
shell: bash
- name: "Failure"
# we expect all required jobs to have success result, fail otherwise
if: env.RESULTS != 'success,success,success,success,success'
run: false
shell: bash
================================================
FILE: .github/workflows/clear-caches.yaml
================================================
name: Clear caches
on:
workflow_dispatch:
permissions:
actions: write
jobs:
clear-cache:
runs-on: ubuntu-latest
steps:
- name: Clear caches
uses: actions/github-script@v7
with:
script: |
const caches = await github.paginate(
github.rest.actions.getActionsCacheList.endpoint.merge({
owner: context.repo.owner,
repo: context.repo.repo,
})
)
for (const cache of caches) {
console.log(cache)
github.rest.actions.deleteActionsCacheById({
owner: context.repo.owner,
repo: context.repo.repo,
cache_id: cache.id,
})
}
================================================
FILE: .github/workflows/prepare-release.yml
================================================
name: Prepare release
on:
workflow_dispatch:
inputs:
github_release_latest:
description: 'Make the created GitHub release the latest'
required: false
default: true
type: boolean
jobs:
get-version:
name: Get version
runs-on: ubuntu-latest
outputs:
release-tag: ${{ steps.versions.outputs.release-tag }}
is-snapshot: ${{ steps.versions.outputs.is-snapshot }}
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Get versions
id: versions
run: |
# get release version
version=$(grep --max-count=1 "<version>.*</version>" pom.xml | sed -E -e "s/\s*<[^>]+>//g" -e "s/-SNAPSHOT//" -e "s/-[0-9.]+//g")
is_snapshot=$(if grep -q "<version>.*-SNAPSHOT</version>" pom.xml; then echo "true"; else echo "false"; fi)
# share versions
echo "release-tag=v${version}" >> "$GITHUB_OUTPUT"
echo "is-snapshot=$is_snapshot" >> "$GITHUB_OUTPUT"
prepare-release:
name: Prepare release
runs-on: ubuntu-latest
if: ( ! github.event.repository.fork )
needs: get-version
# secrets are provided by environment
environment:
name: tagged
url: 'https://github.com/G-Research/spark-extension?version=${{ needs.get-version.outputs.release-tag }}'
steps:
- name: Create GitHub App token
uses: actions/create-github-app-token@v2
id: app-token
with:
app-id: ${{ vars.APP_ID }}
private-key: ${{ secrets.PRIVATE_KEY }}
# required to push to a branch
permission-contents: write
- name: Get GitHub App User ID
id: get-user-id
run: echo "user-id=$(gh api "/users/${{ steps.app-token.outputs.app-slug }}[bot]" --jq .id)" >> "$GITHUB_OUTPUT"
env:
GH_TOKEN: ${{ steps.app-token.outputs.token }}
- name: Checkout code
uses: actions/checkout@v4
with:
token: ${{ steps.app-token.outputs.token }}
fetch-depth: 0
- name: Check branch setup
run: |
# Check branch setup
if [[ "$GITHUB_REF" != "refs/heads/master" ]] && [[ "$GITHUB_REF" != "refs/heads/master-"* ]]
then
echo "This workflow must be run on master or master-* branch, not $GITHUB_REF"
exit 1
fi
- name: Tag and bump version
if: needs.get-version.outputs.is-snapshot
run: |
# check for unreleased entry in CHANGELOG.md
readarray -t changes < <(grep -A 100 "^## \[UNRELEASED\] - YYYY-MM-DD" CHANGELOG.md | grep -B 100 --max-count=1 -E "^## \[[0-9.]+\]" | grep "^-")
if [ ${#changes[@]} -eq 0 ]
then
echo "Did not find any changes in CHANGELOG.md under '## [UNRELEASED] - YYYY-MM-DD'"
exit 1
fi
# get latest and release version
latest=$(grep --max-count=1 "<version>.*</version>" README.md | sed -E -e "s/\s*<[^>]+>//g" -e "s/-[0-9.]+//g")
version=$(grep --max-count=1 "<version>.*</version>" pom.xml | sed -E -e "s/\s*<[^>]+>//g" -e "s/-SNAPSHOT//" -e "s/-[0-9.]+//g")
# update changelog
echo "Releasing ${#changes[@]} changes as version $version:"
for (( i=0; i<${#changes[@]}; i++ )); do echo "${changes[$i]}" ; done
sed -i "s/## \[UNRELEASED\] - YYYY-MM-DD/## [$version] - $(date +%Y-%m-%d)/" CHANGELOG.md
sed -i -e "s/$latest-/$version-/g" -e "s/$latest\./$version./g" README.md PYSPARK-DEPS.md python/README.md
./set-version.sh $version
# configure git so we can commit changes
git config --global user.name '${{ steps.app-token.outputs.app-slug }}[bot]'
git config --global user.email '${{ steps.get-user-id.outputs.user-id }}+${{ steps.app-token.outputs.app-slug }}[bot]@users.noreply.github.com'
# commit changes to local repo
echo "Committing release to local git"
git add pom.xml python/setup.py CHANGELOG.md README.md PYSPARK-DEPS.md python/README.md
git commit -m "Releasing $version"
git tag -a "v${version}" -m "Release v${version}"
# bump version
# define function to bump version
function next_version {
local version=$1
local branch=$2
patch=${version/*./}
majmin=${version%.${patch}}
if [[ $branch == "master" ]]
then
# minor version bump
if [[ $version != *".0" ]]
then
echo "version is patch version, should be M.m.0: $version" >&2
exit 1
fi
maj=${version/.*/}
min=${majmin#${maj}.}
next=${maj}.$((min+1)).0
echo "$next"
else
# patch version bump
next=${majmin}.$((patch+1))
echo "$next"
fi
}
# get next version
pkg_version="${version/-*/}"
branch=$(git rev-parse --abbrev-ref HEAD)
next_pkg_version="$(next_version "$pkg_version" "$branch")"
# bump the version
echo "Bump version to $next_pkg_version"
./set-version.sh $next_pkg_version-SNAPSHOT
# commit the version bump to local repo
echo "Committing version bump to local git"
git commit -a -m "Post-release version bump to $next_pkg_version"
# push all commits and tag to origin
echo "Pushing release commit and tag to origin"
git push origin "$GITHUB_REF_NAME" "v${version}" --tags
# NOTE: This push will not trigger a CI run as we are using GITHUB_TOKEN to push
# More info on: https://docs.github.com/en/actions/using-workflows/triggering-a-workflow#triggering-a-workflow-from-a-workflow
github-release:
name: Create GitHub release
runs-on: ubuntu-latest
needs:
- get-version
- prepare-release
permissions:
contents: write # required to create release
steps:
- name: Checkout release tag
uses: actions/checkout@v4
with:
ref: ${{ needs.get-version.outputs.release-tag }}
- name: Extract release notes
id: release-notes
run: |
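# print everything from the first '## ' heading up to (but not including) the next one, i.e. the notes of the latest release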
awk '/^## /{if(seen==1)exit; seen++} seen' CHANGELOG.md > ./release-notes.txt
# Grab release name
name=$(grep -m 1 "^## " CHANGELOG.md | sed "s/^## //")
echo "release_name=$name" >> $GITHUB_OUTPUT
# provide release notes file path as output
echo "release_notes_path=release-notes.txt" >> $GITHUB_OUTPUT
- name: Publish GitHub release
uses: ncipollo/release-action@2c591bcc8ecdcd2db72b97d6147f871fcd833ba5
id: github-release
with:
name: ${{ steps.release-notes.outputs.release_name }}
bodyFile: ${{ steps.release-notes.outputs.release_notes_path }}
makeLatest: ${{ inputs.github_release_latest }}
tag: ${{ needs.get-version.outputs.release-tag }}
token: ${{ github.token }}
================================================
FILE: .github/workflows/prime-caches.yml
================================================
name: Prime caches
on:
workflow_dispatch:
jobs:
prime:
name: Spark ${{ matrix.spark-compat-version }}.${{ matrix.spark-patch-version }}${{ matrix.spark-snapshot-version && '-SNAPSHOT' }} Scala ${{ matrix.scala-version }}
runs-on: ubuntu-latest
strategy:
fail-fast: false
# keep in-sync with .github/workflows/test-jvm.yml
matrix:
include:
- spark-compat-version: '3.2'
scala-compat-version: '2.12'
scala-version: '2.12.15'
spark-patch-version: '4'
hadoop-version: '2.7'
- spark-compat-version: '3.3'
scala-compat-version: '2.12'
scala-version: '2.12.15'
spark-patch-version: '4'
hadoop-version: '3'
- spark-compat-version: '3.4'
scala-compat-version: '2.12'
scala-version: '2.12.17'
spark-patch-version: '4'
hadoop-version: '3'
- spark-compat-version: '3.5'
scala-compat-version: '2.12'
scala-version: '2.12.18'
spark-patch-version: '8'
hadoop-version: '3'
- spark-compat-version: '3.2'
scala-compat-version: '2.13'
scala-version: '2.13.5'
spark-patch-version: '4'
hadoop-version: '3.2'
- spark-compat-version: '3.3'
scala-compat-version: '2.13'
scala-version: '2.13.8'
spark-patch-version: '4'
hadoop-version: '3'
- spark-compat-version: '3.4'
scala-compat-version: '2.13'
scala-version: '2.13.8'
spark-patch-version: '4'
hadoop-version: '3'
- spark-compat-version: '3.5'
scala-compat-version: '2.13'
scala-version: '2.13.8'
spark-patch-version: '8'
hadoop-version: '3'
- spark-compat-version: '4.0'
scala-compat-version: '2.13'
scala-version: '2.13.16'
spark-patch-version: '2'
java-compat-version: '17'
hadoop-version: '3'
- spark-compat-version: '4.1'
scala-compat-version: '2.13'
scala-version: '2.13.17'
spark-patch-version: '1'
java-compat-version: '17'
hadoop-version: '3'
- spark-compat-version: '4.2'
scala-compat-version: '2.13'
scala-version: '2.13.18'
spark-patch-version: '0-preview3'
java-compat-version: '17'
hadoop-version: '3'
- spark-compat-version: '3.2'
scala-compat-version: '2.12'
scala-version: '2.12.15'
spark-patch-version: '5'
spark-snapshot-version: true
hadoop-version: '2.7'
- spark-compat-version: '3.3'
scala-compat-version: '2.12'
scala-version: '2.12.15'
spark-patch-version: '5'
spark-snapshot-version: true
hadoop-version: '3'
- spark-compat-version: '3.4'
scala-compat-version: '2.12'
scala-version: '2.12.17'
spark-patch-version: '5'
spark-snapshot-version: true
hadoop-version: '3'
- spark-compat-version: '3.5'
scala-compat-version: '2.12'
scala-version: '2.12.18'
spark-patch-version: '9'
spark-snapshot-version: true
hadoop-version: '3'
- spark-compat-version: '3.2'
scala-compat-version: '2.13'
scala-version: '2.13.5'
spark-patch-version: '5'
spark-snapshot-version: true
hadoop-version: '3.2'
- spark-compat-version: '3.3'
scala-compat-version: '2.13'
scala-version: '2.13.8'
spark-patch-version: '5'
spark-snapshot-version: true
hadoop-version: '3'
- spark-compat-version: '3.4'
scala-compat-version: '2.13'
scala-version: '2.13.8'
spark-patch-version: '5'
spark-snapshot-version: true
hadoop-version: '3'
- spark-compat-version: '3.5'
scala-compat-version: '2.13'
scala-version: '2.13.8'
spark-patch-version: '9'
spark-snapshot-version: true
hadoop-version: '3'
- spark-compat-version: '4.0'
scala-compat-version: '2.13'
scala-version: '2.13.16'
spark-patch-version: '3'
spark-snapshot-version: true
hadoop-version: '3'
- spark-compat-version: '4.1'
scala-compat-version: '2.13'
scala-version: '2.13.17'
spark-patch-version: '2'
spark-snapshot-version: true
hadoop-version: '3'
- spark-compat-version: '4.2'
scala-compat-version: '2.13'
scala-version: '2.13.18'
spark-patch-version: '0'
spark-snapshot-version: true
hadoop-version: '3'
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Prime caches
uses: ./.github/actions/prime-caches
with:
spark-version: ${{ matrix.spark-compat-version }}.${{ matrix.spark-patch-version }}${{ matrix.spark-snapshot-version && '-SNAPSHOT' }}
scala-version: ${{ matrix.scala-version }}
spark-compat-version: ${{ matrix.spark-compat-version }}
scala-compat-version: ${{ matrix.scala-compat-version }}
hadoop-version: ${{ matrix.hadoop-version }}
java-compat-version: ${{ matrix.java-compat-version || '8' }}
================================================
FILE: .github/workflows/publish-release.yml
================================================
name: Publish release
on:
workflow_dispatch:
inputs:
versions:
required: true
type: string
description: 'Example: {"include": [{"params": {"spark-version": "4.0.0","scala-version": "2.13.16"}}]}'
default: |
{
"include": [
{"params": {"spark-version": "3.2.4", "scala-version": "2.12.15", "java-compat-version": "8"}},
{"params": {"spark-version": "3.3.4", "scala-version": "2.12.15", "java-compat-version": "8"}},
{"params": {"spark-version": "3.4.4", "scala-version": "2.12.17", "java-compat-version": "8"}},
{"params": {"spark-version": "3.5.8", "scala-version": "2.12.18", "java-compat-version": "8"}},
{"params": {"spark-version": "3.2.4", "scala-version": "2.13.5", "java-compat-version": "8"}},
{"params": {"spark-version": "3.3.4", "scala-version": "2.13.8", "java-compat-version": "8"}},
{"params": {"spark-version": "3.4.4", "scala-version": "2.13.8", "java-compat-version": "8"}},
{"params": {"spark-version": "3.5.8", "scala-version": "2.13.8", "java-compat-version": "8"}},
{"params": {"spark-version": "4.0.2", "scala-version": "2.13.16", "java-compat-version": "17"}},
{"params": {"spark-version": "4.1.1", "scala-version": "2.13.17", "java-compat-version": "17"}}
]
}
env:
# PySpark 3 versions only work with Python 3.9
PYTHON_VERSION: "3.9"
jobs:
get-version:
name: Get version
runs-on: ubuntu-latest
outputs:
release-tag: ${{ steps.versions.outputs.release-tag }}
is-snapshot: ${{ steps.versions.outputs.is-snapshot }}
steps:
- name: Checkout release tag
uses: actions/checkout@v4
- name: Get versions
id: versions
run: |
# get release version
version=$(grep --max-count=1 "<version>.*</version>" pom.xml | sed -E -e "s/\s*<[^>]+>//g" -e "s/-SNAPSHOT//" -e "s/-[0-9.]+//g")
is_snapshot=$(if grep -q "<version>.*-SNAPSHOT</version>" pom.xml; then echo "true"; else echo "false"; fi)
# share versions
echo "release-tag=v${version}" >> "$GITHUB_OUTPUT"
echo "is-snapshot=$is_snapshot" >> "$GITHUB_OUTPUT"
- name: Check tag setup
run: |
# Check tag setup
if [[ "$GITHUB_REF" != "refs/tags/v"* ]]
then
echo "This workflow must be run on a tag, not $GITHUB_REF"
exit 1
fi
if [ "${{ steps.versions.outputs.is-snapshot }}" == "true" ]
then
echo "This is a tagged SNAPSHOT version. This is not allowed for release!"
exit 1
fi
if [ "${{ github.ref_name }}" != "${{ steps.versions.outputs.release-tag }}" ]
then
echo "The version in the pom.xml is ${{ steps.versions.outputs.release-tag }}"
echo "This tag is ${{ github.ref_name }}, which is different!"
exit 1
fi
- name: Show matrix
run: |
echo '${{ github.event.inputs.versions }}' | jq .
maven-release:
name: Publish maven release (Spark ${{ matrix.params.spark-version }}, Scala ${{ matrix.params.scala-version }})
runs-on: ubuntu-latest
needs: get-version
if: ( ! github.event.repository.fork )
# secrets are provided by environment
environment:
name: release
# a different URL for each point in the matrix, but the same URLs across commits
url: 'https://github.com/G-Research/spark-extension?version=${{ needs.get-version.outputs.release-tag }}&spark=${{ matrix.params.spark-version }}&scala=${{ matrix.params.scala-version }}&package=maven'
permissions: {}
strategy:
fail-fast: false
matrix: ${{ fromJson(github.event.inputs.versions) }}
steps:
- name: Checkout release tag
uses: actions/checkout@v4
- name: Set up JDK and publish to Maven Central
uses: actions/setup-java@3a4f6e1af504cf6a31855fa899c6aa5355ba6c12 # v4.7.0
with:
java-version: ${{ matrix.params.java-compat-version }}
distribution: 'corretto'
server-id: central
server-username: MAVEN_USERNAME
server-password: MAVEN_PASSWORD
gpg-private-key: ${{ secrets.MAVEN_GPG_PRIVATE_KEY }}
gpg-passphrase: MAVEN_GPG_PASSPHRASE
- name: Inspect GPG
run: gpg -k
- name: Restore Maven packages cache
id: cache-maven
uses: actions/cache/restore@v4
with:
path: ~/.m2/repository
key: ${{ runner.os }}-mvn-build-${{ matrix.params.spark-version }}-${{ matrix.params.scala-version }}-${{ hashFiles('pom.xml') }}
restore-keys: |
${{ runner.os }}-mvn-build-${{ matrix.params.spark-version }}-${{ matrix.params.scala-version }}-${{ hashFiles('pom.xml') }}
${{ runner.os }}-mvn-build-${{ matrix.params.spark-version }}-${{ matrix.params.scala-version }}-
- name: Publish maven artifacts
id: publish-maven
run: |
./set-version.sh ${{ matrix.params.spark-version }} ${{ matrix.params.scala-version }}
mvn clean deploy -Dsign -Dspotless.check.skip -DskipTests -Dmaven.test.skip=true
env:
MAVEN_USERNAME: ${{ secrets.MAVEN_USERNAME }}
MAVEN_PASSWORD: ${{ secrets.MAVEN_PASSWORD }}
MAVEN_GPG_PASSPHRASE: ${{ secrets.MAVEN_GPG_PASSPHRASE}}
pypi-release:
name: Publish PyPi release (Spark ${{ matrix.params.spark-version }}, Scala ${{ matrix.params.scala-version }})
runs-on: ubuntu-latest
needs: get-version
if: ( ! github.event.repository.fork )
# secrets are provided by environment
environment:
name: release
# a different URL for each point in the matrix, but the same URLs across commits
url: 'https://github.com/G-Research/spark-extension?version=${{ needs.get-version.outputs.release-tag }}&spark=${{ matrix.params.spark-version }}&scala=${{ matrix.params.scala-version }}&package=pypi'
permissions:
id-token: write # required for PyPI publish
strategy:
fail-fast: false
matrix: ${{ fromJson(github.event.inputs.versions) }}
steps:
- name: Checkout release tag
uses: actions/checkout@v4
- name: Set up JDK
uses: actions/setup-java@3a4f6e1af504cf6a31855fa899c6aa5355ba6c12 # v4.7.0
with:
java-version: ${{ matrix.params.java-compat-version }}
distribution: 'corretto'
- uses: actions/setup-python@v5
with:
python-version: ${{ env.PYTHON_VERSION }}
- name: Restore Maven packages cache
id: cache-maven
uses: actions/cache/restore@v4
with:
path: ~/.m2/repository
key: ${{ runner.os }}-mvn-build-${{ matrix.params.spark-version }}-${{ matrix.params.scala-version }}-${{ hashFiles('pom.xml') }}
restore-keys: |
${{ runner.os }}-mvn-build-${{ matrix.params.spark-version }}-${{ matrix.params.scala-version }}-${{ hashFiles('pom.xml') }}
${{ runner.os }}-mvn-build-${{ matrix.params.spark-version }}-${{ matrix.params.scala-version }}-
- name: Build maven artifacts
id: maven
if: startsWith(matrix.params.spark-version, '3.') && startsWith(matrix.params.scala-version, '2.12.') || startsWith(matrix.params.spark-version, '4.') && startsWith(matrix.params.scala-version, '2.13.')
run: |
./set-version.sh ${{ matrix.params.spark-version }} ${{ matrix.params.scala-version }}
mvn clean package -Dspotless.check.skip -DskipTests -Dmaven.test.skip=true
- name: Prepare PyPi package
id: prepare-pypi-package
if: steps.maven.outcome == 'success'
run: |
./build-whl.sh
- name: Publish package distributions to PyPI
uses: pypa/gh-action-pypi-publish@release/v1
if: steps.prepare-pypi-package.outcome == 'success'
with:
packages-dir: python/dist
skip-existing: true
verbose: true
================================================
FILE: .github/workflows/publish-snapshot.yml
================================================
name: Publish snapshot
on:
workflow_dispatch:
push:
branches: ["master"]
env:
PYTHON_VERSION: "3.10"
jobs:
check-version:
name: Check SNAPSHOT version
if: ( ! github.event.repository.fork )
runs-on: ubuntu-latest
permissions: {}
outputs:
is-snapshot: ${{ steps.check.outputs.is-snapshot }}
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Check if this is a SNAPSHOT version
id: check
run: |
# check is snapshot version
if grep -q "<version>.*-SNAPSHOT</version>" pom.xml
then
echo "Version in pom IS a SNAPSHOT version"
echo "is-snapshot=true" >> "$GITHUB_OUTPUT"
else
echo "Version in pom is NOT a SNAPSHOT version"
echo "is-snapshot=false" >> "$GITHUB_OUTPUT"
fi
snapshot:
name: Snapshot Spark ${{ matrix.params.spark-version }} Scala ${{ matrix.params.scala-version }}
needs: check-version
# when we release from master, this workflow will see a commit that does not have a SNAPSHOT version
# we want this workflow to skip over that commit
if: needs.check-version.outputs.is-snapshot == 'true'
runs-on: ubuntu-latest
# secrets are provided by environment
environment:
name: snapshot
# a different URL for each point in the matrix, but the same URLs across commits
url: 'https://github.com/G-Research/spark-extension?spark=${{ matrix.params.spark-version }}&scala=${{ matrix.params.scala-version }}&snapshot'
permissions: {}
strategy:
fail-fast: false
matrix:
include:
- params: {"spark-version": "3.2.4", "scala-version": "2.12.15", "scala-compat-version": "2.12", "java-compat-version": "8"}
- params: {"spark-version": "3.3.4", "scala-version": "2.12.15", "scala-compat-version": "2.12", "java-compat-version": "8"}
- params: {"spark-version": "3.4.4", "scala-version": "2.12.17", "scala-compat-version": "2.12", "java-compat-version": "8"}
- params: {"spark-version": "3.5.8", "scala-version": "2.12.18", "scala-compat-version": "2.12", "java-compat-version": "8"}
- params: {"spark-version": "3.2.4", "scala-version": "2.13.5", "scala-compat-version": "2.13", "java-compat-version": "8"}
- params: {"spark-version": "3.3.4", "scala-version": "2.13.8", "scala-compat-version": "2.13", "java-compat-version": "8"}
- params: {"spark-version": "3.4.4", "scala-version": "2.13.8", "scala-compat-version": "2.13", "java-compat-version": "8"}
- params: {"spark-version": "3.5.8", "scala-version": "2.13.8", "scala-compat-version": "2.13", "java-compat-version": "8"}
- params: {"spark-version": "4.0.2", "scala-version": "2.13.16", "scala-compat-version": "2.13", "java-compat-version": "17"}
- params: {"spark-version": "4.1.1", "scala-version": "2.13.17", "scala-compat-version": "2.13", "java-compat-version": "17"}
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up JDK and publish to Maven Central
uses: actions/setup-java@3a4f6e1af504cf6a31855fa899c6aa5355ba6c12 # v4.7.0
with:
java-version: ${{ matrix.params.java-compat-version }}
distribution: 'corretto'
server-id: central
server-username: MAVEN_USERNAME
server-password: MAVEN_PASSWORD
gpg-private-key: ${{ secrets.MAVEN_GPG_PRIVATE_KEY }}
gpg-passphrase: MAVEN_GPG_PASSPHRASE
- name: Inspect GPG
run: gpg -k
- uses: actions/setup-python@v5
with:
python-version: ${{ env.PYTHON_VERSION }}
- name: Restore Maven packages cache
id: cache-maven
uses: actions/cache/restore@v4
with:
path: ~/.m2/repository
key: ${{ runner.os }}-mvn-build-${{ matrix.params.spark-version }}-${{ matrix.params.scala-version }}-${{ hashFiles('pom.xml') }}
restore-keys: |
${{ runner.os }}-mvn-build-${{ matrix.params.spark-version }}-${{ matrix.params.scala-version }}-${{ hashFiles('pom.xml') }}
${{ runner.os }}-mvn-build-${{ matrix.params.spark-version }}-${{ matrix.params.scala-version }}-
- name: Publish snapshot
run: |
./set-version.sh ${{ matrix.params.spark-version }} ${{ matrix.params.scala-version }}
mvn clean deploy -Dsign -Dspotless.check.skip -DskipTests -Dmaven.test.skip=true
env:
MAVEN_USERNAME: ${{ secrets.MAVEN_USERNAME }}
MAVEN_PASSWORD: ${{ secrets.MAVEN_PASSWORD }}
MAVEN_GPG_PASSPHRASE: ${{ secrets.MAVEN_GPG_PASSPHRASE}}
- name: Prepare PyPi package to test snapshot
if: startsWith(matrix.params.scala-version, '2.12.')
run: |
# Build whl
./build-whl.sh
- name: Restore Spark Binaries cache
uses: actions/cache/restore@v4
with:
path: ~/spark
key: ${{ runner.os }}-spark-binaries-${{ matrix.params.spark-version }}-${{ matrix.params.scala-compat-version }}
restore-keys: |
${{ runner.os }}-spark-binaries-${{ matrix.params.spark-version }}-${{ matrix.params.scala-compat-version }}
- name: Rename Spark Binaries cache
run: |
mv ~/spark ./spark-${{ matrix.params.spark-version }}-${{ matrix.params.scala-compat-version }}
- name: Test snapshot
id: test-package
run: |
# Test the snapshot (needs whl)
./test-release.sh
================================================
FILE: .github/workflows/test-jvm.yml
================================================
name: Test JVM
on:
workflow_call:
jobs:
test:
name: Test (Spark ${{ matrix.spark-compat-version }}.${{ matrix.spark-patch-version }} Scala ${{ matrix.scala-version }})
runs-on: ubuntu-latest
strategy:
fail-fast: false
# keep in-sync with .github/workflows/prime-caches.yml
matrix:
include:
- spark-compat-version: '3.2'
scala-compat-version: '2.12'
scala-version: '2.12.15'
spark-patch-version: '4'
java-compat-version: '8'
hadoop-version: '2.7'
- spark-compat-version: '3.3'
scala-compat-version: '2.12'
scala-version: '2.12.15'
spark-patch-version: '4'
java-compat-version: '8'
hadoop-version: '3'
- spark-compat-version: '3.4'
scala-compat-version: '2.12'
scala-version: '2.12.17'
spark-patch-version: '4'
java-compat-version: '8'
hadoop-version: '3'
- spark-compat-version: '3.5'
scala-compat-version: '2.12'
scala-version: '2.12.18'
spark-patch-version: '7'
java-compat-version: '8'
hadoop-version: '3'
- spark-compat-version: '3.2'
scala-compat-version: '2.13'
scala-version: '2.13.5'
spark-patch-version: '4'
java-compat-version: '8'
hadoop-version: '3.2'
- spark-compat-version: '3.3'
scala-compat-version: '2.13'
scala-version: '2.13.8'
spark-patch-version: '4'
java-compat-version: '8'
hadoop-version: '3'
- spark-compat-version: '3.4'
scala-compat-version: '2.13'
scala-version: '2.13.8'
spark-patch-version: '4'
java-compat-version: '8'
hadoop-version: '3'
- spark-compat-version: '3.5'
scala-compat-version: '2.13'
scala-version: '2.13.8'
spark-patch-version: '7'
java-compat-version: '8'
hadoop-version: '3'
- spark-compat-version: '4.0'
scala-compat-version: '2.13'
scala-version: '2.13.16'
spark-patch-version: '2'
java-compat-version: '17'
hadoop-version: '3'
- spark-compat-version: '4.1'
scala-compat-version: '2.13'
scala-version: '2.13.17'
spark-patch-version: '1'
java-compat-version: '17'
hadoop-version: '3'
- spark-compat-version: '4.2'
scala-compat-version: '2.13'
scala-version: '2.13.18'
spark-patch-version: '0-preview3'
java-compat-version: '17'
hadoop-version: '3'
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Test
uses: ./.github/actions/test-jvm
env:
CI_SLOW_TESTS: 1
with:
spark-version: ${{ matrix.spark-compat-version }}.${{ matrix.spark-patch-version }}
scala-version: ${{ matrix.scala-version }}
spark-compat-version: ${{ matrix.spark-compat-version }}
spark-archive-url: ${{ matrix.spark-archive-url }}
scala-compat-version: ${{ matrix.scala-compat-version }}
java-compat-version: ${{ matrix.java-compat-version }}
hadoop-version: ${{ matrix.hadoop-version }}
================================================
FILE: .github/workflows/test-python.yml
================================================
name: Test Python
on:
workflow_call:
jobs:
# pyspark is not available for snapshots or scala other than 2.12
# we would have to compile spark from sources for this, not worth it
test:
name: Test (Spark ${{ matrix.spark-version }} Scala ${{ matrix.scala-version }} Python ${{ matrix.python-version }})
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
spark-compat-version: ['3.2', '3.3', '3.4', '3.5', '4.0']
python-version: ['3.9', '3.10', '3.11', '3.12', '3.13']
include:
- spark-compat-version: '3.2'
spark-version: '3.2.4'
scala-compat-version: '2.12'
scala-version: '2.12.15'
java-compat-version: '8'
hadoop-version: '2.7'
- spark-compat-version: '3.3'
spark-version: '3.3.4'
scala-compat-version: '2.12'
scala-version: '2.12.15'
java-compat-version: '8'
hadoop-version: '3'
- spark-compat-version: '3.4'
spark-version: '3.4.4'
scala-compat-version: '2.12'
scala-version: '2.12.17'
java-compat-version: '8'
hadoop-version: '3'
- spark-compat-version: '3.5'
spark-version: '3.5.8'
scala-compat-version: '2.12'
scala-version: '2.12.18'
java-compat-version: '8'
hadoop-version: '3'
- spark-compat-version: '4.0'
spark-version: '4.0.2'
scala-compat-version: '2.13'
scala-version: '2.13.16'
java-compat-version: '17'
hadoop-version: '3'
- spark-compat-version: '4.1'
spark-version: '4.1.1'
scala-compat-version: '2.13'
scala-version: '2.13.17'
java-compat-version: '17'
hadoop-version: '3'
python-version: '3.10'
- spark-compat-version: '4.2'
spark-version: '4.2.0-preview3'
scala-compat-version: '2.13'
scala-version: '2.13.18'
java-compat-version: '17'
hadoop-version: '3'
python-version: '3.10'
exclude:
- spark-compat-version: '3.2'
python-version: '3.10'
- spark-compat-version: '3.2'
python-version: '3.11'
- spark-compat-version: '3.2'
python-version: '3.12'
- spark-compat-version: '3.2'
python-version: '3.13'
- spark-compat-version: '3.3'
python-version: '3.11'
- spark-compat-version: '3.3'
python-version: '3.12'
- spark-compat-version: '3.3'
python-version: '3.13'
- spark-compat-version: '3.4'
python-version: '3.12'
- spark-compat-version: '3.4'
python-version: '3.13'
- spark-compat-version: '3.5'
python-version: '3.12'
- spark-compat-version: '3.5'
python-version: '3.13'
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Test
uses: ./.github/actions/test-python
with:
spark-version: ${{ matrix.spark-version }}
scala-version: ${{ matrix.scala-version }}
spark-compat-version: ${{ matrix.spark-compat-version }}
spark-archive-url: ${{ matrix.spark-archive-url }}
spark-package-repo: ${{ matrix.spark-package-repo }}
scala-compat-version: ${{ matrix.scala-compat-version }}
java-compat-version: ${{ matrix.java-compat-version }}
hadoop-version: ${{ matrix.hadoop-version }}
python-version: ${{ matrix.python-version }}
================================================
FILE: .github/workflows/test-release.yml
================================================
name: Test release
on:
workflow_call:
jobs:
test:
name: Test Release Spark ${{ matrix.spark-version }} Scala ${{ matrix.scala-version }}
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
include:
- spark-compat-version: '3.2'
spark-version: '3.2.4'
scala-compat-version: '2.12'
scala-version: '2.12.15'
java-compat-version: '8'
hadoop-version: '2.7'
python-version: '3.9'
- spark-compat-version: '3.3'
spark-version: '3.3.4'
scala-compat-version: '2.12'
scala-version: '2.12.15'
java-compat-version: '8'
hadoop-version: '3'
python-version: '3.10'
- spark-compat-version: '3.4'
spark-version: '3.4.4'
scala-compat-version: '2.12'
scala-version: '2.12.17'
java-compat-version: '8'
hadoop-version: '3'
python-version: '3.11'
- spark-compat-version: '3.5'
spark-version: '3.5.8'
scala-compat-version: '2.12'
scala-version: '2.12.18'
java-compat-version: '8'
hadoop-version: '3'
python-version: '3.11'
- spark-compat-version: '3.2'
spark-version: '3.2.4'
scala-compat-version: '2.13'
scala-version: '2.13.5'
java-compat-version: '8'
hadoop-version: '3.2'
- spark-compat-version: '3.3'
spark-version: '3.3.4'
scala-compat-version: '2.13'
scala-version: '2.13.8'
java-compat-version: '8'
hadoop-version: '3'
- spark-compat-version: '3.4'
spark-version: '3.4.4'
scala-compat-version: '2.13'
scala-version: '2.13.8'
java-compat-version: '8'
hadoop-version: '3'
- spark-compat-version: '3.5'
spark-version: '3.5.8'
scala-compat-version: '2.13'
scala-version: '2.13.8'
java-compat-version: '8'
hadoop-version: '3'
- spark-compat-version: '4.0'
spark-version: '4.0.2'
scala-compat-version: '2.13'
scala-version: '2.13.16'
java-compat-version: '17'
hadoop-version: '3'
python-version: '3.13'
- spark-compat-version: '4.1'
spark-version: '4.1.1'
scala-compat-version: '2.13'
scala-version: '2.13.17'
java-compat-version: '17'
hadoop-version: '3'
python-version: '3.13'
- spark-compat-version: '4.2'
spark-version: '4.2.0-preview3'
scala-compat-version: '2.13'
scala-version: '2.13.18'
java-compat-version: '17'
hadoop-version: '3'
python-version: '3.13'
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Test
uses: ./.github/actions/test-release
with:
spark-version: ${{ matrix.spark-version }}
scala-version: ${{ matrix.scala-version }}
spark-compat-version: ${{ matrix.spark-compat-version }}
spark-archive-url: ${{ matrix.spark-archive-url }}
scala-compat-version: ${{ matrix.scala-compat-version }}
java-compat-version: ${{ matrix.java-compat-version }}
hadoop-version: ${{ matrix.hadoop-version }}
python-version: ${{ matrix.python-version }}
================================================
FILE: .github/workflows/test-results.yml
================================================
name: Test Results
on:
workflow_run:
workflows: ["CI"]
types:
- completed
permissions: {}
jobs:
publish-test-results:
name: Publish Test Results
runs-on: ubuntu-latest
if: github.event.workflow_run.conclusion != 'skipped'
permissions:
checks: write
pull-requests: write
steps:
- name: Download and Extract Artifacts
uses: dawidd6/action-download-artifact@09f2f74827fd3a8607589e5ad7f9398816f540fe
with:
run_id: ${{ github.event.workflow_run.id }}
name: "^Event File$| Test Results "
name_is_regexp: true
path: artifacts
- name: Publish Test Results
uses: EnricoMi/publish-unit-test-result-action@v2
with:
commit: ${{ github.event.workflow_run.head_sha }}
event_file: artifacts/Event File/event.json
event_name: ${{ github.event.workflow_run.event }}
files: "artifacts/* Test Results*/**/*.xml"
================================================
FILE: .github/workflows/test-snapshots.yml
================================================
name: Test Snapshots
on:
workflow_call:
jobs:
test:
name: Test (Spark ${{ matrix.spark-version }} Scala ${{ matrix.scala-version }})
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
include:
- spark-compat-version: '3.2'
spark-version: '3.2.5-SNAPSHOT'
scala-compat-version: '2.12'
scala-version: '2.12.15'
java-compat-version: '8'
- spark-compat-version: '3.3'
spark-version: '3.3.5-SNAPSHOT'
scala-compat-version: '2.12'
scala-version: '2.12.15'
java-compat-version: '8'
- spark-compat-version: '3.4'
spark-version: '3.4.5-SNAPSHOT'
scala-compat-version: '2.12'
scala-version: '2.12.17'
java-compat-version: '8'
- spark-compat-version: '3.5'
spark-version: '3.5.9-SNAPSHOT'
scala-compat-version: '2.12'
scala-version: '2.12.18'
java-compat-version: '8'
- spark-compat-version: '3.2'
spark-version: '3.2.5-SNAPSHOT'
scala-compat-version: '2.13'
scala-version: '2.13.5'
java-compat-version: '8'
- spark-compat-version: '3.3'
spark-version: '3.3.5-SNAPSHOT'
scala-compat-version: '2.13'
scala-version: '2.13.8'
java-compat-version: '8'
- spark-compat-version: '3.4'
spark-version: '3.4.5-SNAPSHOT'
scala-compat-version: '2.13'
scala-version: '2.13.8'
java-compat-version: '8'
- spark-compat-version: '3.5'
spark-version: '3.5.9-SNAPSHOT'
scala-compat-version: '2.13'
scala-version: '2.13.8'
java-compat-version: '8'
- spark-compat-version: '4.0'
spark-version: '4.0.3-SNAPSHOT'
scala-compat-version: '2.13'
scala-version: '2.13.16'
java-compat-version: '17'
- spark-compat-version: '4.1'
spark-version: '4.1.2-SNAPSHOT'
scala-compat-version: '2.13'
scala-version: '2.13.17'
java-compat-version: '17'
- spark-compat-version: '4.2'
spark-version: '4.2.0-SNAPSHOT'
scala-compat-version: '2.13'
scala-version: '2.13.18'
java-compat-version: '17'
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Test
uses: ./.github/actions/test-jvm
env:
CI_SLOW_TESTS: 1
with:
spark-version: ${{ matrix.spark-version }}
scala-version: ${{ matrix.scala-version }}
spark-compat-version: ${{ matrix.spark-compat-version }}-SNAPSHOT
scala-compat-version: ${{ matrix.scala-compat-version }}
java-compat-version: ${{ matrix.java-compat-version }}
================================================
FILE: .gitignore
================================================
# use glob syntax.
syntax: glob
*.ser
*.class
*~
*.bak
#*.off
*.old
# eclipse conf file
.settings
.classpath
.project
.manager
.scala_dependencies
# idea
.idea
*.iml
# building
target
build
null
tmp*
temp*
dist
test-output
build.log
# other scm
.svn
.CVS
.hg*
# switch to regexp syntax.
# syntax: regexp
# ^\.pc/
# project specific
python/**/__pycache__
spark-*
.cache
================================================
FILE: .scalafmt.conf
================================================
version = 3.7.17
runner.dialect = scala213
rewrite.trailingCommas.style = keep
docstrings.style = Asterisk
maxColumn = 120
================================================
FILE: CHANGELOG.md
================================================
# Changelog
All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
## [2.15.0] - 2025-12-13
### Added
- Support encrypted parquet files (#324)
### Changed
- Remove support for Spark 3.0 and Spark 3.1 (#332)
- Make all undocumented, unintentionally public API parts private (#331)
- Reading Parquet metadata can use a Parquet Hadoop version different from the one shipped with Spark (#330)
## [2.14.2] - 2025-07-21
### Changed
- Fixed release process (#320)
## [2.14.1] - 2025-07-17
### Changed
- Fixed release process (#319)
## [2.14.0] - 2025-07-17
### Added
- Support for Spark 4.0 (#269, #272, #293)
### Changed
- Improve backticks (#265)
New: This escapes backticks that already exist in column names.
Change: This does not quote columns that only contain letters, numbers
and underscores, which were quoted before.
- Move Python dependencies into `setup.py`, build jar from `setup.py` (#301)
## [2.13.0] - 2024-11-04
### Fixed
- Support diff for Spark Connect implemented via PySpark Dataset API (#251)
### Added
- Add ignore columns to diff in Python API (#252)
- Check that the Java / Scala package is installed when needed by Python (#250)
## [2.12.0] - 2024-04-26
### Fixed
- Diff change column should respect comparators (#238)
### Changed
- Make create_temporary_dir work with pyspark-extension only (#222).
This allows [installing PIP packages and Poetry projects](PYSPARK-DEPS.md)
via pure Python spark-extension package (Maven package not required any more).
- Add map diff comparator to Python API (#226)
## [2.11.0] - 2024-01-04
### Added
- Add count_null aggregate function (#206)
- Support reading parquet schema (#208)
- Add more columns to reading parquet metadata (#209, #211)
- Provide groupByKey shortcuts for groupBy.as (#213)
- Allow to install PIP packages into PySpark job (#215)
- Allow to install Poetry projects into PySpark job (#216)
## [2.10.0] - 2023-09-27
### Fixed
- Update setup.py to include parquet methods in python package (#191)
### Added
- Add --statistics option to diff app (#189)
- Add --filter option to diff app (#190)
## [2.9.0] - 2023-08-23
### Added
- Add key order sensitive map comparator (#187)
### Changed
- Use dataset encoder rather than implicit value encoder for implicit dataset extension class (#183)
### Fixed
- Fix key-sensitivity in map comparator (#186)
## [2.8.0] - 2023-05-24
### Added
- Add method to set and automatically unset Spark job description. (#172)
- Add column function that converts between .Net (C#, F#, Visual Basic) `DateTime.Ticks` and Spark timestamp / Unix epoch timestamps. (#153)
## [2.7.0] - 2023-05-05
### Added
- Spark app to diff files or tables and write result back to file or table. (#160)
- Add null value count to `parquetBlockColumns` and `parquet_block_columns`. (#162)
- Add `parallelism` argument to Parquet metadata methods. (#164)
### Changed
- Change data type of column name in `parquetBlockColumns` and `parquet_block_columns` to array of strings.
Cast to string to get earlier behaviour (string column name). (#162)
## [2.6.0] - 2023-04-11
### Added
- Add reader for parquet metadata. (#154)
## [2.5.0] - 2023-03-23
### Added
- Add whitespace agnostic diff comparator. (#137)
- Add Python whl package build. (#151)
## [2.4.0] - 2022-12-08
### Added
- Allow for custom diff equality. (#127)
### Fixed
- Fix Python API calling into Scala code. (#132)
## [2.3.0] - 2022-10-26
### Added
- Add diffWith to Scala, Java and Python Diff API. (#109)
### Changed
- Diff similar Datasets with ignoreColumns. Before, only similar DataFrames could be diffed with ignoreColumns. (#111)
### Fixed
- Cache before writing via partitionedBy to work around SPARK-40588. Unpersist via UnpersistHandle. (#124)
## [2.2.0] - 2022-07-21
### Added
- Add (global) row numbers transformation to Scala, Java and Python API. (#97)
### Removed
- Removed support for Python 3.6
## [2.1.0] - 2022-04-07
### Added
- Add sorted group methods to Dataset. (#76)
## [2.0.0] - 2021-10-29
### Added
- Add support for Spark 3.2 and Scala 2.13.
- Support to ignore columns in diff API. (#63)
### Removed
- Removed support for Spark 2.4.
## [1.3.3] - 2020-12-17
### Added
- Add support for Spark 3.1.
## [1.3.2] - 2020-12-17
### Changed
- Refine conditional transformation helper methods.
## [1.3.1] - 2020-12-10
### Changed
- Refine conditional transformation helper methods.
## [1.3.0] - 2020-12-07
### Added
- Add transformation to compute histogram. (#26)
- Add conditional transformation helper methods. (#27)
- Add partitioned writing helpers that simplify writing optimally ordered partitioned data. (#29)
## [1.2.0] - 2020-10-06
### Added
- Add diff modes (#22): column-by-column, side-by-side, left and right side diff modes.
- Add sparse mode (#23): the diff DataFrame contains only changed values.
## [1.1.0] - 2020-08-24
### Added
- Add Python API for Diff transformation.
- Add change column to Diff transformation providing column names of all changed columns in a row.
- Add fluent methods to change immutable diff options.
- Add `backticks` method to handle column names that contain dots (`.`).
## [1.0.0] - 2020-03-12
### Added
- Add Diff transformation for Datasets.
================================================
FILE: CONDITIONAL.md
================================================
# DataFrame Transformations
The Spark `Dataset` API allows for chaining transformations as in the following example:
```scala
ds.where($"id" === 1)
.withColumn("state", lit("new"))
.orderBy($"timestamp")
```
When you define additional transformation functions, the `Dataset` API allows you to
also fluently call into those:
```scala
def transformation(df: DataFrame): DataFrame = df.distinct
ds.transform(transformation)
```
Here are some methods that extend this principle to conditional calls.
## Conditional Transformations
You can run a transformation after checking a condition with a chain of fluent transformation calls:
```scala
import uk.co.gresearch._
val condition = true
val result =
ds.where($"id" === 1)
.withColumn("state", lit("new"))
.when(condition).call(transformation)
.orderBy($"timestamp")
```
rather than
```scala
val condition = true
val filteredDf = ds.where($"id" === 1)
.withColumn("state", lit("new"))
val condDf = if (condition) transformation(filteredDf) else filteredDf
val result = condDf.orderBy($"timestamp")
```
In case you need an else transformation as well, try:
```scala
import uk.co.gresearch._
val condition = true
val result =
ds.where($"id" === 1)
.withColumn("state", lit("new"))
.on(condition).either(transformation).or(other)
.orderBy($"timestamp")
```
## Fluent and conditional functions elsewhere
The same fluent notation works for instances other than `Dataset` or `DataFrame`, e.g.
for the `DataFrameWriter`:
```scala
def writeData[T](writer: DataFrameWriter[T]): Unit = { ... }
ds.write
.when(compress).call(_.option("compression", "gzip"))
.call(writeData)
```
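The conditional either/or notation from above can be combined with this pattern too. A hedged sketch, assuming the generic extension also applies to `DataFrameWriter` and using a hypothetical `toParquet` flag introduced only for this example:
```scala
import uk.co.gresearch._
val toParquet = true
ds.write
  // choose the output format conditionally, then hand the writer to writeData
  .on(toParquet).either(_.format("parquet")).or(_.format("csv"))
  .call(writeData)
```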
================================================
FILE: DIFF.md
================================================
# Spark Diff
Add the following `import` to your Scala code:
```scala
import uk.co.gresearch.spark.diff._
```
or this `import` to your Python code:
```python
# noinspection PyUnresolvedReferences
from gresearch.spark.diff import *
```
This adds a `diff` transformation to `Dataset` and `DataFrame` that computes the differences between two datasets / dataframes,
i.e. which rows of one dataset / dataframe to _add_, _delete_ or _change_ to get to the other dataset / dataframe.
For example, in Scala
```scala
val left = Seq((1, "one"), (2, "two"), (3, "three")).toDF("id", "value")
val right = Seq((1, "one"), (2, "Two"), (4, "four")).toDF("id", "value")
```
or in Python:
```python
left = spark.createDataFrame([(1, "one"), (2, "two"), (3, "three")], ["id", "value"])
right = spark.createDataFrame([(1, "one"), (2, "Two"), (4, "four")], ["id", "value"])
```
diffing becomes as easy as:
```scala
left.diff(right).show()
```
|diff |id |value |
|:---:|:---:|:-----:|
| N| 1| one|
| D| 2| two|
| I| 2| Two|
| D| 3| three|
| I| 4| four|
With columns that provide unique identifiers per row (here `id`), the diff looks like:
```scala
left.diff(right, "id").show()
```
|diff |id |left_value|right_value|
|:---:|:---:|:--------:|:---------:|
| N| 1| one| one|
| C| 2| two| Two|
| D| 3| three| *null*|
| I| 4| *null*| four|
An equivalent alternative is this hand-crafted transformation (Scala):
```scala
left.withColumn("exists", lit(1)).as("l")
.join(right.withColumn("exists", lit(1)).as("r"),
$"l.id" <=> $"r.id",
"fullouter")
.withColumn("diff",
when($"l.exists".isNull, "I").
when($"r.exists".isNull, "D").
when(!($"l.value" <=> $"r.value"), "C").
otherwise("N"))
.show()
```
Statistics on the differences can be obtained by
```scala
left.diff(right, "id").groupBy("diff").count().show()
```
|diff |count |
|:----:|:-----:|
| N| 1|
| I| 1|
| D| 1|
| C| 1|
The `diff` transformation can optionally provide a *change column* that lists all non-id column names that have changed.
This column is an array of strings and only set for `"N"` and `"C"` action rows; it is *null* for `"I"` and `"D"` action rows.
|diff |changes|id |left_value|right_value|
|:---:|:-----:|:---:|:--------:|:---------:|
| N| []| 1| one| one|
| C|[value]| 2| two| Two|
| D| *null*| 3| three| *null*|
| I| *null*| 4| *null*| four|
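The change column is enabled through `DiffOptions`; a minimal sketch (assuming the `left` and `right` DataFrames from above) that yields the change column shown:
```scala
import uk.co.gresearch.spark.diff._
// name the change column "changes" on top of the default options
val optionsWithChanges = DiffOptions.default.withChangeColumn("changes")
left.diff(right, optionsWithChanges, "id").show()
```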
## Features
This `diff` transformation provides the following features:
* id columns are optional
* provides typed `diffAs` and `diffWith` transformations
* supports *null* values in id and non-id columns
* detects *null* value insertion / deletion
* [configurable](#configuring-diff) via `DiffOptions`:
* diff column name (default: `"diff"`), in case the default name already exists in the diff result schema
* diff action labels (defaults: `"N"`, `"I"`, `"D"`, `"C"`), allows custom diff notation,<br/> e.g. Unix diff left-right notation (<, >) or git before-after format (+, -, -+)
* [custom equality operators](#comparators-equality) (e.g. double comparison with epsilon threshold)
* [different diff result formats](#diffing-modes)
* [sparse diffing mode](#sparse-mode)
* optionally provides a *change column* that lists all non-id column names that have changed (only for `"C"` action rows)
* guarantees that no duplicate columns exist in the result, throws a readable exception otherwise
## Configuring Diff
Diffing can be configured via an optional `DiffOptions` instance (see [Methods](#methods) below).
|option |default |description|
|--------------------|:-------:|-----------|
|`diffColumn` |`"diff"` |The 'diff column' provides the action or diff value encoding if the respective row has been inserted, changed, deleted or has not been changed at all.|
|`leftColumnPrefix` |`"left"` |Non-id columns of the 'left' dataset are prefixed with this prefix.|
|`rightColumnPrefix` |`"right"`|Non-id columns of the 'right' dataset are prefixed with this prefix.|
|`insertDiffValue` |`"I"` |Inserted rows are marked with this string in the 'diff column'.|
|`changeDiffValue` |`"C"` |Changed rows are marked with this string in the 'diff column'.|
|`deleteDiffValue` |`"D"` |Deleted rows are marked with this string in the 'diff column'.|
|`nochangeDiffValue` |`"N"` |Unchanged rows are marked with this string in the 'diff column'.|
|`changeColumn` |*none* |An array with the names of all columns that have changed values is provided in this column (only for unchanged and changed rows, *null* otherwise).|
|`diffMode` |`DiffModes.Default`|Configures the diff output format. For details see [Diff Modes](#diff-modes) section below.|
|`sparseMode` |`false` |When `true`, only values that have changed are provided on left and right side, `null` is used for un-changed values.|
|`defaultComparator` |`DiffComparators.default()`|The default equality for all value columns.|
|`dataTypeComparators`|_empty_ |Map from data types to comparators.|
|`columnNameComparators`|_empty_|Map from column names to comparators.|
Either construct an instance via the constructor …
```scala
// Scala
import uk.co.gresearch.spark.diff.{DiffOptions, DiffMode}
val options = DiffOptions("d", "l", "r", "i", "c", "d", "n", Some("changes"), DiffMode.Default, false)
```
```python
# Python
from gresearch.spark.diff import DiffOptions, DiffMode
options = DiffOptions("d", "l", "r", "i", "c", "d", "n", "changes", DiffMode.Default, False)
```
… or via the `.with*` methods. The former requires most options to be specified, whereas the latter
only requires those that deviate from the defaults, and it is more readable.
Start from the default options `DiffOptions.default` and customize as follows:
```scala
// Scala
import uk.co.gresearch.spark.diff.{DiffOptions, DiffMode, DiffComparators}
val options = DiffOptions.default
.withDiffColumn("d")
.withLeftColumnPrefix("l")
.withRightColumnPrefix("r")
.withInsertDiffValue("i")
.withChangeDiffValue("c")
.withDeleteDiffValue("d")
.withNochangeDiffValue("n")
.withChangeColumn("changes")
.withDiffMode(DiffMode.Default)
.withSparseMode(true)
.withDefaultComparator(DiffComparators.epsilon(0.001))
.withComparator(DiffComparators.epsilon(0.001), DoubleType)
.withComparator(DiffComparators.epsilon(0.001), "float_column")
```
```python
# Python
from pyspark.sql.types import DoubleType
from gresearch.spark.diff import DiffOptions, DiffMode, DiffComparators
options = DiffOptions() \
.with_diff_column("d") \
.with_left_column_prefix("l") \
.with_right_column_prefix("r") \
.with_insert_diff_value("i") \
.with_change_diff_value("c") \
.with_delete_diff_value("d") \
.with_nochange_diff_value("n") \
.with_change_column("changes") \
.with_diff_mode(DiffMode.Default) \
.with_sparse_mode(True) \
.with_default_comparator(DiffComparators.epsilon(0.01)) \
.with_data_type_comparator(DiffComparators.epsilon(0.001), DoubleType()) \
.with_column_name_comparator(DiffComparators.epsilon(0.001), "float_column")
```
### Diffing Modes
The result of the diff transformation can have the following formats:
- *column by column*: The non-id columns are arranged column by column, i.e. for each non-id column
there are two columns next to each other in the diff result, one from the left
and one from the right dataset. This is useful to easily compare the values
for each column.
- *side by side*: The non-id columns from the left and right dataset are arranged side by side,
i.e. first there are all columns from the left dataset, then from the right one.
This is useful to visually compare the datasets as a whole, especially in conjunction
with the sparse mode.
- *left side*: Only the columns of the left dataset are present in the diff output. This mode
provides the left dataset as is, annotated with diff action and optional changed column names.
- *right side*: Only the columns of the right dataset are present in the diff output. This mode
provides the right dataset as given, as well as the diff action that has been applied to it.
This serves as a patch that, applied to the left dataset, results in the right dataset.
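The output format is selected via `DiffOptions`. For example, a minimal sketch that requests the side-by-side format, given two datasets `left` and `right` and an id column `id`:
```scala
import uk.co.gresearch.spark.diff._
// pick the side-by-side output format instead of the default column-by-column
val sideBySide = DiffOptions.default.withDiffMode(DiffMode.SideBySide)
left.diff(right, sideBySide, "id").show()
```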
With the following two datasets `left` and `right`:
```scala
case class Value(id: Int, value: Option[String], label: Option[String])
val left = Seq(
Value(1, Some("one"), None),
Value(2, Some("two"), Some("number two")),
Value(3, Some("three"), Some("number three")),
Value(4, Some("four"), Some("number four")),
Value(5, Some("five"), Some("number five")),
).toDS
val right = Seq(
Value(1, Some("one"), Some("one")),
Value(2, Some("Two"), Some("number two")),
Value(3, Some("Three"), Some("number Three")),
Value(4, Some("four"), Some("number four")),
Value(6, Some("six"), Some("number six")),
).toDS
```
the diff modes produce the following outputs:
#### Column by Column
|diff |id |left_value|right_value|left_label |right_label |
|:---:|:---:|:--------:|:---------:|:----------:|:----------:|
|C |1 |one |one |*null* |one |
|C |2 |two |Two |number two |number two |
|C |3 |three |Three |number three|number Three|
|N |4 |four |four |number four |number four |
|D |5 |five |null |number five |*null* |
|I |6 |*null* |six |*null* |number six |
#### Side by Side
|diff |id |left_value|left_label |right_value|right_label |
|:---:|:---:|:--------:|:----------:|:---------:|:----------:|
|C |1 |one |*null* |one |one |
|C |2 |two |number two |Two |number two |
|C |3 |three |number three|Three |number Three|
|N |4 |four |number four |four |number four |
|D |5 |five |number five |null |*null* |
|I |6 |*null* |*null* |six |number six |
#### Left Side
|diff |id |value|label |
|:---:|:---:|:---:|:----------:|
|C |1 |one |null |
|C |2 |two |number two |
|C |3 |three|number three|
|N |4 |four |number four |
|D |5 |five |number five |
|I |6 |null |null |
#### Right Side
|diff |id |value|label |
|:---:|:---:|:---:|:----------:|
|C |1 |one |one |
|C |2 |Two |number two |
|C |3 |Three|number Three|
|N |4 |four |number four |
|D |5 |null |null |
|I |6 |six |number six |
### Sparse Mode
The diff modes above can be combined with sparse mode. In sparse mode, only values that differ between
the two datasets are in the diff result, all other values are `null`.
Above [Column by Column](#column-by-column) example would look in sparse mode as follows:
|diff |id |left_value|right_value|left_label |right_label |
|:---:|:---:|:--------:|:---------:|:----------:|:----------:|
|C |1 |null |null |null |one |
|C |2 |two |Two |null |null |
|C |3 |three |Three |number three|number Three|
|N |4 |null |null |null |null |
|D |5 |five |null |number five |null |
|I |6 |null |six |null |number six |
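For reference, the sparse variant above can be produced by enabling sparse mode on the default options; a minimal sketch, assuming the `left` and `right` datasets defined above:
```scala
import uk.co.gresearch.spark.diff._
// keep the default column-by-column format, but null out unchanged values
val sparseOptions = DiffOptions.default.withSparseMode(true)
left.diff(right, sparseOptions, "id").show()
```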
### Comparators (Equality)
Values are compared for equality with the default `<=>` operator, which considers values
equal when both sides are `null`, or both sides are not `null` and equal.
The following alternative comparators are provided:
|Comparator|Description|
|:---------|:----------|
|`DiffComparators.epsilon(epsilon)`|Two values are equal when they are at most `epsilon` apart.<br/><br/>The comparator can be configured to use `epsilon` as an absolute (`.asAbsolute()`) threshold, or as relative (`.asRelative()`) to the larger value. Further, the threshold itself can be considered equal (`.asInclusive()`) or not equal (`.asExclusive()`):<ul><li>`DiffComparators.epsilon(epsilon).asAbsolute().asInclusive()`:<br/>`x` and `y` are equal iff `abs(x - y) ≤ epsilon`</li><li>`DiffComparators.epsilon(epsilon).asAbsolute().asExclusive()`:<br/>`x` and `y` are equal iff `abs(x - y) < epsilon`</li><li>`DiffComparators.epsilon(epsilon).asRelative().asInclusive()`:<br/>`x` and `y` are equal iff `abs(x - y) ≤ epsilon * max(abs(x), abs(y))`</li><li>`DiffComparators.epsilon(epsilon).asRelative().asExclusive()`:<br/>`x` and `y` are equal iff `abs(x - y) < epsilon * max(abs(x), abs(y))`</li></ul>|
|`DiffComparators.string()`|Two `StringType` values are compared while ignoring white space differences. For this comparison, sequences of whitespace are collapsed into a single whitespace, and leading and trailing whitespace is removed. With `DiffComparators.string(false)`, string values are compared with the default comparator.|
|`DiffComparators.duration(duration)`|Two `DateType` or `TimestampType` values are equal when they are at most `duration` apart. That duration is an instance of `java.time.Duration`.<br/><br/>The comparator can be configured to consider `duration` as equal (`.asInclusive()`) or not equal (`.asExclusive()`):<ul><li>`DiffComparators.duration(duration).asInclusive()`:<br/>`x` and `y` are equal iff `x - y ≤ duration`</li><li>`DiffComparators.duration(duration).asExclusive()`:<br/>`x` and `y` are equal iff `x - y < duration`</li></ul>|
|`DiffComparators.map[K,V](keyOrderSensitive)` (Scala only)<br/>`DiffComparators.map(keyType, valueType, keyOrderSensitive)`|Two `Map[K,V]` values are equal when they match in all their keys and values. With `keyOrderSensitive=true`, the order of the keys matters, with `keyOrderSensitive=false` (default), the order of keys is ignored.|
An example:
```scala
val left = Seq((1, 1.0), (2, 2.0), (3, 3.0)).toDF("id", "value")
val right = Seq((1, 1.0), (2, 2.02), (3, 3.05)).toDF("id", "value")
left.diff(right, "id").show()
```
|diff| id|left_value|right_value|
|----|---|----------|-----------|
| N| 1| 1.0| 1.0|
| C| 2| 2.0| 2.02|
| C| 3| 3.0| 3.05|
The second and third rows are considered `"C"`hanged because `2.0 != 2.02` and `3.0 != 3.05`, respectively.
With an inclusive relative epsilon of 1%, `2.0` and `2.02` are considered equal, while `3.0` and `3.05` are still not equal:
```scala
val options = DiffOptions.default
  .withComparator(DiffComparators.epsilon(0.01).asRelative().asInclusive(), DoubleType)
left.diff(right, options, "id").show()
```
|diff| id|left_value|right_value|
|----|---|----------|-----------|
| N| 1| 1.0| 1.0|
| N| 2| 2.0| 2.02|
| C| 3| 3.0| 3.05|
The user can provide custom comparator implementations by implementing `scala.math.Equiv[T]`
or `uk.co.gresearch.spark.diff.DiffComparator`:
```scala
val intEquiv: Equiv[Int] = (x: Int, y: Int) => x == null && y == null || x != null && y != null && x.equals(y)
val anyEquiv: Equiv[Any] = (x: Any, y: Any) => x == null && y == null || x != null && y != null && x.equals(y)
val comparator: DiffComparator = (left: Column, right: Column) => left <=> right
import spark.implicits._
val options = DiffOptions.default
  .withComparator(intEquiv)
  .withComparator(anyEquiv, LongType, DoubleType)
  .withComparator(anyEquiv, "column1", "column2")
  .withComparator(comparator, StringType, FloatType)
  .withComparator(comparator, "column3", "column4")
```
## Methods (Scala)
All Scala methods come in two variants, one without (as shown below) and one with an `options: DiffOptions` argument.
* `def diff(other: Dataset[T], idColumns: String*): DataFrame`
* `def diff[U](other: Dataset[U], idColumns: Seq[String], ignoreColumns: Seq[String]): DataFrame`
* `def diffAs[V](other: Dataset[T], idColumns: String*)(implicit diffEncoder: Encoder[V]): Dataset[V]`
* `def diffAs[U, V](other: Dataset[U], idColumns: Seq[String], ignoreColumns: Seq[String])(implicit diffEncoder: Encoder[V]): Dataset[V]`
* `def diffAs[V](other: Dataset[T], diffEncoder: Encoder[V], idColumns: String*): Dataset[V]`
* `def diffAs[U, V](other: Dataset[U], diffEncoder: Encoder[V], idColumns: Seq[String], ignoreColumns: Seq[String]): Dataset[V]`
* `def diffWith(other: Dataset[T], idColumns: String*): Dataset[(String, T, T)]`
* `def diffWith[U](other: Dataset[U], idColumns: Seq[String], ignoreColumns: Seq[String]): Dataset[(String, T, U)]`
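For illustration, a minimal, hypothetical sketch of the typed variants, assuming the `left`/`right` data from the introduction, `import spark.implicits._`, and case classes defined here purely for this example:
```scala
import org.apache.spark.sql.Dataset
import uk.co.gresearch.spark.diff._
import spark.implicits._
// hypothetical case classes for this example only
case class IdValue(id: Int, value: String)
case class DiffResult(diff: String, id: Int, left_value: String, right_value: String)
// diffWith keeps the full left and right rows next to the diff action
val withRows: Dataset[(String, IdValue, IdValue)] =
  left.as[IdValue].diffWith(right.as[IdValue], "id")
// diffAs encodes the flat diff result into a custom type
val typed: Dataset[DiffResult] =
  left.as[IdValue].diffAs[DiffResult](right.as[IdValue], "id")
```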
## Methods (Java)
* `Dataset<Row> Diff.of[T](Dataset<T> left, Dataset<T> right, String... idColumns)`
* `Dataset<Row> Diff.of[T, U](Dataset<T> left, Dataset<U> right, List<String> idColumns, List<String> ignoreColumns)`
* `Dataset<V> Diff.ofAs[T, V](Dataset<T> left, Dataset<T> right, Encoder<V> diffEncoder, String... idColumns)`
* `Dataset<V> Diff.ofAs[T, U, V](Dataset<T> left, Dataset<U> right, Encoder<V> diffEncoder, List<String> idColumns, List<String> ignoreColumns)`
* `Dataset<Tuple3<String, T, T>> Diff.ofWith[T](Dataset<T> left, Dataset<T> right, String... idColumns)`
* `Dataset<Tuple3<String, T, U>> Diff.ofWith[T, U](Dataset<T> left, Dataset<U> right, List<String> idColumns, List<String> ignoreColumns)`
Given a `DiffOptions`, a customized `Differ` can be instantiated as `Differ differ = new Differ(options)`:
* `Dataset<Row> Differ.diff[T](Dataset<T> left, Dataset<T> right, String... idColumns)`
* `Dataset<Row> Differ.diff[T, U](Dataset<T> left, Dataset<U> right, List<String> idColumns, List<String> ignoreColumns)`
* `Dataset<V> Differ.diffAs[T, V](Dataset<T> left, Dataset<T> right, Encoder<V> diffEncoder, String... idColumns)`
* `Dataset<V> Differ.diffAs[T, U, V](Dataset<T> left, Dataset<U> right, Encoder<V> diffEncoder, List<String> idColumns, List<String> ignoreColumns)`
* `Dataset<Row> Differ.diffWith[T](Dataset<T> left, Dataset<T> right, String... idColumns)`
* `Dataset<Row> Differ.diffWith[T, U](Dataset<T> left, Dataset<U> right, List<String> idColumns, List<String> ignoreColumns)`
## Methods (Python)
* `def diff(self: DataFrame, other: DataFrame, *id_columns: str) -> DataFrame`
* `def diff(self: DataFrame, other: DataFrame, id_columns: List[str], ignore_columns: List[str]) -> DataFrame`
* `def diff(self: DataFrame, other: DataFrame, options: DiffOptions, *id_columns: str) -> DataFrame`
* `def diff(self: DataFrame, other: DataFrame, options: DiffOptions, id_columns: List[str], ignore_columns: List[str]) -> DataFrame`
* `def diffwith(self: DataFrame, other: DataFrame, *id_columns: str) -> DataFrame:`
* `def diffwith(self: DataFrame, other: DataFrame, id_columns: List[str], ignore_columns: List[str]) -> DataFrame`
* `def diffwith(self: DataFrame, other: DataFrame, options: DiffOptions, *id_columns: str) -> DataFrame:`
* `def diffwith(self: DataFrame, other: DataFrame, options: DiffOptions, id_columns: List[str], ignore_columns: List[str]) -> DataFrame`
## Diff Spark application
There is also a Spark application that can be used to create a diff DataFrame. The application reads two DataFrames
`left` and `right` from files or tables, executes the diff transformation and writes the result DataFrame to a file or table.
The Diff app can be run via `spark-submit`:
```shell
# Scala 2.12
spark-submit --packages com.github.scopt:scopt_2.12:4.1.0 spark-extension_2.12-2.7.0-3.4.jar --help
# Scala 2.13
spark-submit --packages com.github.scopt:scopt_2.13:4.1.0 spark-extension_2.13-2.7.0-3.4.jar --help
```
```
Spark Diff app (2.10.0-3.4)
Usage: spark-extension_2.13-2.10.0-3.4.jar [options] left right diff
left file path (requires format option) or table name to read left dataframe
right file path (requires format option) or table name to read right dataframe
diff file path (requires format option) or table name to write diff dataframe
Examples:
- Diff CSV files 'left.csv' and 'right.csv' and write result into CSV file 'diff.csv':
spark-submit --packages com.github.scopt:scopt_2.13:4.1.0 spark-extension_2.13-2.10.0-3.4.jar --format csv left.csv right.csv diff.csv
- Diff CSV file 'left.csv' with Parquet file 'right.parquet' with id column 'id', and write result into Hive table 'diff':
spark-submit --packages com.github.scopt:scopt_2.13:4.1.0 spark-extension_2.13-2.10.0-3.4.jar --left-format csv --right-format parquet --hive --id id left.csv right.parquet diff
Spark session
--master <master> Spark master (local, yarn, ...), not needed with spark-submit
--app-name <app-name> Spark application name
--hive enable Hive support to read from and write to Hive tables
Input and output
-f, --format <format> input and output file format (csv, json, parquet, ...)
--left-format <format> left input file format (csv, json, parquet, ...)
--right-format <format> right input file format (csv, json, parquet, ...)
--output-format <format> output file format (csv, json, parquet, ...)
-s, --schema <schema> input schema
--left-schema <schema> left input schema
--right-schema <schema> right input schema
--left-option:key=val left input option
--right-option:key=val right input option
--output-option:key=val output option
--id <name> id column name
--ignore <name> ignore column name
--save-mode <save-mode> save mode for writing output (Append, Overwrite, ErrorIfExists, Ignore, default ErrorIfExists)
--filter <filter> Filters for rows with these diff actions, with default diffing options use 'N', 'I', 'D', or 'C' (see 'Diffing options' section)
--statistics Only output statistics on how many rows exist per diff action (see 'Diffing options' section)
Diffing options
--diff-column <name> column name for diff column (default 'diff')
--left-prefix <prefix> prefix for left column names (default 'left')
--right-prefix <prefix> prefix for right column names (default 'right')
--insert-value <value> value for insertion (default 'I')
--change-value <value> value for change (default 'C')
--delete-value <value> value for deletion (default 'D')
--no-change-value <val> value for no change (default 'N')
--change-column <name> column name for change column (default is no such column)
--diff-mode <mode> diff mode (ColumnByColumn, SideBySide, LeftSide, RightSide, default ColumnByColumn)
--sparse enable sparse diff
General
--help prints this usage text
```
### Examples
Diff CSV files `left.csv` and `right.csv` and write result into CSV file `diff.csv`:
```shell
spark-submit --packages com.github.scopt:scopt_2.13:4.1.0 spark-extension_2.13-2.7.0-3.4.jar --format csv left.csv right.csv diff.csv
```
Diff CSV file `left.csv` with Parquet file `right.parquet` with id column `id`, and write result into Hive table `diff`:
```shell
spark-submit --packages com.github.scopt:scopt_2.13:4.1.0 spark-extension_2.13-2.7.0-3.4.jar --left-format csv --right-format parquet --hive --id id left.csv right.parquet diff
```
================================================
FILE: GROUPS.md
================================================
# Sorted Groups
Spark provides the ability to group rows by an arbitrary key,
while providing an iterator over each of these groups.
This allows iterating over groups that are too large to fit into memory:
```scala
import org.apache.spark.sql.Dataset
import spark.implicits._
case class Val(id: Int, seq: Int, value: Double)
val ds: Dataset[Val] = Seq(
Val(1, 1, 1.1),
Val(1, 2, 1.2),
Val(1, 3, 1.3),
Val(2, 1, 2.1),
Val(2, 2, 2.2),
Val(2, 3, 2.3),
Val(3, 1, 3.1)
).reverse.toDS().repartition(3).cache()
// order of iterator IS NOT guaranteed
ds.groupByKey(v => v.id)
.flatMapGroups((key, it) => it.zipWithIndex.map(v => (key, v._2, v._1.seq, v._1.value)))
.toDF("key", "index", "seq", "value")
.show(false)
+---+-----+---+-----+
|key|index|seq|value|
+---+-----+---+-----+
|1 |0 |3 |1.3 |
|1 |1 |2 |1.2 |
|1 |2 |1 |1.1 |
|2 |0 |1 |2.1 |
|2 |1 |3 |2.3 |
|2 |2 |2 |2.2 |
|3 |0 |1 |3.1 |
+---+-----+---+-----+
```
However, we have no control over the order of the group iterators.
If we want the iterators to be ordered according to `seq`, we can do the following:
```scala
import uk.co.gresearch.spark._
// the group key $"id" needs an ordering
implicit val ordering: Ordering.Int.type = Ordering.Int
// order of iterator IS guaranteed
ds.groupBySorted($"id")($"seq")
.flatMapSortedGroups((key, it) => it.zipWithIndex.map(v => (key, v._2, v._1.seq, v._1.value)))
.toDF("key", "index", "seq", "value")
.show(false)
+---+-----+---+-----+
|key|index|seq|value|
+---+-----+---+-----+
|1 |0 |1 |1.1 |
|1 |1 |2 |1.2 |
|1 |2 |3 |1.3 |
|2 |0 |1 |2.1 |
|2 |1 |2 |2.2 |
|2 |2 |3 |2.3 |
|3 |0 |1 |3.1 |
+---+-----+---+-----+
```
Now, the iterators are ordered according to `seq`, as shown by the value of `index`
generated by `it.zipWithIndex`.
Instead of column expressions, we can also use lambdas to define group key and group order:
```scala
ds.groupByKeySorted(v => v.id)(v => v.seq)
.flatMapSortedGroups((key, it) => it.zipWithIndex.map(v => (key, v._2, v._1.seq, v._1.value)))
.toDF("key", "index", "seq", "value")
.show(false)
```
**Note:** Using lambdas here hides from Spark which columns we use for grouping and sorting.
Query optimization cannot improve partitioning and sorting in this case. Use column expressions when possible.
================================================
FILE: HISTOGRAM.md
================================================
# Histogram
For a table `df` like
|user |score|
|:-----:|:---:|
|Alice |101 |
|Alice |221 |
|Alice |211 |
|Alice |176 |
|Bob |276 |
|Bob |232 |
|Bob |258 |
|Charlie|221 |
you can compute the histogram for each user
|user |≤100 |≤200 |>200 |
|:-----:|:---:|:---:|:---:|
|Alice |0 |2 |2 |
|Bob |0 |0 |3 |
|Charlie|0 |0 |1 |
as follows:
df.withColumn("≤100", when($"score" <= 100, 1).otherwise(0))
.withColumn("≤200", when($"score" > 100 && $"score" <= 200, 1).otherwise(0))
.withColumn(">200", when($"score" > 200, 1).otherwise(0))
.groupBy($"user")
.agg(
sum($"≤100").as("≤100"),
sum($"≤200").as("≤200"),
sum($">200").as(">200")
)
.orderBy($"user")
Equivalent to that query is:
```scala
import uk.co.gresearch.spark._
df.histogram(Seq(100, 200), $"score", $"user").orderBy($"user")
```
The first argument is a sequence of thresholds, the second argument provides the value column.
The subsequent arguments refer to the aggregation columns (`groupBy`). Only aggregation columns
will be in the result DataFrame.
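For example, several aggregation columns can be given. A minimal sketch, assuming `df` also has a hypothetical grouping column `team` that is not part of the example table above:
```scala
// a sketch: histogram of "score" per ("user", "team"), assuming df has a "team" column
df.histogram(Seq(100, 200), $"score", $"user", $"team").orderBy($"user", $"team")
```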
In Java, call:
```java
import uk.co.gresearch.spark.Histogram;
Histogram.of(df, Arrays.asList(100, 200), new Column("score"), new Column("user")).orderBy("user");
```
In Python, call:
```python
import gresearch.spark
df.histogram([100, 200], 'score', 'user').orderBy('user')
```
Note that this feature is not supported in Python when connected with a [Spark Connect server](README.md#spark-connect-server).
================================================
FILE: LICENSE
================================================
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
================================================
FILE: MAINTAINERS.md
================================================
## Current maintainers of the project
| Maintainer | GitHub ID |
| ---------------------- | ------------------------------------------------------- |
| Enrico Minack | [EnricoMi](https://github.com/EnricoMi) |
================================================
FILE: PARQUET.md
================================================
# Parquet Metadata
The structure of Parquet files (the metadata, not the data stored in Parquet) can be inspected similarly to [parquet-tools](https://pypi.org/project/parquet-tools/)
or [parquet-cli](https://pypi.org/project/parquet-cli/)
by reading from a simple Spark data source.
Parquet metadata can be read on [file level](#parquet-file-metadata),
[schema level](#parquet-file-schema),
[row group level](#parquet-block--rowgroup-metadata),
[column chunk level](#parquet-block-column-metadata) and
[Spark Parquet partition level](#parquet-partition-metadata).
Multiple files can be inspected at once.
Any location that can be read by Spark (`spark.read.parquet(…)`) can be inspected.
This means the path can point to a single Parquet file, a directory with Parquet files,
or multiple paths separated by a comma (`,`). Paths can contain wildcards like `*`.
Multiple files will be inspected in parallel and distributed by Spark.
No actual rows or values will be read from the Parquet files, only metadata, which is very fast.
This allows inspecting Parquet files that have different schemata with one `spark.read` operation.
First, import the new Parquet metadata data sources:
```scala
// Scala
import uk.co.gresearch.spark.parquet._
```
```python
# Python
import gresearch.spark.parquet
```
Then, the following metadata become available:
## Parquet file metadata
Read the metadata of Parquet files into a Dataframe:
```scala
// Scala
spark.read.parquetMetadata("/path/to/parquet").show()
```
```python
# Python
spark.read.parquet_metadata("/path/to/parquet").show()
```
```
+-------------+------+---------------+-----------------+----+-------+------+-----+--------------------+--------------------+-----------+--------------------+
| filename|blocks|compressedBytes|uncompressedBytes|rows|columns|values|nulls| createdBy| schema| encryption| keyValues|
+-------------+------+---------------+-----------------+----+-------+------+-----+--------------------+--------------------+-----------+--------------------+
|file1.parquet| 1| 1268| 1652| 100| 2| 200| 0|parquet-mr versio...|message spark_sch...|UNENCRYPTED|{org.apache.spark...|
|file2.parquet| 2| 2539| 3302| 200| 2| 400| 0|parquet-mr versio...|message spark_sch...|UNENCRYPTED|{org.apache.spark...|
+-------------+------+---------------+-----------------+----+-------+------+-----+--------------------+--------------------+-----------+--------------------+
```
The Dataframe provides the following per-file information:
|column |type | description |
|:-----------------|:----:|:-------------------------------------------------------------------------------|
|filename |string| The Parquet file name |
|blocks |int | Number of blocks / RowGroups in the Parquet file |
|compressedBytes |long | Number of compressed bytes of all blocks |
|uncompressedBytes |long | Number of uncompressed bytes of all blocks |
|rows |long | Number of rows in the file |
|columns |int | Number of columns in the file |
|values |long | Number of values in the file |
|nulls |long | Number of null values in the file |
|createdBy |string| The createdBy string of the Parquet file, e.g. library used to write the file |
|schema |string| The schema |
|encryption |string| The encryption (requires `org.apache.parquet:parquet-hadoop:1.12.4` and above) |
|keyValues |string-to-string map| Key-value data of the file |
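As noted above, several locations can be inspected in one call; paths may be separated by a comma and may contain wildcards. A minimal sketch, with hypothetical paths:
```scala
// Scala – a sketch: inspect two hypothetical locations in one call; "*" matches multiple files
spark.read.parquetMetadata("/data/set1/*.parquet,/data/set2/").show()
```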
## Parquet file schema
Read the schema of Parquet files into a Dataframe:
```scala
// Scala
spark.read.parquetSchema("/path/to/parquet").show()
```
```python
# Python
spark.read.parquet_schema("/path/to/parquet").show()
```
```
+------------+----------+------------------+----------+------+------+----------------+--------------------+-----------+-------------+------------------+------------------+------------------+
| filename|columnName| columnPath|repetition| type|length| originalType| logicalType|isPrimitive|primitiveType| primitiveOrder|maxDefinitionLevel|maxRepetitionLevel|
+------------+----------+------------------+----------+------+------+----------------+--------------------+-----------+-------------+------------------+------------------+------------------+
|file.parquet| a| [a]| REQUIRED| INT64| 0| NULL| NULL| true| INT64|TYPE_DEFINED_ORDER| 0| 0|
|file.parquet| x| [b, x]| REQUIRED| INT32| 0| NULL| NULL| true| INT32|TYPE_DEFINED_ORDER| 1| 0|
|file.parquet| y| [b, y]| REQUIRED|DOUBLE| 0| NULL| NULL| true| DOUBLE|TYPE_DEFINED_ORDER| 1| 0|
|file.parquet| z| [b, z]| OPTIONAL| INT64| 0|TIMESTAMP_MICROS|TIMESTAMP(MICROS,...| true| INT64|TYPE_DEFINED_ORDER| 2| 0|
|file.parquet| element|[c, list, element]| OPTIONAL|BINARY| 0| UTF8| STRING| true| BINARY|TYPE_DEFINED_ORDER| 3| 1|
+------------+----------+------------------+----------+------+------+----------------+--------------------+-----------+-------------+------------------+------------------+------------------+
```
The Dataframe provides the following per-column information:
|column | type | description |
|:-----------------|:------------:|:----------------------------------------------------------------------------------|
|filename | string | The Parquet file name |
|columnName | string | The column name |
|columnPath | string array | The column path |
|repetition | string | The repetition |
|type | string | The data type |
|length | int | The length of the type |
|originalType      | string       | The original type (requires `org.apache.parquet:parquet-hadoop:1.11.0` and above)  |
|logicalType       | string       | The logical type                                                                    |
|isPrimitive | boolean | True if type is primitive |
|primitiveType | string | The primitive type |
|primitiveOrder | string | The order of the primitive type |
|maxDefinitionLevel| int | The max definition level |
|maxRepetitionLevel| int | The max repetition level |
## Parquet block / RowGroup metadata
Read the metadata of Parquet blocks / RowGroups into a Dataframe:
```scala
// Scala
spark.read.parquetBlocks("/path/to/parquet").show()
```
```python
# Python
spark.read.parquet_blocks("/path/to/parquet").show()
```
```
+-------------+-----+----------+---------------+-----------------+----+-------+------+-----+
| filename|block|blockStart|compressedBytes|uncompressedBytes|rows|columns|values|nulls|
+-------------+-----+----------+---------------+-----------------+----+-------+------+-----+
|file1.parquet| 1| 4| 1269| 1651| 100| 2| 200| 0|
|file2.parquet| 1| 4| 1268| 1652| 100| 2| 200| 0|
|file2.parquet| 2| 1273| 1270| 1651| 100| 2| 200| 0|
+-------------+-----+----------+---------------+-----------------+----+-------+------+-----+
```
|column |type |description |
|:-----------------|:----:|:----------------------------------------------|
|filename |string|The Parquet file name |
|block |int |Block / RowGroup number starting at 1 |
|blockStart |long |Start position of the block in the Parquet file|
|compressedBytes |long |Number of compressed bytes in block |
|uncompressedBytes |long |Number of uncompressed bytes in block |
|rows |long |Number of rows in block |
|columns |int |Number of columns in block |
|values |long |Number of values in block |
|nulls |long |Number of null values in block |
## Parquet block column metadata
Read the metadata of Parquet block columns into a Dataframe:
```scala
// Scala
spark.read.parquetBlockColumns("/path/to/parquet").show()
```
```python
# Python
spark.read.parquet_block_columns("/path/to/parquet").show()
```
```
+-------------+-----+------+------+-------------------+-------------------+--------------------+------------------+-----------+---------------+-----------------+------+-----+
| filename|block|column| codec| type| encodings| minValue| maxValue|columnStart|compressedBytes|uncompressedBytes|values|nulls|
+-------------+-----+------+------+-------------------+-------------------+--------------------+------------------+-----------+---------------+-----------------+------+-----+
|file1.parquet| 1| [id]|SNAPPY| required int64 id|[BIT_PACKED, PLAIN]| 0| 99| 4| 437| 826| 100| 0|
|file1.parquet| 1| [val]|SNAPPY|required double val|[BIT_PACKED, PLAIN]|0.005067503372006343|0.9973357672164814| 441| 831| 826| 100| 0|
|file2.parquet| 1| [id]|SNAPPY| required int64 id|[BIT_PACKED, PLAIN]| 100| 199| 4| 438| 825| 100| 0|
|file2.parquet| 1| [val]|SNAPPY|required double val|[BIT_PACKED, PLAIN]|0.010617521596503865| 0.999189783846449| 442| 831| 826| 100| 0|
|file2.parquet| 2| [id]|SNAPPY| required int64 id|[BIT_PACKED, PLAIN]| 200| 299| 1273| 440| 826| 100| 0|
|file2.parquet| 2| [val]|SNAPPY|required double val|[BIT_PACKED, PLAIN]|0.011277044401634018| 0.970525681750662| 1713| 830| 825| 100| 0|
+-------------+-----+------+------+-------------------+-------------------+--------------------+------------------+-----------+---------------+-----------------+------+-----+
```
| column | type | description |
|:------------------|:-------------:|:--------------------------------------------------------------------------------------------------|
| filename | string | The Parquet file name |
| block | int | Block / RowGroup number starting at 1 |
| column | array<string> | Block / RowGroup column name |
| codec              | string        | The codec used to compress the block column values                                                 |
| type | string | The data type of the block column |
| encodings | array<string> | Encodings of the block column |
| isEncrypted | boolean | Whether block column is encrypted (requires `org.apache.parquet:parquet-hadoop:1.12.3` and above) |
| minValue | string | Minimum value of this column in this block |
| maxValue | string | Maximum value of this column in this block |
| columnStart | long | Start position of the block column in the Parquet file |
| compressedBytes | long | Number of compressed bytes of this block column |
| uncompressedBytes | long | Number of uncompressed bytes of this block column |
| values | long | Number of values in this block column |
| nulls | long | Number of null values in this block column |
## Parquet partition metadata
Read the metadata of how Spark partitions Parquet files into a Dataframe:
```scala
// Scala
spark.read.parquetPartitions("/path/to/parquet").show()
```
```python
# Python
spark.read.parquet_partitions("/path/to/parquet").show()
```
```
+---------+-----+----+------+------+---------------+-----------------+----+-------+------+-----+-------------+----------+
|partition|start| end|length|blocks|compressedBytes|uncompressedBytes|rows|columns|values|nulls| filename|fileLength|
+---------+-----+----+------+------+---------------+-----------------+----+-------+------+-----+-------------+----------+
| 1| 0|1024| 1024| 1| 1268| 1652| 100| 2| 200| 0|file1.parquet| 1930|
| 2| 1024|1930| 906| 0| 0| 0| 0| 0| 0| 0|file1.parquet| 1930|
| 3| 0|1024| 1024| 1| 1269| 1651| 100| 2| 200| 0|file2.parquet| 3493|
| 4| 1024|2048| 1024| 1| 1270| 1651| 100| 2| 200| 0|file2.parquet| 3493|
| 5| 2048|3072| 1024| 0| 0| 0| 0| 0| 0| 0|file2.parquet| 3493|
| 6| 3072|3493| 421| 0| 0| 0| 0| 0| 0| 0|file2.parquet| 3493|
+---------+-----+----+------+------+---------------+-----------------+----+-------+------+-----+-------------+----------+
```
|column |type |description |
|:----------------|:----:|:---------------------------------------------------------|
|partition |int |The Spark partition id |
|start |long |The start position of the partition |
|end |long |The end position of the partition |
|length |long |The length of the partition |
|blocks |int |The number of Parquet blocks / RowGroups in this partition|
|compressedBytes |long |The number of compressed bytes in this partition |
|uncompressedBytes|long |The number of uncompressed bytes in this partition |
|rows |long |The number of rows in this partition |
|columns |int |The number of columns in this partition |
|values |long |The number of values in this partition |
|nulls |long |The number of null values in this partition |
|filename |string|The Parquet file name |
|fileLength |long |The length of the Parquet file |
## Performance
Retrieving Parquet metadata is parallelized and distributed by Spark. The result Dataframe
has as many partitions as there are Parquet files in the given `path`, but at most
`spark.sparkContext.defaultParallelism` partitions.
Each result partition reads Parquet metadata from its Parquet files sequentially,
while partitions are executed in parallel (depending on the number of Spark cores of your Spark job).
You can control the number of partitions via the `parallelism` parameter:
```scala
// Scala
spark.read.parquetMetadata(100, "/path/to/parquet")
spark.read.parquetSchema(100, "/path/to/parquet")
spark.read.parquetBlocks(100, "/path/to/parquet")
spark.read.parquetBlockColumns(100, "/path/to/parquet")
spark.read.parquetPartitions(100, "/path/to/parquet")
```
```python
# Python
spark.read.parquet_metadata("/path/to/parquet", parallelism=100)
spark.read.parquet_schema("/path/to/parquet", parallelism=100)
spark.read.parquet_blocks("/path/to/parquet", parallelism=100)
spark.read.parquet_block_columns("/path/to/parquet", parallelism=100)
spark.read.parquet_partitions("/path/to/parquet", parallelism=100)
```
## Encryption
Reading [encrypted Parquet is supported](https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#columnar-encryption).
Files encrypted with [plaintext footer](https://github.com/apache/parquet-format/blob/master/Encryption.md#55-plaintext-footer-mode)
can be read without any encryption keys, while encrypted Parquet metadata are then shown as `NULL` values in the result Dataframe.
Encrypted Parquet files with an encrypted footer require only the footer encryption key. No column encryption keys are needed.
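A minimal sketch for reading metadata of files with an encrypted footer, assuming the key-management configuration from the Spark columnar encryption documentation (the `InMemoryKMS` mock class and the key value are placeholders, not part of this library):
```scala
// Scala – a sketch: make only the footer key available, then read the metadata
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("parquet.crypto.factory.class",
  "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
hadoopConf.set("parquet.encryption.kms.client.class",
  "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
hadoopConf.set("parquet.encryption.key.list", "footerKey:AAECAwQFBgcICQoLDA0ODw==")
spark.read.parquetMetadata("/path/to/encrypted.parquet").show()
```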
## Known Issues
Note that this feature is not supported in Python when connected with a [Spark Connect server](README.md#spark-connect-server).
================================================
FILE: PARTITIONING.md
================================================
# Partitioned Writing
If you have ever used `Dataset[T].write.partitionBy`, here is how you can minimize the number of
written files and obtain same-size files.
Spark has two different concepts both referred to as partitioning. Central to Spark is the
concept of how a `Dataset[T]` is split into partitions where a Spark worker processes
a single partition at a time. This is the fundamental concept of how Spark scales with data.
When writing a `Dataset` `ds` to file-based storage, that output file is actually a directory:
<!--
import java.sql.Timestamp
import java.sql.Timestamp
case class Value(id: Int, ts: Timestamp, property: String, value: String)
val ds = Seq(
Value(1, Timestamp.valueOf("2020-07-01 12:00:00"), "label", "one"),
Value(1, Timestamp.valueOf("2020-07-02 12:00:00"), "descr", "number one"),
Value(1, Timestamp.valueOf("2020-07-03 12:00:00"), "label", "ONE"),
Value(2, Timestamp.valueOf("2020-07-01 12:00:00"), "label", "two"),
Value(2, Timestamp.valueOf("2020-07-03 12:00:00"), "label", "TWO"),
Value(2, Timestamp.valueOf("2020-07-04 12:00:00"), "descr", "number two"),
Value(3, Timestamp.valueOf("2020-07-03 12:00:00"), "label", "THREE"),
Value(3, Timestamp.valueOf("2020-07-03 12:00:00"), "descr", "number three"),
Value(4, Timestamp.valueOf("2020-07-01 12:00:00"), "label", "four"),
Value(4, Timestamp.valueOf("2020-07-03 12:00:00"), "descr", "number four"),
Value(5, Timestamp.valueOf("2020-07-01 12:00:00"), "label", "five"),
Value(5, Timestamp.valueOf("2020-07-03 12:00:00"), "descr", "number five"),
Value(6, Timestamp.valueOf("2020-07-01 12:00:00"), "label", "six"),
Value(6, Timestamp.valueOf("2020-07-01 12:00:00"), "descr", "number six"),
).toDS()
-->
```scala
ds.write.csv("file.csv")
```
The directory structure looks like:
```
file.csv
file.csv/part-00000-7d34816f-bb53-4f44-ab9d-a62d570e5de0-c000.csv
file.csv/part-00001-7d34816f-bb53-4f44-ab9d-a62d570e5de0-c000.csv
file.csv/part-00002-7d34816f-bb53-4f44-ab9d-a62d570e5de0-c000.csv
file.csv/part-00003-7d34816f-bb53-4f44-ab9d-a62d570e5de0-c000.csv
file.csv/part-00004-7d34816f-bb53-4f44-ab9d-a62d570e5de0-c000.csv
file.csv/_SUCCESS
```
When writing, the output can be partitioned by one or more columns of the `Dataset` via `partitionBy`.
For each distinct value `value` in that column `col`, an individual sub-directory is created in your output path.
The name is of the format `col=value`. Inside the sub-directory, multiple partition files exist,
all containing only data where column `col` has value `value`. To remove redundancy, those
files do not contain that column anymore.
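A minimal sketch of such a partitioned write, using the `property` column of the example `Dataset` from above:
```scala
// write the example Dataset partitioned by the "property" column
ds.write
  .partitionBy("property")
  .csv("file.csv")
```
The output path then contains one sub-directory per distinct value of `property`: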
```
file.csv/property=descr/part-00001-8eb44de1-2c33-4f95-a1be-8d1b4e35eb4a.c000.csv
file.csv/property=descr/part-00002-8eb44de1-2c33-4f95-a1be-8d1b4e35eb4a.c000.csv
file.csv/property=descr/part-00003-8eb44de1-2c33-4f95-a1be-8d1b4e35eb4a.c000.csv
file.csv/property=descr/part-00004-8eb44de1-2c33-4f95-a1be-8d1b4e35eb4a.c000.csv
file.csv/property=label/part-00001-8eb44de1-2c33-4f95-a1be-8d1b4e35eb4a.c000.csv
file.csv/property=label/part-00002-8eb44de1-2c33-4f95-a1be-8d1b4e35eb4a.c000.csv
file.csv/property=label/part-00003-8eb44de1-2c33-4f95-a1be-8d1b4e35eb4a.c000.csv
file.csv/property=label/part-00004-8eb44de1-2c33-4f95-a1be-8d1b4e35eb4a.c000.csv
file.csv/_SUCCESS
```
Data that is mis-organized when written ends up with the same number of files
in each of the sub-directories, even if some sub-directories contain only a fraction of
the number of rows of others. What you would like instead is fewer files in smaller
and more files in larger partition sub-directories. Further, all files should have
roughly the same number of rows.
For this, you have to first range partition the `Dataset` according to your partition columns.
```scala
ds.repartitionByRange($"property", $"id")
  .write
  .partitionBy("property")
  .csv("file.csv")
```
This organizes the data optimally for partitioned writing by column `property`:
```
file.csv/property=descr/part-00000-6317db5e-5161-41f1-8227-ffeaf06a3e41.c000.csv
file.csv/property=descr/part-00001-6317db5e-5161-41f1-8227-ffeaf06a3e41.c000.csv
file.csv/property=label/part-00002-6317db5e-5161-41f1-8227-ffeaf06a3e41.c000.csv
file.csv/property=label/part-00003-6317db5e-5161-41f1-8227-ffeaf06a3e41.c000.csv
file.csv/property=label/part-00004-6317db5e-5161-41f1-8227-ffeaf06a3e41.c000.csv
file.csv/_SUCCESS
```
This brings all rows with the same values in the `property` and `id` columns into the same file.
If you need each file to further be sorted by additional columns, e.g. `ts`, then you can do this with `sortWithinPartitions`.
```scala
ds.repartitionByRange($"property", $"id")
  .sortWithinPartitions($"property", $"id", $"ts")
  .cache // this is needed for Spark 3.0 to 3.3 with AQE enabled: SPARK-40588
  .write
  .partitionBy("property")
  .csv("file.csv")
```
Sometimes you want to write-partition by some expression that is not a column of your data,
e.g. the date-representation of the `ts` column.
ds.withColumn("date", $"ts".cast(DateType))
.repartitionByRange($"date", $"id")
.sortWithinPartitions($"date", $"id", $"ts")
.cache // this is needed for Spark 3.0 to 3.3 with AQE enabled: SPARK-40588
.write
.partitionBy("date")
.csv("file.csv")
All of the above constructs can be replaced with a single meaningful operation:
```scala
ds.writePartitionedBy(Seq($"ts".cast(DateType).as("date")), Seq($"id"), Seq($"ts"))
  .csv("file.csv")
```
For Spark 3.0 to 3.3 with AQE enabled (see [SPARK-40588](https://issues.apache.org/jira/browse/SPARK-40588)),
`writePartitionedBy` has to cache an internally created DataFrame. This can be unpersisted after writing
is finished. Provide an `UnpersistHandle` for this purpose:
```scala
val unpersist = UnpersistHandle()
ds.writePartitionedBy(…, unpersistHandle = Some(unpersist))
  .csv("file.csv")
unpersist()
```
More details about this issue can be found [here](https://www.gresearch.co.uk/blog/article/guaranteeing-in-partition-order-for-partitioned-writing-in-apache-spark/).
<!--
# Other Approaches
problems with `repartition()` instead of `repartitionByRange()`
problems with `repartitionByRange(cols).write.partitionBy(cols)`
-->
================================================
FILE: PYSPARK-DEPS.md
================================================
# PySpark dependencies
Using PySpark on a cluster requires all cluster nodes to have those Python packages installed that are required by the PySpark job.
Such a deployment can be cumbersome, especially when running in an interactive notebook.
The `spark-extension` package allows installing Python packages programmatically by the PySpark application itself (PySpark ≥ 3.1.0).
These packages are only accessible by that PySpark application, and they are removed on calling `spark.stop()`.
Either install the `spark-extension` Maven package, or the `pyspark-extension` PyPi package (on the driver only),
as described [here](README.md#using-spark-extension).
## Installing packages with `pip`
Python packages can be installed with `pip` as follows:
```python
# noinspection PyUnresolvedReferences
from gresearch.spark import *
spark.install_pip_package("pandas", "pyarrow")
```
The above example installs the PIP packages `pandas` and `pyarrow` via `pip`. The method `install_pip_package` takes any `pip` command line arguments:
```python
# install packages with version specs
spark.install_pip_package("pandas==1.4.3", "pyarrow~=8.0.0")
# install packages from package sources (e.g. git clone https://github.com/pandas-dev/pandas.git)
spark.install_pip_package("./pandas/")
# install packages from git repo
spark.install_pip_package("git+https://github.com/pandas-dev/pandas.git@main")
# use a pip cache directory to cache downloaded and built whl files
spark.install_pip_package("pandas", "pyarrow", "--cache-dir", "/home/user/.cache/pip")
# use an alternative index url (other than https://pypi.org/simple)
spark.install_pip_package("pandas", "pyarrow", "--index-url", "https://artifacts.company.com/pypi/simple")
# install pip packages quietly (only disables output of PIP)
spark.install_pip_package("pandas", "pyarrow", "--quiet")
```
## Installing Python projects with Poetry
Python projects can be installed from sources, including their dependencies, using [Poetry](https://python-poetry.org/):
```python
# noinspection PyUnresolvedReferences
from gresearch.spark import *
spark.install_poetry_project("../my-poetry-project/", poetry_python="../venv-poetry/bin/python")
```
## Example
This example uses `install_pip_package` in a Spark standalone cluster.
First, check out the example code:
```shell
git clone https://github.com/G-Research/spark-extension.git
cd spark-extension/examples/python-deps
```
Build a Docker image based on the official Spark release:
```shell
docker build -t spark-extension-example-docker .
```
Start the example Spark standalone cluster consisting of a Spark master and one worker:
```shell
docker compose -f docker-compose.yml up -d
```
Run the `example.py` Spark application on the example cluster:
```shell
docker exec spark-master spark-submit --master spark://master:7077 --packages uk.co.gresearch.spark:spark-extension_2.12:2.15.0-3.5 /example/example.py
```
The `--packages uk.co.gresearch.spark:spark-extension_2.12:2.15.0-3.5` argument
tells `spark-submit` to add the `spark-extension` Maven package to the Spark job.
Alternatively, install the `pyspark-extension` PyPi package via `pip install` and remove the `--packages` argument from `spark-submit`:
```shell
docker exec spark-master pip install --user pyspark_extension==2.15.0.3.5
docker exec spark-master spark-submit --master spark://master:7077 /example/example.py
```
This output proves that PySpark could call into the function `func`, which only works when Pandas and PyArrow are installed:
```
+---+
| id|
+---+
| 0|
| 1|
| 2|
+---+
```
Test that `spark.install_pip_package("pandas", "pyarrow")` is really required by this example by removing this line from `example.py` …
```diff
from pyspark.sql import SparkSession
def main():
spark = SparkSession.builder.appName("spark_app").getOrCreate()
def func(df):
return df
from gresearch.spark import install_pip_package
- spark.install_pip_package("pandas", "pyarrow")
spark.range(0, 3, 1, 5).mapInPandas(func, "id long").show()
if __name__ == "__main__":
main()
```
… and running the `spark-submit` command again. The example does not work anymore,
because the Pandas and PyArrow packages are missing from the driver:
```
Traceback (most recent call last):
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/pandas/utils.py", line 27, in require_minimum_pandas_version
ModuleNotFoundError: No module named 'pandas'
```
Finally, shutdown the example cluster:
```shell
docker compose -f docker-compose.yml down
```
## Known Issues
Note that this feature is not supported in Python when connected with a [Spark Connect server](README.md#spark-connect-server).
================================================
FILE: README.md
================================================
# Spark Extension
This project provides extensions to the [Apache Spark project](https://spark.apache.org/) in Scala and Python:
**[Diff](DIFF.md):** A `diff` transformation and application for `Dataset`s that computes the differences between
two datasets, i.e. which rows to _add_, _delete_ or _change_ to get from one dataset to the other.
**[SortedGroups](GROUPS.md):** A `groupByKey` transformation that groups rows by a key while providing
a **sorted** iterator for each group. Similar to `Dataset.groupByKey.flatMapGroups`, but with order guarantees
for the iterator.
**[Histogram](HISTOGRAM.md) [<sup>[*]</sup>](#spark-connect-server):** A `histogram` transformation that computes the histogram DataFrame for a value column.
**[Global Row Number](ROW_NUMBER.md) [<sup>[*]</sup>](#spark-connect-server):** A `withRowNumbers` transformation that provides the global row number w.r.t.
the current order of the Dataset, or any given order. In contrast to the existing SQL function `row_number`, which
requires a window spec, this transformation provides the row number across the entire Dataset without scaling problems.
**[Partitioned Writing](PARTITIONING.md):** The `writePartitionedBy` action writes your `Dataset` partitioned and
efficiently laid out with a single operation.
**[Inspect Parquet files](PARQUET.md) [<sup>[*]</sup>](#spark-connect-server):** The structure of Parquet files (the metadata, not the data stored in Parquet) can be inspected similarly to [parquet-tools](https://pypi.org/project/parquet-tools/)
or [parquet-cli](https://pypi.org/project/parquet-cli/) by reading from a simple Spark data source.
This simplifies identifying why some Parquet files cannot be split by Spark into scalable partitions.
**[Install Python packages into PySpark job](PYSPARK-DEPS.md) [<sup>[*]</sup>](#spark-connect-server):** Install Python dependencies via PIP or Poetry programmatically into your running PySpark job (PySpark ≥ 3.1.0):
```python
# noinspection PyUnresolvedReferences
from gresearch.spark import *
# using PIP
spark.install_pip_package("pandas==1.4.3", "pyarrow")
spark.install_pip_package("-r", "requirements.txt")
# using Poetry
spark.install_poetry_project("../my-poetry-project/", poetry_python="../venv-poetry/bin/python")
```
**[Fluent method call](CONDITIONAL.md):** `T.call(transformation: T => R): R`: Turns a transformation `T => R`
that is not part of `T` into a fluent method call on `T`. This allows writing fluent code like:
```scala
import uk.co.gresearch._
i.doThis()
.doThat()
.call(transformation)
.doMore()
```
**[Fluent conditional method call](CONDITIONAL.md):** `T.when(condition: Boolean).call(transformation: T => T): T`:
Perform a transformation fluently only if the given condition is true.
This allows writing fluent code like:
```scala
import uk.co.gresearch._
i.doThis()
.doThat()
.when(condition).call(transformation)
.doMore()
```
**[Shortcut for groupBy.as](https://github.com/G-Research/spark-extension/pull/213#issue-2032837105)**: Calling `Dataset.groupBy(Column*).as[K, T]`
should be preferred over calling `Dataset.groupByKey(V => K)` whenever possible. The former allows Catalyst to exploit
existing partitioning and ordering of the Dataset, while the latter hides from Catalyst which columns are used to create the keys.
This can have a significant performance penalty.
<details>
<summary>Details:</summary>
The new column-expression-based `groupByKey[K](Column*)` method makes it easier to group by a column expression key. Instead of
```scala
ds.groupBy($"id").as[Int, V]
```
use:
```scala
ds.groupByKey[Int]($"id")
```
</details>
**Backticks:** `backticks(string: String, strings: String*): String`: Encloses the given column name with backticks (`` ` ``) when needed.
This is a handy way to ensure column names with special characters like dots (`.`) work with `col()` or `select()`.
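A minimal sketch, assuming a DataFrame `df` with a column literally named `a.column`:
```scala
// Scala – a sketch: the dot in "a.column" requires backticks when used with col()
import org.apache.spark.sql.functions.col
import uk.co.gresearch.spark._
df.select(col(backticks("a.column")))
```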
**Count null values:** `count_null(e: Column)`: an aggregation function like `count` that counts null values in column `e`.
This is equivalent to calling `count(when(e.isNull, lit(1)))`.
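A minimal sketch, assuming a DataFrame `df` with columns `id` and `value`:
```scala
// Scala – a sketch: per id, count all values and the null values of column "value"
import org.apache.spark.sql.functions.{col, count}
import uk.co.gresearch.spark._
df.groupBy(col("id")).agg(count(col("value")), count_null(col("value")))
```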
**.Net DateTime.Ticks[<sup>[*]</sup>](#spark-connect-server):** Convert .Net (C#, F#, Visual Basic) `DateTime.Ticks` into Spark timestamps, seconds and nanoseconds.
<details>
<summary>Available methods:</summary>
```scala
// Scala
dotNetTicksToTimestamp(Column): Column // returns timestamp as TimestampType
dotNetTicksToUnixEpoch(Column): Column // returns Unix epoch seconds as DecimalType
dotNetTicksToUnixEpochNanos(Column): Column // returns Unix epoch nanoseconds as LongType
```
The reverse is provided by (all return `LongType` .Net ticks):
```scala
// Scala
timestampToDotNetTicks(Column): Column
unixEpochToDotNetTicks(Column): Column
unixEpochNanosToDotNetTicks(Column): Column
```
These methods are also available in Python:
```python
# Python
dotnet_ticks_to_timestamp(column_or_name) # returns timestamp as TimestampType
dotnet_ticks_to_unix_epoch(column_or_name) # returns Unix epoch seconds as DecimalType
dotnet_ticks_to_unix_epoch_nanos(column_or_name) # returns Unix epoch nanoseconds as LongType
timestamp_to_dotnet_ticks(column_or_name)
unix_epoch_to_dotnet_ticks(column_or_name)
unix_epoch_nanos_to_dotnet_ticks(column_or_name)
```
</details>
**Spark temporary directory[<sup>[*]</sup>](#spark-connect-server)**: Create a temporary directory that will be removed on Spark application shutdown.
<details>
<summary>Examples:</summary>
Scala:
```scala
import uk.co.gresearch.spark.createTemporaryDir
val dir = createTemporaryDir("prefix")
```
Python:
```python
# noinspection PyUnresolvedReferences
from gresearch.spark import *
dir = spark.create_temporary_dir("prefix")
```
</details>
**Spark job description[<sup>[*]</sup>](#spark-connect-server):** Set Spark job description for all Spark jobs within a context.
<details>
<summary>Examples:</summary>
```scala
import uk.co.gresearch.spark._
implicit val session: SparkSession = spark
withJobDescription("parquet file") {
val df = spark.read.parquet("data.parquet")
val count = appendJobDescription("count") {
df.count
}
appendJobDescription("write") {
df.write.csv("data.csv")
}
}
```
| Without job description | With job description |
|:---:|:---:|
|  |  |
Note that setting a description in one thread while calling the action (e.g. `.count`) in a different thread
does not work, unless the different thread is spawned from the current thread _after_ the description has been set.
Working example with parallel collections:
```scala
import java.util.concurrent.ForkJoinPool
import scala.collection.parallel.CollectionConverters.seqIsParallelizable
import scala.collection.parallel.ForkJoinTaskSupport
val files = Seq("data1.csv", "data2.csv").par
val counts = withJobDescription("Counting rows") {
// new thread pool required to spawn new threads from this thread
// so that the job description is actually used
files.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool())
files.map(filename => spark.read.csv(filename).count).sum
}(spark)
```
</details>
## Using Spark Extension
The `spark-extension` package is available for all Spark 3.2, 3.3, 3.4 and 3.5 versions.
The package version has the following semantics: `spark-extension_{SCALA_COMPAT_VERSION}-{VERSION}-{SPARK_COMPAT_VERSION}`:
- `SCALA_COMPAT_VERSION`: Scala binary compatibility (minor) version. Available are `2.12` and `2.13`.
- `SPARK_COMPAT_VERSION`: Apache Spark binary compatibility (minor) version. Available are `3.2`, `3.3`, `3.4`, `3.5` and `4.0`.
- `VERSION`: The package version, e.g. `2.14.0`.
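For example, `uk.co.gresearch.spark:spark-extension_2.13:2.14.0-3.5` refers to package version `2.14.0` built for Scala `2.13` and Spark `3.5`.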
### SBT
Add this line to your `build.sbt` file:
```sbt
libraryDependencies += "uk.co.gresearch.spark" %% "spark-extension" % "2.15.0-3.5"
```
### Maven
Add this dependency to your `pom.xml` file:
```xml
<dependency>
<groupId>uk.co.gresearch.spark</groupId>
<artifactId>spark-extension_2.12</artifactId>
<version>2.15.0-3.5</version>
</dependency>
```
### Gradle
Add this dependency to your `build.gradle` file:
```groovy
dependencies {
implementation "uk.co.gresearch.spark:spark-extension_2.12:2.15.0-3.5"
}
```
### Spark Submit
Submit your Spark app with the Spark Extension dependency (version ≥1.1.0) as follows:
```shell script
spark-submit --packages uk.co.gresearch.spark:spark-extension_2.12:2.15.0-3.5 [jar]
```
Note: Pick the right Scala version (here 2.12) and Spark version (here 3.5) depending on your Spark version.
### Spark Shell
Launch a Spark Shell with the Spark Extension dependency (version ≥1.1.0) as follows:
```shell script
spark-shell --packages uk.co.gresearch.spark:spark-extension_2.12:2.15.0-3.5
```
Note: Pick the right Scala version (here 2.12) and Spark version (here 3.5) depending on your Spark Shell version.
### Python
#### PySpark API
Start a PySpark session with the Spark Extension dependency (version ≥1.1.0) as follows:
```python
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.config("spark.jars.packages", "uk.co.gresearch.spark:spark-extension_2.12:2.15.0-3.5") \
.getOrCreate()
```
Note: Pick the right Scala version (here 2.12) and Spark version (here 3.5) depending on your PySpark version.
#### PySpark REPL
Launch the Python Spark REPL with the Spark Extension dependency (version ≥1.1.0) as follows:
```shell script
pyspark --packages uk.co.gresearch.spark:spark-extension_2.12:2.15.0-3.5
```
Note: Pick the right Scala version (here 2.12) and Spark version (here 3.5) depending on your PySpark version.
#### PySpark `spark-submit`
Run your Python scripts that use PySpark via `spark-submit`:
```shell script
spark-submit --packages uk.co.gresearch.spark:spark-extension_2.12:2.15.0-3.5 [script.py]
```
Note: Pick the right Scala version (here 2.12) and Spark version (here 3.5) depending on your Spark version.
#### PyPi package (local Spark cluster only)
You may want to install the `pyspark-extension` python package from PyPi into your development environment.
This provides code completion, typing and test capabilities during your development phase.
Running your Python application on a Spark cluster will still require one of the above ways
to add the Scala package to the Spark environment.
```shell script
pip install pyspark-extension==2.15.0.3.5
```
Note: Pick the right Spark version (here 3.5) depending on your PySpark version.
### Your favorite Data Science notebook
There are plenty of [Data Science notebooks](https://datasciencenotebook.org/) around. To use this library,
add **a jar dependency** to your notebook using these **Maven coordinates**:
```
uk.co.gresearch.spark:spark-extension_2.12:2.15.0-3.5
```
Or [download the jar](https://mvnrepository.com/artifact/uk.co.gresearch.spark/spark-extension) and place it
on a filesystem where it is accessible by the notebook, and reference that jar file directly.
Check the documentation of your favorite notebook to learn how to add jars to your Spark environment.
## Known issues
### Spark Connect Server
Most features are not supported **in Python** in conjunction with a [Spark Connect server](https://spark.apache.org/docs/latest/spark-connect-overview.html).
This also holds for Databricks Runtime environment 13.x and above. Details can be found [in this blog](https://semyonsinchenko.github.io/ssinchenko/post/how-databricks-14x-breaks-3dparty-compatibility/).
Calling any of those features when connected to a Spark Connect server will raise this error:
```
This feature is not supported for Spark Connect.
```
Use a classic connection to a Spark cluster instead.
## Build
You can build this project against different versions of Spark and Scala.
### Switch Spark and Scala version
If you want to build for a Spark or Scala version different to what is defined in the `pom.xml` file, then run
```shell script
sh set-version.sh [SPARK-VERSION] [SCALA-VERSION]
```
For example, switch to Spark 3.5.0 and Scala 2.13.8 by running `sh set-version.sh 3.5.0 2.13.8`.
### Build the Scala project
Then execute `mvn package` to create a jar from the sources. It can be found in `target/`.
## Testing
Run the Scala tests via `mvn test`.
### Setup Python environment
In order to run the Python tests, set up a Python environment as follows:
```shell script
virtualenv -p python3 venv
source venv/bin/activate
pip install python/[test]
```
### Run Python tests
Run the Python tests via `env PYTHONPATH=python/test python -m pytest python/test`.
### Build Python package
Run the following commands in the project root directory to create a whl from the sources:
```shell script
pip install build
python -m build python/
```
It can be found in `python/dist/`.
## Publications
- ***Guaranteeing in-partition order for partitioned-writing in Apache Spark**, Enrico Minack, 20/01/2023*:<br/>https://www.gresearch.com/blog/article/guaranteeing-in-partition-order-for-partitioned-writing-in-apache-spark/
- ***Un-pivot, sorted groups and many bug fixes: Celebrating the first Spark 3.4 release**, Enrico Minack, 21/03/2023*:<br/>https://www.gresearch.com/blog/article/un-pivot-sorted-groups-and-many-bug-fixes-celebrating-the-first-spark-3-4-release/
- ***A PySpark bug makes co-grouping with window function partition-key-order-sensitive**, Enrico Minack, 29/03/2023*:<br/>https://www.gresearch.com/blog/article/a-pyspark-bug-makes-co-grouping-with-window-function-partition-key-order-sensitive/
- ***Spark’s groupByKey should be avoided – and here’s why**, Enrico Minack, 13/06/2023*:<br/>https://www.gresearch.com/blog/article/sparks-groupbykey-should-be-avoided-and-heres-why/
- ***Inspecting Parquet files with Spark**, Enrico Minack, 28/07/2023*:<br/>https://www.gresearch.com/blog/article/parquet-files-know-your-scaling-limits/
- ***Enhancing Spark’s UI with Job Descriptions**, Enrico Minack, 12/12/2023*:<br/>https://www.gresearch.com/blog/article/enhancing-sparks-ui-with-job-descriptions/
- ***PySpark apps with dependencies: Managing Python dependencies in code**, Enrico Minack, 24/01/2024*:<br/>https://www.gresearch.com/news/pyspark-apps-with-dependencies-managing-python-dependencies-in-code/
- ***Observing Spark Aggregates: Cheap Metrics from Datasets**, Enrico Minack, 06/02/2024*:<br/>https://www.gresearch.com/news/observing-spark-aggregates-cheap-metrics-from-datasets-2/
## Security
Please see our [security policy](https://github.com/G-Research/spark-extension/blob/master/SECURITY.md) for details on reporting security vulnerabilities.
================================================
FILE: RELEASE.md
================================================
# Releasing Spark Extension
This document provides instructions on how to release a version of `spark-extension`. We release this library
for a number of Spark and Scala environments, all from the same git tag. First, release for the environment
that is set in the `pom.xml` and create a tag. On success, release from that tag for all other environments
as described below.
Use the `release.sh` script to test and release all versions. Or execute the following steps manually.
## Testing master for all environments
The following steps release a snapshot and test it. Test all versions listed [further down](#releasing-master-for-other-environments).
- Set the version with `./set-version.sh`, e.g. `./set-version.sh 3.4.0 2.12.17`
- Release a snapshot (make sure the version in the `pom.xml` file ends with `SNAPSHOT`): `mvn clean deploy`
- Test the released snapshot: `./test-release.sh`
## Releasing from master
Follow this procedure to release a new version:
- Add a new entry to `CHANGELOG.md` listing all notable changes of this release.
Use the heading `## [VERSION] - YYYY-MM-dd`, e.g. `## [1.1.0] - 2020-03-12`.
- Remove the `-SNAPSHOT` suffix from the version, e.g. `./set-version.sh 1.1.0`.
- Update the versions in the `README.md` and `python/README.md` file to the version of your `pom.xml` to reflect the latest version,
e.g. replace all `1.0.0-3.1` with `1.1.0-3.1` and `1.0.0.3.1` with `1.1.0.3.1`, respectively.
- Commit the change to your local git repository, use a commit message like `Releasing 1.1.0`. Do not push to github yet.
- Tag that commit with a version tag like `v1.1.0` and message like `Release v1.1.0`. Do not push to github yet.
- Release the version with `mvn clean deploy`. This will be put into a staging repository and not automatically released (due to `<autoReleaseAfterClose>false</autoReleaseAfterClose>` in your [`pom.xml`](pom.xml) file).
- Inspect and test the staged version. Use `./test-release.sh` or the `spark-examples` project for that. If you are happy with everything:
- Push the commit and tag to origin.
- Release the package with `mvn nexus-staging:release`.
- Bump the version to the next [minor version](https://semver.org/) and append the `-SNAPSHOT` suffix again: `./set-version.sh 1.2.0-SNAPSHOT`.
- Commit this change to your local git repository, use a commit message like `Post-release version bump to 1.2.0`.
- Push all local commits to origin.
- Otherwise drop it with `mvn nexus-staging:drop`. Remove the last two commits from your local history.
## Releasing master for other environments
Once you have released the new version, release from the same tag for all other Spark and Scala environments as well:
- Release for the following environments; one of these has already been released above from the tagged version:
|Spark|Scala|
|:----|:----|
|3.2 |2.12.15 and 2.13.5|
|3.3 |2.12.15 and 2.13.8|
|3.4 |2.12.17 and 2.13.8|
|3.5 |2.12.17 and 2.13.8|
- Always use the latest Spark version per Spark minor version
- Release process:
- Checkout the release tag, e.g. `git checkout v1.0.0`
- Set the version in the `pom.xml` file via `set-version.sh`, e.g. `./set-version.sh 3.4.0 2.12.17`
- Review the `pom.xml` file changes: `git diff pom.xml`
- Release the version with `mvn clean deploy`
- Inspect and test the staged version. Use `./test-release.sh` or the `spark-examples` project for that.
- If you are happy with everything, release the package with `mvn nexus-staging:release`.
- Otherwise drop it with `mvn nexus-staging:drop`.
- Revert the changes done to the `pom.xml` file: `git checkout pom.xml`
## Releasing a bug-fix version
A bug-fix version needs to be released from a [minor-version branch](https://semver.org/), e.g. `branch-1.1`.
### Create a bug-fix branch
If there is no bug-fix branch yet, create it:
- Create such a branch from the respective [minor-version tag](https://semver.org/), e.g. create minor version branch `branch-1.1` from tag `v1.1.0`.
- Bump the version to the next [patch version](https://semver.org/) in `pom.xml` and append the `-SNAPSHOT` suffix again, e.g. `1.1.0` → `1.1.1-SNAPSHOT`.
- Commit this change to your local git repository, use a commit message like `Post-release version bump to 1.1.1`.
- Push this commit to origin.
Merge your bug fixes into this branch as you would normally do for master, use PRs for that.
### Release from a bug-fix branch
This is very similar to [releasing from master](#releasing-from-master),
but the version increment occurs on [patch level](https://semver.org/):
- Add a new entry to `CHANGELOG.md` listing all notable changes of this release.
Use the heading `## [VERSION] - YYYY-MM-dd`, e.g. `## [1.1.1] - 2020-03-12`.
- Remove the `-SNAPSHOT` suffix from the version, e.g. `./set-version.sh 1.1.1`.
- Update the versions in the `README.md` and `python/README.md` file to the version of your `pom.xml` to reflect the latest version,
e.g. replace all `1.1.0-3.1` with `1.1.1-3.1` and `1.1.0.3.1` with `1.1.1.3.1`, respectively.
- Commit the change to your local git repository, use a commit message like `Releasing 1.1.1`. Do not push to github yet.
- Tag that commit with a version tag like `v1.1.1` and message like `Release v1.1.1`. Do not push to github yet.
- Release the version with `mvn clean deploy`. This will be put into a staging repository and not automatically released (due to `<autoReleaseAfterClose>false</autoReleaseAfterClose>` in your [`pom.xml`](pom.xml) file).
- Inspect and test the staged version. Use `./test-release.sh` or the `spark-examples` project for that. If you are happy with everything:
- Push the commit and tag to origin.
- Release the package with `mvn nexus-staging:release`.
- Bump the version to the next [patch version](https://semver.org/) and append the `-SNAPSHOT` suffix again: `./set-version.sh 1.1.2-SNAPSHOT`.
- Commit this change to your local git repository, use a commit message like `Post-release version bump to 1.1.2`.
- Push all local commits to origin.
- Otherwise drop it with `mvn nexus-staging:drop`. Remove the last two commits from your local history.
Consider releasing the bug-fix version for other environments as well. See [above](#releasing-master-for-other-environments) section for details.
================================================
FILE: ROW_NUMBER.md
================================================
# Global Row Number
Spark provides the [SQL function `row_number`](https://spark.apache.org/docs/latest/api/sql/index.html#row_number),
which assigns each row a consecutive number, starting from 1. This function works on a [Window](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/expressions/Window.html).
Assigning a row number over the entire Dataset this way moves all rows into a single partition / executor, which does not scale.
This project provides the `Dataset` transformation `withRowNumbers`, which assigns a global row number in a scalable way:
```scala
val df = Seq((1, "one"), (2, "TWO"), (2, "two"), (3, "three")).toDF("id", "value")
df.show()
// +---+-----+
// | id|value|
// +---+-----+
// | 1| one|
// | 2| TWO|
// | 2| two|
// | 3|three|
// +---+-----+
import uk.co.gresearch.spark._
df.withRowNumbers().show()
// +---+-----+----------+
// | id|value|row_number|
// +---+-----+----------+
// | 1| one| 1|
// | 2| two| 2|
// | 2| TWO| 3|
// | 3|three| 4|
// +---+-----+----------+
```
In Java:
```java
import uk.co.gresearch.spark.RowNumbers;
RowNumbers.of(df).show();
// +---+-----+----------+
// | id|value|row_number|
// +---+-----+----------+
// | 1| one| 1|
// | 2| two| 2|
// | 2| TWO| 3|
// | 3|three| 4|
// +---+-----+----------+
```
In Python:
```python
import gresearch.spark
df.with_row_numbers().show()
# +---+-----+----------+
# | id|value|row_number|
# +---+-----+----------+
# | 1| one| 1|
# | 2| two| 2|
# | 2| TWO| 3|
# | 3|three| 4|
# +---+-----+----------+
```
## Row number order
Row numbers are assigned in the current order of the Dataset. If you want a specific order, provide columns as follows:
```scala
df.withRowNumbers($"id".desc, $"value").show()
// +---+-----+----------+
// | id|value|row_number|
// +---+-----+----------+
// | 3|three| 1|
// | 2| TWO| 2|
// | 2| two| 3|
// | 1| one| 4|
// +---+-----+----------+
```
In Java:
```java
RowNumbers.withOrderColumns(df.col("id").desc(), df.col("value")).of(df).show();
// +---+-----+----------+
// | id|value|row_number|
// +---+-----+----------+
// | 3|three| 1|
// | 2| TWO| 2|
// | 2| two| 3|
// | 1| one| 4|
// +---+-----+----------+
```
In Python:
```python
df.with_row_numbers(order=[df.id.desc(), df.value]).show()
# +---+-----+----------+
# | id|value|row_number|
# +---+-----+----------+
# | 3|three| 1|
# | 2| TWO| 2|
# | 2| two| 3|
# | 1| one| 4|
# +---+-----+----------+
```
## Row number column name
The column name that contains the row number can be changed by providing the `rowNumberColumnName` argument:
```scala
df.withRowNumbers(rowNumberColumnName="row").show()
// +---+-----+---+
// | id|value|row|
// +---+-----+---+
// | 1| one| 1|
// | 2| TWO| 2|
// | 2| two| 3|
// | 3|three| 4|
// +---+-----+---+
```
In Java:
```java
RowNumbers.withRowNumberColumnName("row").of(df).show();
// +---+-----+---+
// | id|value|row|
// +---+-----+---+
// | 1| one| 1|
// | 2| TWO| 2|
// | 2| two| 3|
// | 3|three| 4|
// +---+-----+---+
```
In Python:
```python
df.with_row_numbers(row_number_column_name='row').show()
# +---+-----+---+
# | id|value|row|
# +---+-----+---+
# | 1| one| 1|
# | 2| TWO| 2|
# | 2| two| 3|
# | 3|three| 4|
# +---+-----+---+
```
## Cached / persisted intermediate Dataset
The `withRowNumbers` transformation
[caches](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html#cache():Dataset.this.type) /
[persists](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html#persist(newLevel:org.apache.spark.storage.StorageLevel):Dataset.this.type)
the input Dataset after adding an intermediate column. You can specify the persistence level through the `storageLevel` parameter:
```scala
import org.apache.spark.storage.StorageLevel
val dfWithRowNumbers = df.withRowNumbers(storageLevel=StorageLevel.DISK_ONLY)
```
In Java:
```java
import org.apache.spark.storage.StorageLevel;
Dataset<Row> dfWithRowNumbers = RowNumbers.withStorageLevel(StorageLevel.DISK_ONLY()).of(df);
```
In Python:
```python
from pyspark.storagelevel import StorageLevel
df_with_row_numbers = df.with_row_numbers(storage_level=StorageLevel.DISK_ONLY)
```
## Un-persist intermediate Dataset
If you want control over when to un-persist this intermediate Dataset, you can provide an `UnpersistHandle` and call it
when you are done with the result Dataset:
```scala
import uk.co.gresearch.spark.UnpersistHandle
val unpersist = UnpersistHandle()
val dfWithRowNumbers = df.withRowNumbers(unpersistHandle=unpersist)
// after you are done with dfWithRowNumbers you may want to call unpersist()
unpersist(blocking=false)
```
In Java:
```java
import uk.co.gresearch.spark.UnpersistHandle;
UnpersistHandle unpersist = new UnpersistHandle();
Dataset<Row> dfWithRowNumbers = RowNumbers.withUnpersistHandle(unpersist).of(df);
// after you are done with dfWithRowNumbers you may want to call unpersist()
unpersist.apply(true);
```
In Python:
```python
unpersist = spark.unpersist_handle()
df_with_row_numbers = df.with_row_numbers(unpersist_handle=unpersist)
# after you are done with df_with_row_numbers you may want to call unpersist()
unpersist(blocking=True)
```
## Spark warning
You will notice that Spark logs the following warning:
```
WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
```
This warning is unavoidable, because `withRowNumbers` has to pull information about the initial partitions into a single partition.
Fortunately, only 12 bytes per input partition are required, so even a million input partitions amount to roughly 12 MB. This easily fits into a single partition, and the warning can safely be ignored.
## Known issues
Note that this feature is not supported in Python when connected to a [Spark Connect server](README.md#spark-connect-server).
================================================
FILE: SECURITY.md
================================================
# Security and Coordinated Vulnerability Disclosure Policy
This project appreciates and encourages coordinated disclosure of security vulnerabilities. We prefer that you use the GitHub reporting mechanism to privately report vulnerabilities. Under the main repository's security tab, click "Report a vulnerability" to open the advisory form.
If you are unable to report it via GitHub, have received no response after repeated attempts, or have other security-related questions, please contact security@gr-oss.io and mention this project in the subject line.
================================================
FILE: build-whl.sh
================================================
#!/bin/bash
set -eo pipefail
base=$(cd "$(dirname "$0")"; pwd)
version=$(grep --max-count=1 "<version>.*</version>" "$base/pom.xml" | sed -E -e "s/\s*<[^>]+>//g")
artifact_id=$(grep --max-count=1 "<artifactId>.*</artifactId>" "$base/pom.xml" | sed -E -e "s/\s*<[^>]+>//g")
rm -rf "$base/python/pyspark/jars/$artifact_id-*.jar"
pip install build
python -m build "$base/python/"
# check for missing modules in whl file
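# derive the Python package version: SNAPSHOT becomes dev0 and dashes become dots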
pyversion=${version/SNAPSHOT/dev0}
pyversion=${pyversion//-/.}
missing="$(diff <(cd "$base/python"; find gresearch -type f | grep -v ".pyc$" | sort) <(unzip -l "$base/python/dist/pyspark_extension-${pyversion}-*.whl" | tail -n +4 | head -n -2 | sed -E -e "s/^ +//" -e "s/ +/ /g" | cut -d " " -f 4- | sort) | grep "^<" || true)"
if [ -n "$missing" ]
then
echo "These files are missing from the whl file:"
echo "$missing"
exit 1
fi
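# the wheel must bundle exactly one jar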
jars=$(unzip -l "$base/python/dist/pyspark_extension-${pyversion}-*.whl" | grep ".jar" | wc -l)
if [ $jars -ne 1 ]
then
echo "Expected exactly one jar in whl file, but $jars found!"
exit 1
fi
================================================
FILE: bump-version.sh
================================================
#!/bin/bash
#
# Copyright 2020 G-Research
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, s
SYMBOL INDEX (297 symbols across 18 files)
FILE: examples/python-deps/example.py
function main (line 3) | def main():
FILE: python/gresearch/spark/__init__.py
function _is_column (line 66) | def _is_column(obj: Any) -> bool:
function _is_column_or_str (line 70) | def _is_column_or_str(obj: Any) -> bool:
function _is_dataframe (line 74) | def _is_dataframe(obj: Any) -> bool:
function _check_java_pkg_is_installed (line 78) | def _check_java_pkg_is_installed(jvm: JVMView) -> bool:
function _get_jvm (line 91) | def _get_jvm(obj: Any) -> JVMView:
function _to_seq (line 127) | def _to_seq(jvm: JVMView, list: List[Any]) -> JavaObject:
function _to_map (line 132) | def _to_map(jvm: JVMView, map: Mapping[Any, Any]) -> JavaObject:
function backticks (line 136) | def backticks(*name_parts: str) -> str:
function distinct_prefix_for (line 145) | def distinct_prefix_for(existing: List[str]) -> str:
function handle_configured_case_sensitivity (line 158) | def handle_configured_case_sensitivity(column_name: str, case_sensitive:...
function list_contains_case_sensitivity (line 171) | def list_contains_case_sensitivity(column_names: Iterable[str], columnNa...
function list_filter_case_sensitivity (line 181) | def list_filter_case_sensitivity(column_names: Iterable[str], filter: It...
function list_diff_case_sensitivity (line 194) | def list_diff_case_sensitivity(column_names: Iterable[str], other: Itera...
function dotnet_ticks_to_timestamp (line 207) | def dotnet_ticks_to_timestamp(tick_column: Union[str, Column]) -> Column:
function dotnet_ticks_to_unix_epoch (line 237) | def dotnet_ticks_to_unix_epoch(tick_column: Union[str, Column]) -> Column:
function dotnet_ticks_to_unix_epoch_nanos (line 267) | def dotnet_ticks_to_unix_epoch_nanos(tick_column: Union[str, Column]) ->...
function timestamp_to_dotnet_ticks (line 297) | def timestamp_to_dotnet_ticks(timestamp_column: Union[str, Column]) -> C...
function unix_epoch_to_dotnet_ticks (line 326) | def unix_epoch_to_dotnet_ticks(unix_column: Union[str, Column]) -> Column:
function unix_epoch_nanos_to_dotnet_ticks (line 357) | def unix_epoch_nanos_to_dotnet_ticks(unix_column: Union[str, Column]) ->...
function count_null (line 389) | def count_null(e: "ColumnOrName") -> Column:
function histogram (line 410) | def histogram(self: DataFrame,
class UnpersistHandle (line 441) | class UnpersistHandle:
method __init__ (line 442) | def __init__(self, handle):
method __call__ (line 445) | def __call__(self, blocking: Optional[bool] = None):
function unpersist_handle (line 453) | def unpersist_handle(self: SparkSession) -> UnpersistHandle:
function _get_sort_cols (line 462) | def _get_sort_cols(df: DataFrame, order: Union[str, Column, List[Union[s...
function with_row_numbers (line 472) | def with_row_numbers(self: DataFrame,
function session (line 500) | def session(self: DataFrame) -> SparkSession:
function session_or_ctx (line 504) | def session_or_ctx(self: DataFrame) -> Union[SparkSession, SQLContext]:
function set_description (line 515) | def set_description(description: Optional[str], if_not_set: bool = False):
function job_description (line 527) | def job_description(description: str, if_not_set: bool = False):
function append_description (line 555) | def append_description(extra_description: str, separator: str = " - "):
function append_job_description (line 566) | def append_job_description(extra_description: str, separator: str = " - "):
function create_temporary_dir (line 593) | def create_temporary_dir(spark: Union[SparkSession, SparkContext], prefi...
function install_pip_package (line 612) | def install_pip_package(spark: Union[SparkSession, SparkContext], *packa...
function install_poetry_project (line 652) | def install_poetry_project(spark: Union[SparkSession, SparkContext],
FILE: python/gresearch/spark/diff/__init__.py
class deprecated (line 41) | class deprecated:
method __init__ (line 42) | def __init__(self, msg: str) -> None:
method __call__ (line 45) | def __call__(self, func: _T) -> _T:
class DiffMode (line 55) | class DiffMode(Enum):
method _to_java (line 64) | def _to_java(self, jvm: JVMView) -> JavaObject:
class DiffOptions (line 69) | class DiffOptions:
method with_diff_column (line 108) | def with_diff_column(self, diff_column: str) -> 'DiffOptions':
method with_left_column_prefix (line 121) | def with_left_column_prefix(self, left_column_prefix: str) -> 'DiffOpt...
method with_right_column_prefix (line 134) | def with_right_column_prefix(self, right_column_prefix: str) -> 'DiffO...
method with_insert_diff_value (line 147) | def with_insert_diff_value(self, insert_diff_value: str) -> 'DiffOptio...
method with_change_diff_value (line 160) | def with_change_diff_value(self, change_diff_value: str) -> 'DiffOptio...
method with_delete_diff_value (line 173) | def with_delete_diff_value(self, delete_diff_value: str) -> 'DiffOptio...
method with_nochange_diff_value (line 186) | def with_nochange_diff_value(self, nochange_diff_value: str) -> 'DiffO...
method with_change_column (line 199) | def with_change_column(self, change_column: str) -> 'DiffOptions':
method without_change_column (line 212) | def without_change_column(self) -> 'DiffOptions':
method with_diff_mode (line 222) | def with_diff_mode(self, diff_mode: DiffMode) -> 'DiffOptions':
method with_sparse_mode (line 235) | def with_sparse_mode(self, sparse_mode: bool) -> 'DiffOptions':
method with_default_comparator (line 248) | def with_default_comparator(self, comparator: DiffComparator) -> 'Diff...
method with_data_type_comparator (line 252) | def with_data_type_comparator(self, comparator: DiffComparator, *data_...
method with_column_name_comparator (line 267) | def with_column_name_comparator(self, comparator: DiffComparator, *col...
method comparator_for (line 282) | def comparator_for(self, column: StructField) -> DiffComparator:
class Differ (line 292) | class Differ:
method __init__ (line 299) | def __init__(self, options: DiffOptions = None):
method diff (line 303) | def diff(self, left: DataFrame, right: DataFrame, *id_columns: str) ->...
method diff (line 306) | def diff(self, left: DataFrame, right: DataFrame, id_columns: Iterable...
method diff (line 308) | def diff(self, left: DataFrame, right: DataFrame, *id_or_ignore_column...
method _columns_of_side (line 392) | def _columns_of_side(df: DataFrame, id_columns: List[str], side_prefix...
method diffwith (line 398) | def diffwith(self, left: DataFrame, right: DataFrame, *id_columns: str...
method diffwith (line 401) | def diffwith(self, left: DataFrame, right: DataFrame, id_columns: Iter...
method diffwith (line 403) | def diffwith(self, left: DataFrame, right: DataFrame, *id_or_ignore_co...
method _check_schema (line 448) | def _check_schema(self, left: DataFrame, right: DataFrame, id_columns:...
method _get_change_column (line 551) | def _get_change_column(self,
method _do_diff (line 566) | def _do_diff(self, left: DataFrame, right: DataFrame, id_columns: List...
method _get_diff_id_columns (line 604) | def _get_diff_id_columns(self, pk_columns: List[str],
method _get_diff_value_columns (line 609) | def _get_diff_value_columns(self, pk_columns: List[str],
method _get_diff_columns (line 667) | def _get_diff_columns(self, pk_columns: List[str],
function diff (line 678) | def diff(self: DataFrame, other: DataFrame, *id_columns: str) -> DataFra...
function diff (line 682) | def diff(self: DataFrame, other: DataFrame, id_columns: Iterable[str], i...
function diff (line 686) | def diff(self: DataFrame, other: DataFrame, *id_or_ignore_columns: Union...
function diff (line 690) | def diff(self: DataFrame, other: DataFrame, options: DiffOptions, *id_co...
function diff (line 694) | def diff(self: DataFrame, other: DataFrame, options: DiffOptions, id_col...
function diff (line 698) | def diff(self: DataFrame, other: DataFrame, options: DiffOptions, *id_or...
function diff (line 701) | def diff(self: DataFrame, other: DataFrame, *options_or_id_or_ignore_col...
function diffwith (line 784) | def diffwith(self: DataFrame, other: DataFrame, *id_columns: str) -> Dat...
function diffwith (line 788) | def diffwith(self: DataFrame, other: DataFrame, id_columns: Iterable[str...
function diffwith (line 792) | def diffwith(self: DataFrame, other: DataFrame, *id_or_ignore_columns: U...
function diffwith (line 796) | def diffwith(self: DataFrame, other: DataFrame, options: DiffOptions, *i...
function diffwith (line 800) | def diffwith(self: DataFrame, other: DataFrame, options: DiffOptions, id...
function diffwith (line 804) | def diffwith(self: DataFrame, other: DataFrame, options: DiffOptions, *i...
function diffwith (line 807) | def diffwith(self: DataFrame, other: DataFrame, *options_or_id_or_ignore...
function diff_with_options (line 842) | def diff_with_options(self: DataFrame, other: DataFrame, options: DiffOp...
function diff_with_options (line 846) | def diff_with_options(self: DataFrame, other: DataFrame, options: DiffOp...
function diff_with_options (line 850) | def diff_with_options(self: DataFrame, other: DataFrame, options: DiffOp...
function diffwith_with_options (line 872) | def diffwith_with_options(self: DataFrame, other: DataFrame, options: Di...
function diffwith_with_options (line 876) | def diffwith_with_options(self: DataFrame, other: DataFrame, options: Di...
function diffwith_with_options (line 880) | def diffwith_with_options(self: DataFrame, other: DataFrame, options: Di...
FILE: python/gresearch/spark/diff/comparator/__init__.py
class DiffComparator (line 27) | class DiffComparator(abc.ABC):
method equiv (line 29) | def equiv(self, left: Column, right: Column) -> Column:
class DiffComparators (line 33) | class DiffComparators:
method default (line 35) | def default() -> 'DefaultDiffComparator':
method nullSafeEqual (line 39) | def nullSafeEqual() -> 'NullSafeEqualDiffComparator':
method epsilon (line 43) | def epsilon(epsilon: float) -> 'EpsilonDiffComparator':
method string (line 48) | def string(whitespace_agnostic: bool = True) -> 'StringDiffComparator':
method duration (line 53) | def duration(duration: str) -> 'DurationDiffComparator':
method map (line 58) | def map(key_type: DataType, value_type: DataType, key_order_sensitive:...
class NullSafeEqualDiffComparator (line 65) | class NullSafeEqualDiffComparator(DiffComparator):
method equiv (line 66) | def equiv(self, left: Column, right: Column) -> Column:
class DefaultDiffComparator (line 72) | class DefaultDiffComparator(NullSafeEqualDiffComparator):
method _to_java (line 74) | def _to_java(self, jvm: JVMView) -> JavaObject:
class EpsilonDiffComparator (line 79) | class EpsilonDiffComparator(DiffComparator):
method as_relative (line 84) | def as_relative(self) -> 'EpsilonDiffComparator':
method as_absolute (line 87) | def as_absolute(self) -> 'EpsilonDiffComparator':
method as_inclusive (line 90) | def as_inclusive(self) -> 'EpsilonDiffComparator':
method as_exclusive (line 93) | def as_exclusive(self) -> 'EpsilonDiffComparator':
method equiv (line 96) | def equiv(self, left: Column, right: Column) -> Column:
class StringDiffComparator (line 113) | class StringDiffComparator(DiffComparator):
method equiv (line 116) | def equiv(self, left: Column, right: Column) -> Column:
class DurationDiffComparator (line 123) | class DurationDiffComparator(DiffComparator):
method as_inclusive (line 127) | def as_inclusive(self) -> 'DurationDiffComparator':
method as_exclusive (line 130) | def as_exclusive(self) -> 'DurationDiffComparator':
method equiv (line 133) | def equiv(self, left: Column, right: Column) -> Column:
class MapDiffComparator (line 140) | class MapDiffComparator(DiffComparator):
method equiv (line 145) | def equiv(self, left: Column, right: Column) -> Column:
FILE: python/gresearch/spark/parquet/__init__.py
function _jreader (line 29) | def _jreader(reader: DataFrameReader) -> JavaObject:
function parquet_metadata (line 34) | def parquet_metadata(self: DataFrameReader, *paths: str, parallelism: Op...
function parquet_schema (line 69) | def parquet_schema(self: DataFrameReader, *paths: str, parallelism: Opti...
function parquet_blocks (line 104) | def parquet_blocks(self: DataFrameReader, *paths: str, parallelism: Opti...
function parquet_block_columns (line 136) | def parquet_block_columns(self: DataFrameReader, *paths: str, parallelis...
function parquet_partitions (line 172) | def parquet_partitions(self: DataFrameReader, *paths: str, parallelism: ...
FILE: python/setup.py
class custom_sdist (line 36) | class custom_sdist(sdist):
method make_distribution (line 37) | def make_distribution(self):
FILE: python/test/spark_common.py
function spark_session (line 30) | def spark_session():
class SparkTest (line 38) | class SparkTest(unittest.TestCase):
method main (line 41) | def main(file: str):
method get_pom_path (line 58) | def get_pom_path() -> str:
method get_spark_config (line 66) | def get_spark_config(path) -> SparkConf:
method get_spark_session (line 80) | def get_spark_session(cls) -> SparkSession:
method setUpClass (line 101) | def setUpClass(cls):
method tearDownClass (line 107) | def tearDownClass(cls):
method sql_conf (line 113) | def sql_conf(self, pairs):
FILE: python/test/test_diff.py
class DiffTest (line 27) | class DiffTest(SparkTest):
method assert_requirement (line 32) | def assert_requirement(self, error_message: str):
method setUpClass (line 38) | def setUpClass(cls):
method test_check_schema (line 206) | def test_check_schema(self):
method test_dataframe_diff (line 493) | def test_dataframe_diff(self):
method test_dataframe_diff_with_ids_ignored (line 497) | def test_dataframe_diff_with_ids_ignored(self):
method test_dataframe_diff_with_wrong_argument_types (line 501) | def test_dataframe_diff_with_wrong_argument_types(self):
method test_dataframe_diffwith (line 545) | def test_dataframe_diffwith(self):
method test_dataframe_diffwith_with_default_options (line 550) | def test_dataframe_diffwith_with_default_options(self):
method test_dataframe_diffwith_with_options (line 555) | def test_dataframe_diffwith_with_options(self):
method test_dataframe_diffwith_with_ignored (line 561) | def test_dataframe_diffwith_with_ignored(self):
method test_dataframe_diffwith_with_wrong_argument_types (line 566) | def test_dataframe_diffwith_with_wrong_argument_types(self):
method test_dataframe_diff_with_default_options (line 610) | def test_dataframe_diff_with_default_options(self):
method test_dataframe_diff_with_options (line 616) | def test_dataframe_diff_with_options(self):
method test_dataframe_diff_with_options_and_ignored (line 623) | def test_dataframe_diff_with_options_and_ignored(self):
method test_dataframe_diff_with_changes (line 630) | def test_dataframe_diff_with_changes(self):
method test_dataframe_diff_with_diff_mode_column_by_column (line 637) | def test_dataframe_diff_with_diff_mode_column_by_column(self):
method test_dataframe_diff_with_diff_mode_side_by_side (line 644) | def test_dataframe_diff_with_diff_mode_side_by_side(self):
method test_dataframe_diff_with_diff_mode_left_side (line 651) | def test_dataframe_diff_with_diff_mode_left_side(self):
method test_dataframe_diff_with_diff_mode_right_side (line 658) | def test_dataframe_diff_with_diff_mode_right_side(self):
method test_dataframe_diff_with_sparse_mode (line 665) | def test_dataframe_diff_with_sparse_mode(self):
method test_differ_diff (line 672) | def test_differ_diff(self):
method test_differ_diffwith (line 676) | def test_differ_diffwith(self):
method test_differ_diff_with_default_options (line 681) | def test_differ_diff_with_default_options(self):
method test_differ_diff_with_options (line 686) | def test_differ_diff_with_options(self):
method test_differ_diff_with_changes (line 691) | def test_differ_diff_with_changes(self):
method test_differ_diff_in_diff_mode_column_by_column (line 696) | def test_differ_diff_in_diff_mode_column_by_column(self):
method test_differ_diff_in_diff_mode_side_by_side (line 701) | def test_differ_diff_in_diff_mode_side_by_side(self):
method test_differ_diff_in_diff_mode_left_side (line 706) | def test_differ_diff_in_diff_mode_left_side(self):
method test_differ_diff_in_diff_mode_right_side (line 711) | def test_differ_diff_in_diff_mode_right_side(self):
method test_differ_diff_with_sparse_mode (line 716) | def test_differ_diff_with_sparse_mode(self):
method test_diff_options_default (line 722) | def test_diff_options_default(self):
method test_diff_mode_consts (line 747) | def test_diff_mode_consts(self):
method test_diff_options_comparator_for (line 759) | def test_diff_options_comparator_for(self):
method test_diff_fluent_setters (line 774) | def test_diff_fluent_setters(self):
method test_diff_with_epsilon_comparator (line 834) | def test_diff_with_epsilon_comparator(self):
method test_diff_options_with_duplicate_comparators (line 862) | def test_diff_options_with_duplicate_comparators(self):
FILE: python/test/test_histogram.py
class HistogramTest (line 22) | class HistogramTest(SparkTest):
method setUpClass (line 25) | def setUpClass(cls):
method test_histogram_with_ints (line 37) | def test_histogram_with_ints(self):
method test_histogram_with_floats (line 45) | def test_histogram_with_floats(self):
FILE: python/test/test_job_description.py
class JobDescriptionTest (line 25) | class JobDescriptionTest(SparkTest):
method _assert_job_description (line 27) | def _assert_job_description(self, expected: Optional[str]):
method setUp (line 41) | def setUp(self) -> None:
method test_with_job_description (line 44) | def test_with_job_description(self):
method test_append_job_description (line 59) | def test_append_job_description(self):
FILE: python/test/test_jvm.py
class PackageTest (line 31) | class PackageTest(SparkTest):
method setUpClass (line 35) | def setUpClass(cls):
method test_get_jvm_classic (line 40) | def test_get_jvm_classic(self):
method test_get_jvm_connect (line 51) | def test_get_jvm_connect(self):
method test_get_jvm_check_java_pkg_is_installed (line 64) | def test_get_jvm_check_java_pkg_is_installed(self):
method test_dotnet_ticks (line 79) | def test_dotnet_ticks(self):
method test_histogram (line 94) | def test_histogram(self):
method test_with_row_numbers (line 100) | def test_with_row_numbers(self):
method test_job_description (line 106) | def test_job_description(self):
method test_create_temp_dir (line 118) | def test_create_temp_dir(self):
method test_install_pip_package (line 124) | def test_install_pip_package(self):
method test_install_poetry_project (line 130) | def test_install_poetry_project(self):
method test_parquet (line 136) | def test_parquet(self):
FILE: python/test/test_package.py
class PackageTest (line 40) | class PackageTest(SparkTest):
method setUpClass (line 43) | def setUpClass(cls):
method compare_dfs (line 106) | def compare_dfs(self, expected, actual):
method test_backticks (line 116) | def test_backticks(self):
method test_distinct_prefix_for (line 124) | def test_distinct_prefix_for(self):
method test_handle_configured_case_sensitivity (line 133) | def test_handle_configured_case_sensitivity(self):
method test_list_contains_case_sensitivity (line 146) | def test_list_contains_case_sensitivity(self):
method test_list_filter_case_sensitivity (line 158) | def test_list_filter_case_sensitivity(self):
method test_list_diff_case_sensitivity (line 170) | def test_list_diff_case_sensitivity(self):
method test_dotnet_ticks_to_timestamp (line 183) | def test_dotnet_ticks_to_timestamp(self):
method test_dotnet_ticks_to_unix_epoch (line 191) | def test_dotnet_ticks_to_unix_epoch(self):
method test_dotnet_ticks_to_unix_epoch_nanos (line 199) | def test_dotnet_ticks_to_unix_epoch_nanos(self):
method test_timestamp_to_dotnet_ticks (line 208) | def test_timestamp_to_dotnet_ticks(self):
method test_unix_epoch_dotnet_ticks (line 218) | def test_unix_epoch_dotnet_ticks(self):
method test_unix_epoch_nanos_to_dotnet_ticks (line 226) | def test_unix_epoch_nanos_to_dotnet_ticks(self):
method test_count_null (line 233) | def test_count_null(self):
method test_session (line 242) | def test_session(self):
method test_session_or_ctx (line 246) | def test_session_or_ctx(self):
method test_create_temp_dir (line 251) | def test_create_temp_dir(self):
method test_install_pip_package (line 259) | def test_install_pip_package(self):
method test_install_pip_package_unknown_argument (line 283) | def test_install_pip_package_unknown_argument(self):
method test_install_pip_package_package_not_found (line 289) | def test_install_pip_package_package_not_found(self):
method test_install_pip_package_not_supported (line 295) | def test_install_pip_package_not_supported(self):
method test_install_poetry_project (line 306) | def test_install_poetry_project(self):
method test_install_poetry_project_wrong_arguments (line 342) | def test_install_poetry_project_wrong_arguments(self):
method test_install_poetry_project_not_supported (line 353) | def test_install_poetry_project_not_supported(self):
FILE: python/test/test_parquet.py
class ParquetTest (line 23) | class ParquetTest(SparkTest):
method test_parquet_metadata (line 27) | def test_parquet_metadata(self):
method test_parquet_schema (line 33) | def test_parquet_schema(self):
method test_parquet_blocks (line 39) | def test_parquet_blocks(self):
method test_parquet_block_columns (line 45) | def test_parquet_block_columns(self):
method test_parquet_partitions (line 51) | def test_parquet_partitions(self):
FILE: python/test/test_row_number.py
class RowNumberTest (line 24) | class RowNumberTest(SparkTest):
method setUpClass (line 27) | def setUpClass(cls):
method test_row_numbers (line 68) | def test_row_numbers(self):
method test_row_numbers_order_one_column (line 72) | def test_row_numbers_order_one_column(self):
method test_row_numbers_order_two_columns (line 78) | def test_row_numbers_order_two_columns(self):
method test_row_numbers_order_not_asc_one_column (line 84) | def test_row_numbers_order_not_asc_one_column(self):
method test_row_numbers_order_not_asc_two_columns (line 90) | def test_row_numbers_order_not_asc_two_columns(self):
method test_row_numbers_order_desc_one_column (line 96) | def test_row_numbers_order_desc_one_column(self):
method test_row_numbers_order_desc_two_columns (line 102) | def test_row_numbers_order_desc_two_columns(self):
method test_row_numbers_unpersist (line 108) | def test_row_numbers_unpersist(self):
method test_row_numbers_row_number_col_name (line 130) | def test_row_numbers_row_number_col_name(self):
FILE: src/test/java/uk/co/gresearch/test/SparkJavaTests.java
class SparkJavaTests (line 37) | public class SparkJavaTests {
method beforeClass (line 41) | @BeforeClass
method testBackticks (line 57) | @Test
method testHistogram (line 68) | @Test
method testHistogramWithAggColumn (line 75) | @Test
method testRowNumbers (line 86) | @Test
method testRowNumbersOrderOneColumn (line 97) | @Test
method testRowNumbersOrderTwoColumns (line 108) | @Test
method testRowNumbersOrderDesc (line 119) | @Test
method testRowNumbersUnpersist (line 130) | @Test
method testRowNumbersStorageLevelAndUnpersist (line 150) | @Test
method testRowNumbersColumnName (line 164) | @Test
method afterClass (line 177) | @AfterClass
FILE: src/test/java/uk/co/gresearch/test/diff/DiffJavaTests.java
class DiffJavaTests (line 36) | public class DiffJavaTests {
method beforeClass (line 41) | @BeforeClass
method testDiff (line 61) | @Test
method testDiffNoKey (line 73) | @Test
method testDiffSingleKey (line 86) | @Test
method testDiffMultipleKeys (line 98) | @Test
method testDiffIgnoredColumn (line 110) | @Test
method testDiffAs (line 122) | @Test
method testDiffOfWith (line 135) | @Test
method testDiffer (line 147) | @Test
method testDifferWithIgnored (line 162) | @Test
method testDiffWithOptions (line 179) | @Test
method testDiffWithComparators (line 203) | @Test
method testDiffWithComparator (line 221) | private void testDiffWithComparator(DiffOptions options) {
method afterClass (line 235) | @AfterClass
FILE: src/test/java/uk/co/gresearch/test/diff/JavaValue.java
class JavaValue (line 22) | public class JavaValue implements Serializable {
method JavaValue (line 27) | public JavaValue() { }
method JavaValue (line 29) | public JavaValue(Integer id, String label, Double score) {
method getId (line 35) | public Integer getId() {
method setId (line 39) | public void setId(Integer id) {
method getLabel (line 43) | public String getLabel() {
method setLabel (line 47) | public void setLabel(String label) {
method getScore (line 51) | public Double getScore() {
method setScore (line 55) | public void setScore(Double score) {
method equals (line 59) | @Override
method hashCode (line 68) | @Override
method toString (line 73) | @Override
FILE: src/test/java/uk/co/gresearch/test/diff/JavaValueAs.java
class JavaValueAs (line 22) | public class JavaValueAs implements Serializable {
method JavaValueAs (line 30) | public JavaValueAs() { }
method JavaValueAs (line 32) | public JavaValueAs(String diff, Integer id, String left_label, String ...
method getDiff (line 41) | public String getDiff() {
method setDiff (line 45) | public void setDiff(String diff) {
method getId (line 49) | public Integer getId() {
method setId (line 53) | public void setId(Integer id) {
method getLeft_label (line 57) | public String getLeft_label() {
method setLeft_label (line 61) | public void setLeft_label(String left_label) {
method getRight_label (line 65) | public String getRight_label() {
method setRight_label (line 69) | public void setRight_label(String right_label) {
method getLeft_score (line 73) | public Double getLeft_score() {
method setLeft_score (line 77) | public void setLeft_score(Double left_score) {
method getRight_score (line 81) | public Double getRight_score() {
method setRight_score (line 85) | public void setRight_score(Double right_score) {
method equals (line 89) | @Override
method hashCode (line 98) | @Override
method toString (line 103) | @Override
Condensed preview — 128 files, each showing path, character count, and a content snippet.
[
{
"path": ".github/actions/build-whl/action.yml",
"chars": 3795,
"preview": "name: 'Build Whl'\nauthor: 'EnricoMi'\ndescription: 'A GitHub Action that builds pyspark-extension package'\n\ninputs:\n spa"
},
{
"path": ".github/actions/check-compat/action.yml",
"chars": 3810,
"preview": "name: 'Check'\nauthor: 'EnricoMi'\ndescription: 'A GitHub Action that checks compatibility of spark-extension'\n\ninputs:\n "
},
{
"path": ".github/actions/prime-caches/action.yml",
"chars": 3815,
"preview": "name: 'Prime caches'\nauthor: 'EnricoMi'\ndescription: 'A GitHub Action that primes caches'\n\ninputs:\n spark-version:\n "
},
{
"path": ".github/actions/test-jvm/action.yml",
"chars": 3755,
"preview": "name: 'Test JVM'\nauthor: 'EnricoMi'\ndescription: 'A GitHub Action that tests JVM spark-extension'\n\ninputs:\n spark-versi"
},
{
"path": ".github/actions/test-python/action.yml",
"chars": 9408,
"preview": "name: 'Test Python'\nauthor: 'EnricoMi'\ndescription: 'A GitHub Action that tests Python spark-extension'\n\n# pyspark is no"
},
{
"path": ".github/actions/test-release/action.yml",
"chars": 8094,
"preview": "name: 'Test Release'\nauthor: 'EnricoMi'\ndescription: 'A GitHub Action that tests spark-extension release'\n\n# pyspark is "
},
{
"path": ".github/dependabot.yml",
"chars": 208,
"preview": "version: 2\nupdates:\n - package-ecosystem: \"github-actions\"\n directory: \"/\"\n schedule:\n interval: \"monthly\"\n\n"
},
{
"path": ".github/show-spark-versions.sh",
"chars": 989,
"preview": "#!/bin/bash\n\nbase=$(cd \"$(dirname \"$0\")\"; pwd)\n\ngrep -- \"-version\" \"$base\"/workflows/prime-caches.yml | sed -e \"s/ -//g\""
},
{
"path": ".github/workflows/build-jvm.yml",
"chars": 3193,
"preview": "name: Build JVM\n\non:\n workflow_call:\n\njobs:\n build:\n name: Build (Spark ${{ matrix.spark-version }} Scala ${{ matri"
},
{
"path": ".github/workflows/build-python.yml",
"chars": 2463,
"preview": "name: Build Python\n\non:\n workflow_call:\n\njobs:\n # pyspark<4 is not available for snapshots or scala other than 2.12\n "
},
{
"path": ".github/workflows/build-snapshots.yml",
"chars": 2887,
"preview": "name: Build Snapshots\n\non:\n workflow_call:\n\njobs:\n build:\n name: Build (Spark ${{ matrix.spark-version }} Scala ${{"
},
{
"path": ".github/workflows/check.yml",
"chars": 3530,
"preview": "name: Check\n\non:\n workflow_call:\n\njobs:\n lint:\n name: Scala lint\n runs-on: ubuntu-latest\n\n steps:\n - nam"
},
{
"path": ".github/workflows/ci.yml",
"chars": 2126,
"preview": "name: CI\n\non:\n schedule:\n - cron: '0 8 */10 * *'\n push:\n branches:\n - 'master'\n tags:\n - '*'\n merg"
},
{
"path": ".github/workflows/clear-caches.yaml",
"chars": 758,
"preview": "name: Clear caches\n\non:\n workflow_dispatch:\n\npermissions:\n actions: write\n\njobs:\n clear-cache:\n runs-on: ubuntu-la"
},
{
"path": ".github/workflows/prepare-release.yml",
"chars": 7156,
"preview": "name: Prepare release\n\non:\n workflow_dispatch:\n inputs:\n github_release_latest:\n description: 'Make the "
},
{
"path": ".github/workflows/prime-caches.yml",
"chars": 5653,
"preview": "name: Prime caches\n\non:\n workflow_dispatch:\n\njobs:\n prime:\n name: Spark ${{ matrix.spark-compat-version }}.${{ matr"
},
{
"path": ".github/workflows/publish-release.yml",
"chars": 8117,
"preview": "name: Publish release\n\non:\n workflow_dispatch:\n inputs:\n versions:\n required: true\n type: string\n"
},
{
"path": ".github/workflows/publish-snapshot.yml",
"chars": 5561,
"preview": "name: Publish snapshot\n\non:\n workflow_dispatch:\n push:\n branches: [\"master\"]\n\nenv:\n PYTHON_VERSION: \"3.10\"\n\njobs:\n"
},
{
"path": ".github/workflows/test-jvm.yml",
"chars": 3458,
"preview": "name: Test JVM\n\non:\n workflow_call:\n\njobs:\n test:\n name: Test (Spark ${{ matrix.spark-compat-version }}.${{ matrix."
},
{
"path": ".github/workflows/test-python.yml",
"chars": 3712,
"preview": "name: Test Python\n\non:\n workflow_call:\n\njobs:\n # pyspark is not available for snapshots or scala other than 2.12\n # w"
},
{
"path": ".github/workflows/test-release.yml",
"chars": 3567,
"preview": "name: Test release\n\non:\n workflow_call:\n\njobs:\n test:\n name: Test Release Spark ${{ matrix.spark-version }} Scala $"
},
{
"path": ".github/workflows/test-results.yml",
"chars": 971,
"preview": "name: Test Results\n\non:\n workflow_run:\n workflows: [\"CI\"]\n types:\n - completed\npermissions: {}\n\njobs:\n publ"
},
{
"path": ".github/workflows/test-snapshots.yml",
"chars": 2926,
"preview": "name: Test Snapshots\n\non:\n workflow_call:\n\njobs:\n test:\n name: Test (Spark ${{ matrix.spark-version }} Scala ${{ ma"
},
{
"path": ".gitignore",
"chars": 427,
"preview": "# use glob syntax.\nsyntax: glob\n*.ser\n*.class\n*~\n*.bak\n#*.off\n*.old\n\n# eclipse conf file\n.settings\n.classpath\n.project\n."
},
{
"path": ".scalafmt.conf",
"chars": 124,
"preview": "version = 3.7.17\nrunner.dialect = scala213\nrewrite.trailingCommas.style = keep\ndocstrings.style = Asterisk\nmaxColumn = 1"
},
{
"path": "CHANGELOG.md",
"chars": 5328,
"preview": "# Changelog\nAll notable changes to this project will be documented in this file.\n\nThe format is based on [Keep a Changel"
},
{
"path": "CONDITIONAL.md",
"chars": 1675,
"preview": "# DataFrame Transformations\n\nThe Spark `Dataset` API allows for chaining transformations as in the following example:\n\n`"
},
{
"path": "DIFF.md",
"chars": 23271,
"preview": "# Spark Diff\n\nAdd the following `import` to your Scala code:\n\n```scala\nimport uk.co.gresearch.spark.diff._\n```\n\nor this "
},
{
"path": "GROUPS.md",
"chars": 2387,
"preview": "# Sorted Groups\n\nSpark provides the ability to group rows by an arbitrary key,\nwhile then providing an iterator for each"
},
{
"path": "HISTOGRAM.md",
"chars": 1530,
"preview": "# Histogram\n\nFor a table `df` like\n\n|user |score|\n|:-----:|:---:|\n|Alice |101 |\n|Alice |221 |\n|Alice |211 |\n|Ali"
},
{
"path": "LICENSE",
"chars": 11358,
"preview": "\n Apache License\n Version 2.0, January 2004\n "
},
{
"path": "MAINTAINERS.md",
"chars": 294,
"preview": "## Current maintainers of the project\n\n| Maintainer | GitHub ID "
},
{
"path": "PARQUET.md",
"chars": 18403,
"preview": "# Parquet Metadata\n\nThe structure of Parquet files (the metadata, not the data stored in Parquet) can be inspected simil"
},
{
"path": "PARTITIONING.md",
"chars": 6266,
"preview": "# Partitioned Writing\n\nIf you have ever used `Dataset[T].write.partitionBy`, here is how you can minimize the number of\n"
},
{
"path": "PYSPARK-DEPS.md",
"chars": 4681,
"preview": "# PySpark dependencies\n\nUsing PySpark on a cluster requires all cluster nodes to have those Python packages installed th"
},
{
"path": "README.md",
"chars": 14602,
"preview": "# Spark Extension\n\nThis project provides extensions to the [Apache Spark project](https://spark.apache.org/) in Scala an"
},
{
"path": "RELEASE.md",
"chars": 6253,
"preview": "# Releasing Spark Extension\n\nThis provides instructions on how to release a version of `spark-extension`. We release thi"
},
{
"path": "ROW_NUMBER.md",
"chars": 6125,
"preview": "# Global Row Number\n\nSpark provides the [SQL function `row_number`](https://spark.apache.org/docs/latest/api/sql/index.h"
},
{
"path": "SECURITY.md",
"chars": 559,
"preview": "# Security and Coordinated Vulnerability Disclosure Policy\n\nThis project appreciates and encourages coordinated disclosu"
},
{
"path": "build-whl.sh",
"chars": 1057,
"preview": "#!/bin/bash\n\nset -eo pipefail\n\nbase=$(cd \"$(dirname \"$0\")\"; pwd)\n\nversion=$(grep --max-count=1 \"<version>.*</version>\" \""
},
{
"path": "bump-version.sh",
"chars": 2106,
"preview": "#!/bin/bash\n#\n# Copyright 2020 G-Research\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may "
},
{
"path": "examples/python-deps/Dockerfile",
"chars": 137,
"preview": "FROM apache/spark:3.5.0\n\nENV PATH=\"${PATH}:/opt/spark/bin\"\n\nUSER root\nRUN mkdir -p /home/spark; chown spark:spark /home/"
},
{
"path": "examples/python-deps/docker-compose.yml",
"chars": 946,
"preview": "version: \"3\"\nservices:\n master:\n container_name: spark-master\n image: spark-extension-example-docker\n command:"
},
{
"path": "examples/python-deps/example.py",
"chars": 363,
"preview": "from pyspark.sql import SparkSession\n\ndef main():\n spark = SparkSession.builder.appName(\"spark_app\").getOrCreate()\n\n "
},
{
"path": "pom.xml",
"chars": 15231,
"preview": "<project xmlns=\"http://maven.apache.org/POM/4.0.0\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:schemaLocat"
},
{
"path": "python/README.md",
"chars": 5964,
"preview": "# Spark Extension\n\nThis project provides extensions to the [Apache Spark project](https://spark.apache.org/) in Scala an"
},
{
"path": "python/gresearch/__init__.py",
"chars": 586,
"preview": "# Copyright 2020 G-Research\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use thi"
},
{
"path": "python/gresearch/spark/__init__.py",
"chars": 27916,
"preview": "# Copyright 2020 G-Research\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use thi"
},
{
"path": "python/gresearch/spark/diff/__init__.py",
"chars": 44396,
"preview": "# Copyright 2020 G-Research\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use thi"
},
{
"path": "python/gresearch/spark/diff/comparator/__init__.py",
"chars": 5015,
"preview": "# Copyright 2022 G-Research\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use thi"
},
{
"path": "python/gresearch/spark/parquet/__init__.py",
"chars": 9968,
"preview": "# Copyright 2023 G-Research\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use thi"
},
{
"path": "python/pyproject.toml",
"chars": 81,
"preview": "[build-system]\nrequires = [\"setuptools\"]\nbuild-backend = \"setuptools.build_meta\"\n"
},
{
"path": "python/pyspark/jars/.gitignore",
"chars": 71,
"preview": "# Ignore everything in this directory\n*\n# Except this file\n!.gitignore\n"
},
{
"path": "python/setup.py",
"chars": 4685,
"preview": "#!/usr/bin/env python3\n\n# Copyright 2023 G-Research\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\")"
},
{
"path": "python/test/__init__.py",
"chars": 586,
"preview": "# Copyright 2020 G-Research\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use thi"
},
{
"path": "python/test/spark_common.py",
"chars": 4871,
"preview": "# Copyright 2020 G-Research\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use thi"
},
{
"path": "python/test/test_diff.py",
"chars": 51605,
"preview": "# Copyright 2020 G-Research\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use thi"
},
{
"path": "python/test/test_histogram.py",
"chars": 1944,
"preview": "# Copyright 2020 G-Research\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use thi"
},
{
"path": "python/test/test_job_description.py",
"chars": 3029,
"preview": "# Copyright 2023 G-Research\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use thi"
},
{
"path": "python/test/test_jvm.py",
"chars": 7073,
"preview": "# Copyright 2024 G-Research\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use thi"
},
{
"path": "python/test/test_package.py",
"chars": 19364,
"preview": "# Copyright 2023 G-Research\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use thi"
},
{
"path": "python/test/test_parquet.py",
"chars": 3274,
"preview": "# Copyright 2023 G-Research\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use thi"
},
{
"path": "python/test/test_row_number.py",
"chars": 6073,
"preview": "# Copyright 2022 G-Research\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use thi"
},
{
"path": "release.sh",
"chars": 5347,
"preview": "#!/bin/bash\n#\n# Copyright 2020 G-Research\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may "
},
{
"path": "set-version.sh",
"chars": 1901,
"preview": "#!/bin/bash\n\nif [ $# -eq 1 ]\nthen\n IFS=-\n read version flavour <<< \"$1\"\n\n echo \"setting version=$version${flavo"
},
{
"path": "src/main/scala/uk/co/gresearch/package.scala",
"chars": 3620,
"preview": "/*\n * Copyright 2020 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/BuildVersion.scala",
"chars": 2365,
"preview": "/*\n * Copyright 2022 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/Histogram.scala",
"chars": 4062,
"preview": "/*\n * Copyright 2020 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/RowNumbers.scala",
"chars": 4915,
"preview": "/*\n * Copyright 2023 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/SparkVersion.scala",
"chars": 1253,
"preview": "/*\n * Copyright 2023 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/UnpersistHandle.scala",
"chars": 2344,
"preview": "/*\n * Copyright 2022 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/diff/App.scala",
"chars": 12225,
"preview": "/*\n * Copyright 2023 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/diff/Diff.scala",
"chars": 39451,
"preview": "/*\n * Copyright 2020 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/diff/DiffComparators.scala",
"chars": 5079,
"preview": "/*\n * Copyright 2022 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/diff/DiffOptions.scala",
"chars": 17896,
"preview": "/*\n * Copyright 2020 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/diff/comparator/DefaultDiffComparator.scala",
"chars": 847,
"preview": "/*\n * Copyright 2022 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/diff/comparator/DiffComparator.scala",
"chars": 753,
"preview": "/*\n * Copyright 2022 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/diff/comparator/DurationDiffComparator.scala",
"chars": 2175,
"preview": "/*\n * Copyright 2022 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/diff/comparator/EpsilonDiffComparator.scala",
"chars": 1662,
"preview": "/*\n * Copyright 2022 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/diff/comparator/EquivDiffComparator.scala",
"chars": 4767,
"preview": "/*\n * Copyright 2022 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/diff/comparator/MapDiffComparator.scala",
"chars": 3572,
"preview": "/*\n * Copyright 2022 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/diff/comparator/NullSafeEqualDiffComparator.scala",
"chars": 821,
"preview": "/*\n * Copyright 2022 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/diff/comparator/TypedDiffComparator.scala",
"chars": 1091,
"preview": "/*\n * Copyright 2022 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/diff/comparator/WhitespaceDiffComparator.scala",
"chars": 1034,
"preview": "/*\n * Copyright 2023 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/diff/package.scala",
"chars": 16219,
"preview": "/*\n * Copyright 2020 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/group/package.scala",
"chars": 7997,
"preview": "/*\n * Copyright 2022 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/package.scala",
"chars": 40809,
"preview": "/*\n * Copyright 2020 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/parquet/ParquetMetaDataUtil.scala",
"chars": 4267,
"preview": "/*\n * Copyright 2023 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/parquet/package.scala",
"chars": 24395,
"preview": "/*\n * Copyright 2023 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala-spark-3.2/uk/co/gresearch/spark/parquet/SplitFile.scala",
"chars": 948,
"preview": "/*\n * Copyright 2023 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala-spark-3.3/uk/co/gresearch/spark/parquet/SplitFile.scala",
"chars": 963,
"preview": "/*\n * Copyright 2023 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala-spark-3.5/org/apache/spark/sql/extension/package.scala",
"chars": 967,
"preview": "/*\n * Copyright 2024 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala-spark-3.5/uk/co/gresearch/spark/Backticks.scala",
"chars": 2628,
"preview": "/*\n * Copyright 2021 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala-spark-4.0/org/apache/spark/sql/extension/package.scala",
"chars": 1033,
"preview": "/*\n * Copyright 2024 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala-spark-4.0/uk/co/gresearch/spark/Backticks.scala",
"chars": 2085,
"preview": "/*\n * Copyright 2021 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala-spark-4.0/uk/co/gresearch/spark/parquet/SplitFile.scala",
"chars": 972,
"preview": "/*\n * Copyright 2023 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/test/java/uk/co/gresearch/test/SparkJavaTests.java",
"chars": 7408,
"preview": "/*\n * Copyright 2021 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/test/java/uk/co/gresearch/test/diff/DiffJavaTests.java",
"chars": 10479,
"preview": "/*\n * Copyright 2021 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/test/java/uk/co/gresearch/test/diff/JavaValue.java",
"chars": 1960,
"preview": "/*\n * Copyright 2021 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/test/java/uk/co/gresearch/test/diff/JavaValueAs.java",
"chars": 3220,
"preview": "/*\n * Copyright 2022 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/test/resources/log4j.properties",
"chars": 1900,
"preview": "#\n# Copyright 2020 G-Research\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this"
},
{
"path": "src/test/resources/log4j2.properties",
"chars": 3259,
"preview": "#\n# Copyright 2020 G-Research\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this"
},
{
"path": "src/test/scala/uk/co/gresearch/spark/GroupBySuite.scala",
"chars": 10120,
"preview": "/*\n * Copyright 2022 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/test/scala/uk/co/gresearch/spark/HistogramSuite.scala",
"chars": 9241,
"preview": "/*\n * Copyright 2020 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/test/scala/uk/co/gresearch/spark/SparkSuite.scala",
"chars": 26482,
"preview": "/*\n * Copyright 2020 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/test/scala/uk/co/gresearch/spark/SparkTestSession.scala",
"chars": 1188,
"preview": "/*\n * Copyright 2020 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/test/scala/uk/co/gresearch/spark/WritePartitionedSuite.scala",
"chars": 9869,
"preview": "/*\n * Copyright 2020 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/test/scala/uk/co/gresearch/spark/diff/AppSuite.scala",
"chars": 4712,
"preview": "/*\n * Copyright 2023 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/test/scala/uk/co/gresearch/spark/diff/DiffComparatorSuite.scala",
"chars": 25605,
"preview": "/*\n * Copyright 2020 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/test/scala/uk/co/gresearch/spark/diff/DiffOptionsSuite.scala",
"chars": 9802,
"preview": "/*\n * Copyright 2020 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/test/scala/uk/co/gresearch/spark/diff/DiffSuite.scala",
"chars": 73772,
"preview": "/*\n * Copyright 2020 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/test/scala/uk/co/gresearch/spark/diff/examples/Examples.scala",
"chars": 2456,
"preview": "/*\n * Copyright 2020 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/test/scala/uk/co/gresearch/spark/group/GroupSuite.scala",
"chars": 8998,
"preview": "/*\n * Copyright 2022 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/test/scala/uk/co/gresearch/spark/parquet/ParquetSuite.scala",
"chars": 24186,
"preview": "/*\n * Copyright 2023 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/test/scala/uk/co/gresearch/spark/test/package.scala",
"chars": 1022,
"preview": "/*\n * Copyright 2020 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/test/scala/uk/co/gresearch/test/ClasspathSuite.scala",
"chars": 2035,
"preview": "/*\n * Copyright 2025 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/test/scala/uk/co/gresearch/test/Spec.scala",
"chars": 806,
"preview": "/*\n * Copyright 2025 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/test/scala/uk/co/gresearch/test/Suite.scala",
"chars": 810,
"preview": "/*\n * Copyright 2025 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/test/scala-spark-3/uk/co/gresearch/spark/SparkSuiteHelper.scala",
"chars": 840,
"preview": "/*\n * Copyright 2024 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/test/scala-spark-4/uk/co/gresearch/spark/SparkSuiteHelper.scala",
"chars": 917,
"preview": "/*\n * Copyright 2024 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "test-release.py",
"chars": 2166,
"preview": "# this requires parquet-hadoop-*-tests.jar\n# fetch with mvn dependency:get -Dtransitive=false -Dartifact=org.apache.parq"
},
{
"path": "test-release.scala",
"chars": 2483,
"preview": "// this requires parquet-hadoop-*-tests.jar\n// fetch with mvn dependency:get -Dtransitive=false -Dartifact=org.apache.pa"
},
{
"path": "test-release.sh",
"chars": 2565,
"preview": "#!/bin/bash\n\nset -eo pipefail\n\nversion=$(grep --max-count=1 \"<version>.*</version>\" pom.xml | sed -E -e \"s/\\s*<[^>]+>//g"
}
]
// ... and 5 more files (truncated)