Repository: G-Research/spark-extension
Branch: master
Commit: 65c3dda4a96b
Files: 128
Total size: 857.8 KB
Directory structure:
gitextract_5zjc6xfa/
├── .github/
│ ├── actions/
│ │ ├── build-whl/
│ │ │ └── action.yml
│ │ ├── check-compat/
│ │ │ └── action.yml
│ │ ├── prime-caches/
│ │ │ └── action.yml
│ │ ├── test-jvm/
│ │ │ └── action.yml
│ │ ├── test-python/
│ │ │ └── action.yml
│ │ └── test-release/
│ │ └── action.yml
│ ├── dependabot.yml
│ ├── show-spark-versions.sh
│ └── workflows/
│ ├── build-jvm.yml
│ ├── build-python.yml
│ ├── build-snapshots.yml
│ ├── check.yml
│ ├── ci.yml
│ ├── clear-caches.yaml
│ ├── prepare-release.yml
│ ├── prime-caches.yml
│ ├── publish-release.yml
│ ├── publish-snapshot.yml
│ ├── test-jvm.yml
│ ├── test-python.yml
│ ├── test-release.yml
│ ├── test-results.yml
│ └── test-snapshots.yml
├── .gitignore
├── .scalafmt.conf
├── CHANGELOG.md
├── CONDITIONAL.md
├── DIFF.md
├── GROUPS.md
├── HISTOGRAM.md
├── LICENSE
├── MAINTAINERS.md
├── PARQUET.md
├── PARTITIONING.md
├── PYSPARK-DEPS.md
├── README.md
├── RELEASE.md
├── ROW_NUMBER.md
├── SECURITY.md
├── build-whl.sh
├── bump-version.sh
├── examples/
│ └── python-deps/
│ ├── Dockerfile
│ ├── docker-compose.yml
│ └── example.py
├── pom.xml
├── python/
│ ├── README.md
│ ├── gresearch/
│ │ ├── __init__.py
│ │ └── spark/
│ │ ├── __init__.py
│ │ ├── diff/
│ │ │ ├── __init__.py
│ │ │ └── comparator/
│ │ │ └── __init__.py
│ │ └── parquet/
│ │ └── __init__.py
│ ├── pyproject.toml
│ ├── pyspark/
│ │ └── jars/
│ │ └── .gitignore
│ ├── setup.py
│ └── test/
│ ├── __init__.py
│ ├── spark_common.py
│ ├── test_diff.py
│ ├── test_histogram.py
│ ├── test_job_description.py
│ ├── test_jvm.py
│ ├── test_package.py
│ ├── test_parquet.py
│ └── test_row_number.py
├── release.sh
├── set-version.sh
├── src/
│ ├── main/
│ │ ├── scala/
│ │ │ └── uk/
│ │ │ └── co/
│ │ │ └── gresearch/
│ │ │ ├── package.scala
│ │ │ └── spark/
│ │ │ ├── BuildVersion.scala
│ │ │ ├── Histogram.scala
│ │ │ ├── RowNumbers.scala
│ │ │ ├── SparkVersion.scala
│ │ │ ├── UnpersistHandle.scala
│ │ │ ├── diff/
│ │ │ │ ├── App.scala
│ │ │ │ ├── Diff.scala
│ │ │ │ ├── DiffComparators.scala
│ │ │ │ ├── DiffOptions.scala
│ │ │ │ ├── comparator/
│ │ │ │ │ ├── DefaultDiffComparator.scala
│ │ │ │ │ ├── DiffComparator.scala
│ │ │ │ │ ├── DurationDiffComparator.scala
│ │ │ │ │ ├── EpsilonDiffComparator.scala
│ │ │ │ │ ├── EquivDiffComparator.scala
│ │ │ │ │ ├── MapDiffComparator.scala
│ │ │ │ │ ├── NullSafeEqualDiffComparator.scala
│ │ │ │ │ ├── TypedDiffComparator.scala
│ │ │ │ │ └── WhitespaceDiffComparator.scala
│ │ │ │ └── package.scala
│ │ │ ├── group/
│ │ │ │ └── package.scala
│ │ │ ├── package.scala
│ │ │ └── parquet/
│ │ │ ├── ParquetMetaDataUtil.scala
│ │ │ └── package.scala
│ │ ├── scala-spark-3.2/
│ │ │ └── uk/
│ │ │ └── co/
│ │ │ └── gresearch/
│ │ │ └── spark/
│ │ │ └── parquet/
│ │ │ └── SplitFile.scala
│ │ ├── scala-spark-3.3/
│ │ │ └── uk/
│ │ │ └── co/
│ │ │ └── gresearch/
│ │ │ └── spark/
│ │ │ └── parquet/
│ │ │ └── SplitFile.scala
│ │ ├── scala-spark-3.5/
│ │ │ ├── org/
│ │ │ │ └── apache/
│ │ │ │ └── spark/
│ │ │ │ └── sql/
│ │ │ │ └── extension/
│ │ │ │ └── package.scala
│ │ │ └── uk/
│ │ │ └── co/
│ │ │ └── gresearch/
│ │ │ └── spark/
│ │ │ └── Backticks.scala
│ │ └── scala-spark-4.0/
│ │ ├── org/
│ │ │ └── apache/
│ │ │ └── spark/
│ │ │ └── sql/
│ │ │ └── extension/
│ │ │ └── package.scala
│ │ └── uk/
│ │ └── co/
│ │ └── gresearch/
│ │ └── spark/
│ │ ├── Backticks.scala
│ │ └── parquet/
│ │ └── SplitFile.scala
│ └── test/
│ ├── files/
│ │ ├── encrypted1.parquet
│ │ ├── encrypted2.parquet
│ │ ├── nested.parquet
│ │ └── test.parquet/
│ │ ├── file1.parquet
│ │ └── file2.parquet
│ ├── java/
│ │ └── uk/
│ │ └── co/
│ │ └── gresearch/
│ │ └── test/
│ │ ├── SparkJavaTests.java
│ │ └── diff/
│ │ ├── DiffJavaTests.java
│ │ ├── JavaValue.java
│ │ └── JavaValueAs.java
│ ├── resources/
│ │ ├── log4j.properties
│ │ └── log4j2.properties
│ ├── scala/
│ │ └── uk/
│ │ └── co/
│ │ └── gresearch/
│ │ ├── spark/
│ │ │ ├── GroupBySuite.scala
│ │ │ ├── HistogramSuite.scala
│ │ │ ├── SparkSuite.scala
│ │ │ ├── SparkTestSession.scala
│ │ │ ├── WritePartitionedSuite.scala
│ │ │ ├── diff/
│ │ │ │ ├── AppSuite.scala
│ │ │ │ ├── DiffComparatorSuite.scala
│ │ │ │ ├── DiffOptionsSuite.scala
│ │ │ │ ├── DiffSuite.scala
│ │ │ │ └── examples/
│ │ │ │ └── Examples.scala
│ │ │ ├── group/
│ │ │ │ └── GroupSuite.scala
│ │ │ ├── parquet/
│ │ │ │ └── ParquetSuite.scala
│ │ │ └── test/
│ │ │ └── package.scala
│ │ └── test/
│ │ ├── ClasspathSuite.scala
│ │ ├── Spec.scala
│ │ └── Suite.scala
│ ├── scala-spark-3/
│ │ └── uk/
│ │ └── co/
│ │ └── gresearch/
│ │ └── spark/
│ │ └── SparkSuiteHelper.scala
│ └── scala-spark-4/
│ └── uk/
│ └── co/
│ └── gresearch/
│ └── spark/
│ └── SparkSuiteHelper.scala
├── test-release.py
├── test-release.scala
└── test-release.sh
================================================
FILE CONTENTS
================================================
================================================
FILE: .github/actions/build-whl/action.yml
================================================
name: 'Build Whl'
author: 'EnricoMi'
description: 'A GitHub Action that builds pyspark-extension package'
inputs:
spark-version:
description: Spark version, e.g. 3.4.0, 3.4.0-SNAPSHOT, or 4.0.0-preview1
required: true
scala-version:
description: Scala version, e.g. 2.12.15
required: true
spark-compat-version:
description: Spark compatibility version, e.g. 3.4
required: true
scala-compat-version:
description: Scala compatibility version, e.g. 2.12
required: true
java-compat-version:
description: Java compatibility version, e.g. 8
required: true
python-version:
description: Python version, e.g. 3.8
required: true
runs:
using: 'composite'
steps:
- name: Fetch Binaries Artifact
uses: actions/download-artifact@v4
with:
name: Binaries-${{ inputs.spark-compat-version }}-${{ inputs.scala-compat-version }}
path: .
- name: Set versions in pom.xml
run: |
./set-version.sh ${{ inputs.spark-version }} ${{ inputs.scala-version }}
git diff
shell: bash
- name: Make this work with PySpark preview versions
if: contains(inputs.spark-version, 'preview')
run: |
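# append .dev1 to the pyspark requirement so pip accepts pre-release builds, and replace the {spark_compat_version}.0 placeholder with the exact preview version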
sed -i -e 's/f"\(pyspark~=.*\)"/f"\1.dev1"/' -e 's/f"\({spark_compat_version}.0\)"/"${{ inputs.spark-version }}"/g' python/setup.py
git diff python/setup.py
shell: bash
- name: Restore Maven packages cache
if: github.event_name != 'schedule'
uses: actions/cache/restore@v4
with:
path: ~/.m2/repository
key: ${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-${{ hashFiles('pom.xml') }}
restore-keys: |
${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-${{ hashFiles('pom.xml') }}
${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-
- name: Setup JDK ${{ inputs.java-compat-version }}
uses: actions/setup-java@v4
with:
java-version: ${{ inputs.java-compat-version }}
distribution: 'zulu'
- name: Fetch Release Test Dependencies
run: |
# Fetch Release Test Dependencies
echo "::group::mvn dependency:get"
mvn dependency:get -Dtransitive=false -Dartifact=org.apache.parquet:parquet-hadoop:1.16.0:jar:tests
echo "::endgroup::"
shell: bash
- name: Setup Python
uses: actions/setup-python@v5
with:
python-version: ${{ inputs.python-version }}
- name: Install Python dependencies
run: |
# Install Python dependencies
echo "::group::mvn compile"
python -m pip install --upgrade pip build twine
echo "::endgroup::"
shell: bash
- name: Build whl
run: |
# Build whl
echo "::group::build-whl.sh"
./build-whl.sh
echo "::endgroup::"
shell: bash
- name: Test whl
run: |
# Test whl
echo "::group::test-release.py"
twine check python/dist/*
# .dev1 allows this to work with preview versions
pip install python/dist/*.whl "pyspark~=${{ inputs.spark-compat-version }}.0.dev1"
python test-release.py
echo "::endgroup::"
shell: bash
- name: Upload whl
uses: actions/upload-artifact@v4
with:
name: Whl (Spark ${{ inputs.spark-compat-version }} Scala ${{ inputs.scala-compat-version }})
path: |
python/dist/*.whl
- name: Build whl with mvn
env:
JDK_JAVA_OPTIONS: --add-exports java.base/sun.nio.ch=ALL-UNNAMED --add-exports java.base/sun.util.calendar=ALL-UNNAMED
run: |
# Build whl with mvn
rm -rf target python/dist python/pyspark_extension.egg-info pyspark/jars/*.jar
echo "::group::build-whl.sh"
./build-whl.sh
echo "::endgroup::"
shell: bash
branding:
icon: 'check-circle'
color: 'green'
================================================
FILE: .github/actions/check-compat/action.yml
================================================
name: 'Check'
author: 'EnricoMi'
description: 'A GitHub Action that checks compatibility of spark-extension'
inputs:
spark-version:
description: Spark version, e.g. 3.4.0 or 3.4.0-SNAPSHOT
required: true
scala-version:
description: Scala version, e.g. 2.12.15
required: true
spark-compat-version:
description: Spark compatibility version, e.g. 3.4
required: true
scala-compat-version:
description: Scala compatibility version, e.g. 2.12
required: true
package-version:
description: Spark-Extension version to check against
required: true
runs:
using: 'composite'
steps:
- name: Fetch Binaries Artifact
uses: actions/download-artifact@v4
with:
name: Binaries-${{ inputs.spark-compat-version }}-${{ inputs.scala-compat-version }}
path: .
- name: Set versions in pom.xml
run: |
./set-version.sh ${{ inputs.spark-version }} ${{ inputs.scala-version }}
git diff
shell: bash
- name: Restore Maven packages cache
if: github.event_name != 'schedule'
uses: actions/cache/restore@v4
with:
path: ~/.m2/repository
key: ${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-${{ hashFiles('pom.xml') }}
restore-keys: |
${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-${{ hashFiles('pom.xml') }}
${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-
- name: Setup JDK 1.8
uses: actions/setup-java@v4
with:
java-version: '8'
distribution: 'zulu'
- name: Install Checker
run: |
# Install Checker
echo "::group::apt update install"
sudo apt update
sudo apt install japi-compliance-checker
echo "::endgroup::"
shell: bash
- name: Release exists
id: exists
continue-on-error: true
run: |
# Release exists
curl --head --fail https://repo1.maven.org/maven2/uk/co/gresearch/spark/spark-extension_${{ inputs.scala-compat-version }}/${{ inputs.package-version }}-${{ inputs.spark-compat-version }}/spark-extension_${{ inputs.scala-compat-version }}-${{ inputs.package-version }}-${{ inputs.spark-compat-version }}.jar
shell: bash
- name: Fetch package
if: steps.exists.outcome == 'success'
run: |
# Fetch package
echo "::group::mvn dependency:get"
mvn dependency:get -Dtransitive=false -DremoteRepositories -Dartifact=uk.co.gresearch.spark:spark-extension_${{ inputs.scala-compat-version }}:${{ inputs.package-version }}-${{ inputs.spark-compat-version }}
echo "::endgroup::"
shell: bash
- name: Check
if: steps.exists.outcome == 'success'
continue-on-error: ${{ github.ref == 'refs/heads/master' }}
run: |
# Check
echo "::group::japi-compliance-checker"
ls -lah ~/.m2/repository/uk/co/gresearch/spark/spark-extension_${{ inputs.scala-compat-version }}/${{ inputs.package-version }}-${{ inputs.spark-compat-version }}/spark-extension_${{ inputs.scala-compat-version }}-${{ inputs.package-version }}-${{ inputs.spark-compat-version }}.jar target/spark-extension*.jar
japi-compliance-checker ~/.m2/repository/uk/co/gresearch/spark/spark-extension_${{ inputs.scala-compat-version }}/${{ inputs.package-version }}-${{ inputs.spark-compat-version }}/spark-extension_${{ inputs.scala-compat-version }}-${{ inputs.package-version }}-${{ inputs.spark-compat-version }}.jar target/spark-extension*.jar
echo "::endgroup::"
shell: bash
- name: Upload Report
uses: actions/upload-artifact@v4
if: always() && steps.exists.outcome == 'success'
with:
name: Compat-Report-${{ inputs.spark-compat-version }}
path: compat_reports/spark-extension/*
branding:
icon: 'check-circle'
color: 'green'
================================================
FILE: .github/actions/prime-caches/action.yml
================================================
name: 'Prime caches'
author: 'EnricoMi'
description: 'A GitHub Action that primes caches'
inputs:
spark-version:
description: Spark version, e.g. 3.4.0 or 3.4.0-SNAPSHOT
required: true
scala-version:
description: Scala version, e.g. 2.12.15
required: true
spark-compat-version:
description: Spark compatibility version, e.g. 3.4
required: true
scala-compat-version:
description: Scala compatibility version, e.g. 2.12
required: true
java-compat-version:
description: Java compatibility version, e.g. 8
required: true
hadoop-version:
description: Hadoop version, e.g. 2.7 or 2
required: true
runs:
using: 'composite'
steps:
- name: Set versions in pom.xml
run: |
./set-version.sh ${{ inputs.spark-version }} ${{ inputs.scala-version }}
git diff
shell: bash
- name: Check Maven packages cache
id: mvn-build-cache
uses: actions/cache/restore@v4
with:
lookup-only: true
path: ~/.m2/repository
key: ${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-${{ hashFiles('pom.xml') }}
- name: Check Spark Binaries cache
id: spark-binaries-cache
uses: actions/cache/restore@v4
with:
lookup-only: true
path: ~/spark
key: ${{ runner.os }}-spark-binaries-${{ inputs.spark-version }}-${{ inputs.scala-compat-version }}
- name: Prepare priming caches
id: setup
run: |
# Prepare priming caches
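# re-prime the Maven cache for SNAPSHOT builds or when no cache entry matched this pom.xml; likewise for the Spark binaries cache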
if [[ "${{ inputs.spark-version }}" == *"-SNAPSHOT" ]] || [[ -z "${{ steps.mvn-build-cache.outputs.cache-hit }}" ]]; then
echo "prime-mvn-cache=true" >> "$GITHUB_ENV"
echo "prime-some-cache=true" >> "$GITHUB_ENV"
fi;
if [[ "${{ inputs.spark-version }}" == *"-SNAPSHOT" ]] || [[ -z "${{ steps.spark-binaries-cache.outputs.cache-hit }}" ]]; then
echo "prime-spark-cache=true" >> "$GITHUB_ENV"
echo "prime-some-cache=true" >> "$GITHUB_ENV"
fi;
shell: bash
- name: Setup JDK ${{ inputs.java-compat-version }}
if: env.prime-some-cache
uses: actions/setup-java@v4
with:
java-version: ${{ inputs.java-compat-version }}
distribution: 'zulu'
- name: Build
if: env.prime-mvn-cache
env:
JDK_JAVA_OPTIONS: --add-exports java.base/sun.nio.ch=ALL-UNNAMED --add-exports java.base/sun.util.calendar=ALL-UNNAMED
run: |
# Build
echo "::group::mvn dependency:go-offline"
mvn --batch-mode dependency:go-offline
echo "::endgroup::"
shell: bash
- name: Save Maven packages cache
if: env.prime-mvn-cache
uses: actions/cache/save@v4
with:
path: ~/.m2/repository
key: ${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-${{ hashFiles('pom.xml') }}-${{ github.run_id }}
- name: Setup Spark Binaries
if: env.prime-spark-cache && ! contains(inputs.spark-version, '-SNAPSHOT')
env:
SPARK_PACKAGE: spark-${{ inputs.spark-version }}/spark-${{ inputs.spark-version }}-bin-hadoop${{ inputs.hadoop-version }}${{ startsWith(inputs.spark-version, '3.') && inputs.scala-compat-version == '2.13' && '-scala2.13' || '' }}.tgz
run: |
wget --progress=dot:giga "https://www.apache.org/dyn/closer.lua/spark/${SPARK_PACKAGE}?action=download" -O - | tar -xzC "${{ runner.temp }}"
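# the tarball extracts into a directory named after the archive (without .tgz); move it to ~/spark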
archive=$(basename "${SPARK_PACKAGE}") bash -c "mv -v "${{ runner.temp }}/\${archive/%.tgz/}" ~/spark"
shell: bash
- name: Save Spark Binaries cache
if: env.prime-spark-cache && ! contains(inputs.spark-version, '-SNAPSHOT')
uses: actions/cache/save@v4
with:
path: ~/spark
key: ${{ runner.os }}-spark-binaries-${{ inputs.spark-version }}-${{ inputs.scala-compat-version }}-${{ github.run_id }}
branding:
icon: 'check-circle'
color: 'green'
================================================
FILE: .github/actions/test-jvm/action.yml
================================================
name: 'Test JVM'
author: 'EnricoMi'
description: 'A GitHub Action that tests JVM spark-extension'
inputs:
spark-version:
description: Spark version, e.g. 3.4.0, 3.4.0-SNAPSHOT or 4.0.0-preview1
required: true
spark-compat-version:
description: Spark compatibility version, e.g. 3.4
required: true
spark-archive-url:
description: The URL to download the Spark binary distribution
required: false
scala-version:
description: Scala version, e.g. 2.12.15
required: true
scala-compat-version:
description: Scala compatibility version, e.g. 2.12
required: true
hadoop-version:
description: Hadoop version, e.g. 2.7 or 2
required: true
java-compat-version:
description: Java compatibility version, e.g. 8
required: true
runs:
using: 'composite'
steps:
- name: Fetch Binaries Artifact
uses: actions/download-artifact@v4
with:
name: Binaries-${{ inputs.spark-compat-version }}-${{ inputs.scala-compat-version }}
path: .
- name: Set versions in pom.xml
run: |
./set-version.sh ${{ inputs.spark-version }} ${{ inputs.scala-version }}
git diff
shell: bash
- name: Restore Spark Binaries cache
if: github.event_name != 'schedule' && ! contains(inputs.spark-version, '-SNAPSHOT')
uses: actions/cache/restore@v4
with:
path: ~/spark
key: ${{ runner.os }}-spark-binaries-${{ inputs.spark-version }}-${{ inputs.scala-compat-version }}
restore-keys: |
${{ runner.os }}-spark-binaries-${{ inputs.spark-version }}-${{ inputs.scala-compat-version }}
- name: Setup Spark Binaries
if: ( ! contains(inputs.spark-version, '-SNAPSHOT') )
env:
SPARK_PACKAGE: spark-${{ inputs.spark-version }}/spark-${{ inputs.spark-version }}-bin-hadoop${{ inputs.hadoop-version }}${{ startsWith(inputs.spark-version, '3.') && inputs.scala-compat-version == '2.13' && '-scala2.13' || '' }}.tgz
run: |
# Setup Spark Binaries
if [[ ! -e ~/spark ]]
then
url="${{ inputs.spark-archive-url }}"
wget --progress=dot:giga "${url:-https://www.apache.org/dyn/closer.lua/spark/${SPARK_PACKAGE}?action=download}" -O - | tar -xzC "${{ runner.temp }}"
archive=$(basename "${SPARK_PACKAGE}") bash -c "mv -v "${{ runner.temp }}/\${archive/%.tgz/}" ~/spark"
fi
echo "SPARK_HOME=$(cd ~/spark; pwd)" >> $GITHUB_ENV
shell: bash
- name: Restore Maven packages cache
if: github.event_name != 'schedule'
uses: actions/cache/restore@v4
with:
path: ~/.m2/repository
key: ${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-${{ hashFiles('pom.xml') }}
restore-keys: |
${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-${{ hashFiles('pom.xml') }}
${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-
- name: Setup JDK ${{ inputs.java-compat-version }}
uses: actions/setup-java@v4
with:
java-version: ${{ inputs.java-compat-version }}
distribution: 'zulu'
- name: Scala and Java Tests
env:
JDK_JAVA_OPTIONS: --add-exports java.base/sun.nio.ch=ALL-UNNAMED --add-exports java.base/sun.util.calendar=ALL-UNNAMED
run: |
# Scala and Java Tests
echo "::group::mvn test"
mvn --batch-mode --update-snapshots -Dspotless.check.skip test integration-test
echo "::endgroup::"
shell: bash
- name: Upload Test Results
if: always()
uses: actions/upload-artifact@v4
with:
name: JVM Test Results (Spark ${{ inputs.spark-version }} Scala ${{ inputs.scala-version }})
path: |
target/surefire-*reports/*.xml
branding:
icon: 'check-circle'
color: 'green'
================================================
FILE: .github/actions/test-python/action.yml
================================================
name: 'Test Python'
author: 'EnricoMi'
description: 'A GitHub Action that tests Python spark-extension'
# pyspark is not available for snapshots or scala other than 2.12
# we would have to compile spark from sources for this, not worth it
# so this action only works with scala 2.12 and non-snapshot spark versions
inputs:
spark-version:
description: Spark version, e.g. 3.4.0 or 4.0.0-preview1
required: true
scala-version:
description: Scala version, e.g. 2.12.15
required: true
spark-compat-version:
description: Spark compatibility version, e.g. 3.4
required: true
spark-archive-url:
description: The URL to download the Spark binary distribution
required: false
spark-package-repo:
description: The URL of an alternate maven repository to fetch Spark packages
required: false
scala-compat-version:
description: Scala compatibility version, e.g. 2.12
required: true
java-compat-version:
description: Java compatibility version, e.g. 8
required: true
hadoop-version:
description: Hadoop version, e.g. 2.7 or 2
required: true
python-version:
description: Python version, e.g. 3.8
required: true
runs:
using: 'composite'
steps:
- name: Fetch Binaries Artifact
uses: actions/download-artifact@v4
with:
name: Binaries-${{ inputs.spark-compat-version }}-${{ inputs.scala-compat-version }}
path: .
- name: Set versions in pom.xml
run: |
./set-version.sh ${{ inputs.spark-version }} ${{ inputs.scala-version }}
git diff
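# read the project version from the first <version> element in pom.xml (strip tags and surrounding whitespace)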
SPARK_EXTENSION_VERSION=$(grep --max-count=1 "<version>.*</version>" pom.xml | sed -E -e "s/\s*<[^>]+>//g")
echo "SPARK_EXTENSION_VERSION=$SPARK_EXTENSION_VERSION" | tee -a "$GITHUB_ENV"
shell: bash
- name: Make this work with PySpark preview versions
if: contains(inputs.spark-version, 'preview')
run: |
sed -i -e 's/\({spark_compat_version}.0\)"/\1.dev1"/' python/setup.py
git diff python/setup.py
shell: bash
- name: Restore Spark Binaries cache
if: github.event_name != 'schedule' && ( startsWith(inputs.spark-version, '3.') && inputs.scala-compat-version == '2.12' || startsWith(inputs.spark-version, '4.') ) && ! contains(inputs.spark-version, '-SNAPSHOT')
uses: actions/cache/restore@v4
with:
path: ~/spark
key: ${{ runner.os }}-spark-binaries-${{ inputs.spark-version }}-${{ inputs.scala-compat-version }}
restore-keys: |
${{ runner.os }}-spark-binaries-${{ inputs.spark-version }}-${{ inputs.scala-compat-version }}
- name: Setup Spark Binaries
if: ( startsWith(inputs.spark-version, '3.') && inputs.scala-compat-version == '2.12' || startsWith(inputs.spark-version, '4.') ) && ! contains(inputs.spark-version, '-SNAPSHOT')
env:
SPARK_PACKAGE: spark-${{ inputs.spark-version }}/spark-${{ inputs.spark-version }}-bin-hadoop${{ inputs.hadoop-version }}${{ startsWith(inputs.spark-version, '3.') && inputs.scala-compat-version == '2.13' && '-scala2.13' || '' }}.tgz
run: |
# Setup Spark Binaries
if [[ ! -e ~/spark ]]
then
url="${{ inputs.spark-archive-url }}"
wget --progress=dot:giga "${url:-https://www.apache.org/dyn/closer.lua/spark/${SPARK_PACKAGE}?action=download}" -O - | tar -xzC "${{ runner.temp }}"
archive=$(basename "${SPARK_PACKAGE}") bash -c "mv -v "${{ runner.temp }}/\${archive/%.tgz/}" ~/spark"
fi
echo "SPARK_BIN_HOME=$(cd ~/spark; pwd)" >> $GITHUB_ENV
shell: bash
- name: Restore Maven packages cache
if: github.event_name != 'schedule'
uses: actions/cache/restore@v4
with:
path: ~/.m2/repository
key: ${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-${{ hashFiles('pom.xml') }}
restore-keys: |
${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-${{ hashFiles('pom.xml') }}
${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-
- name: Setup JDK ${{ inputs.java-compat-version }}
uses: actions/setup-java@v4
with:
java-version: ${{ inputs.java-compat-version }}
distribution: 'zulu'
- name: Setup Python
uses: actions/setup-python@v5
with:
python-version: ${{ inputs.python-version }}
- name: Install Python dependencies
run: |
# Install Python dependencies
echo "::group::pip install"
python -m venv .pytest-venv
.pytest-venv/bin/python -m pip install --upgrade pip
.pytest-venv/bin/pip install pypandoc
.pytest-venv/bin/pip install -e python/[test]
echo "::endgroup::"
PYSPARK_HOME=$(.pytest-venv/bin/python -c "import os; import pyspark; print(os.path.dirname(pyspark.__file__))")
PYSPARK_BIN_HOME="$(cd ".pytest-venv/"; pwd)"
PYSPARK_PYTHON="$PYSPARK_BIN_HOME/bin/python"
echo "PYSPARK_HOME=$PYSPARK_HOME" | tee -a "$GITHUB_ENV"
echo "PYSPARK_BIN_HOME=$PYSPARK_BIN_HOME" | tee -a "$GITHUB_ENV"
echo "PYSPARK_PYTHON=$PYSPARK_PYTHON" | tee -a "$GITHUB_ENV"
shell: bash
- name: Prepare Poetry tests
run: |
# Prepare Poetry tests
echo "::group::Prepare poetry tests"
# install poetry in venv
python -m venv .poetry-venv
.poetry-venv/bin/python -m pip install poetry
# env var needed by poetry tests
echo "POETRY_PYTHON=$PWD/.poetry-venv/bin/python" | tee -a "$GITHUB_ENV"
# clone example poetry project
git clone https://github.com/Textualize/rich.git .rich
cd .rich
git reset --hard 20024635c06c22879fd2fd1e380ec4cccd9935dd
# env var needed by poetry tests
echo "RICH_SOURCES=$PWD" | tee -a "$GITHUB_ENV"
echo "::endgroup::"
shell: bash
- name: Python Unit Tests
env:
SPARK_HOME: ${{ env.PYSPARK_HOME }}
PYTHONPATH: python/test
run: |
.pytest-venv/bin/python -m pytest python/test --junit-xml test-results/pytest-$(date +%s.%N)-$RANDOM.xml
shell: bash
- name: Install Spark Extension
run: |
# Install Spark Extension
echo "::group::mvn install"
mvn --batch-mode --update-snapshots install -Dspotless.check.skip -DskipTests -Dmaven.test.skip=true -Dgpg.skip
echo "::endgroup::"
shell: bash
- name: Start Spark Connect
id: spark-connect
if: ( contains('3.4,3.5', inputs.spark-compat-version) && inputs.scala-compat-version == '2.12' || startsWith(inputs.spark-version, '4.') ) && ! contains(inputs.spark-version, '-SNAPSHOT')
env:
SPARK_HOME: ${{ env.SPARK_BIN_HOME }}
CONNECT_GRPC_BINDING_ADDRESS: 127.0.0.1
CONNECT_GRPC_BINDING_PORT: 15002
run: |
# Start Spark Connect
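# try up to 10 times to start the Spark Connect server; it is considered up once port 15002 is listening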
for attempt in {1..10}; do
$SPARK_HOME/sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_${{ inputs.scala-compat-version }}:${{ inputs.spark-version }} --repositories "${{ inputs.spark-package-repo }}"
sleep 10
for log in $SPARK_HOME/logs/spark-*-org.apache.spark.sql.connect.service.SparkConnectServer-*.out; do
echo "::group::Spark Connect server log: $log"
eoc="EOC-$RANDOM"
echo "::stop-commands::$eoc"
cat "$log" || true
echo "::$eoc::"
echo "::endgroup::"
done
if netstat -an | grep 15002; then
break;
fi
echo "::warning title=Starting Spark Connect server failed::Attempt #$attempt to start Spark Connect server failed"
$SPARK_HOME/sbin/stop-connect-server.sh --packages org.apache.spark:spark-connect_${{ inputs.scala-compat-version }}:${{ inputs.spark-version }}
sleep 5
done
if ! netstat -an | grep 15002; then
echo "::error title=Starting Spark Connect server failed::All attempts to start Spark Connect server failed"
exit 1
fi
shell: bash
- name: Python Unit Tests (Spark Connect)
if: steps.spark-connect.outcome == 'success'
env:
SPARK_HOME: ${{ env.PYSPARK_HOME }}
PYTHONPATH: python/test
TEST_SPARK_CONNECT_SERVER: sc://127.0.0.1:15002
run: |
# Python Unit Tests (Spark Connect)
echo "::group::pip install"
# .dev1 allows this to work with preview versions
.pytest-venv/bin/pip install "pyspark[connect]~=${{ inputs.spark-compat-version }}.0.dev1"
echo "::endgroup::"
.pytest-venv/bin/python -m pytest python/test --junit-xml test-results-connect/pytest-$(date +%s.%N)-$RANDOM.xml
shell: bash
- name: Stop Spark Connect
if: always() && steps.spark-connect.outcome == 'success'
env:
SPARK_HOME: ${{ env.SPARK_BIN_HOME }}
run: |
# Stop Spark Connect
$SPARK_HOME/sbin/stop-connect-server.sh
for log in $SPARK_HOME/logs/spark-*-org.apache.spark.sql.connect.service.SparkConnectServer-*.out; do
echo "::group::Spark Connect server log: $log"
eoc="EOC-$RANDOM"
echo "::stop-commands::$eoc"
cat "$log" || true
echo "::$eoc::"
echo "::endgroup::"
done
shell: bash
- name: Upload Test Results
if: always()
uses: actions/upload-artifact@v4
with:
name: Python Test Results (Spark ${{ inputs.spark-version }} Scala ${{ inputs.scala-version }} Python ${{ inputs.python-version }})
path: |
test-results/*.xml
test-results-connect/*.xml
branding:
icon: 'check-circle'
color: 'green'
================================================
FILE: .github/actions/test-release/action.yml
================================================
name: 'Test Release'
author: 'EnricoMi'
description: 'A GitHub Action that tests spark-extension release'
# pyspark is not available for snapshots or scala other than 2.12
# we would have to compile spark from sources for this, not worth it
# so this action only works with scala 2.12 and non-snapshot spark versions
inputs:
spark-version:
description: Spark version, e.g. 3.4.0 or 4.0.0-preview1
required: true
scala-version:
description: Scala version, e.g. 2.12.15
required: true
spark-compat-version:
description: Spark compatibility version, e.g. 3.4
required: true
spark-archive-url:
description: The URL to download the Spark binary distribution
required: false
scala-compat-version:
description: Scala compatibility version, e.g. 2.12
required: true
java-compat-version:
description: Java compatibility version, e.g. 8
required: true
hadoop-version:
description: Hadoop version, e.g. 2.7 or 2
required: true
python-version:
description: Python version, e.g. 3.8
default: ''
required: false
runs:
using: 'composite'
steps:
- name: Fetch Binaries Artifact
uses: actions/download-artifact@v4
with:
name: Binaries-${{ inputs.spark-compat-version }}-${{ inputs.scala-compat-version }}
path: .
- name: Set versions in pom.xml
run: |
./set-version.sh ${{ inputs.spark-version }} ${{ inputs.scala-version }}
git diff
SPARK_EXTENSION_VERSION=$(grep --max-count=1 "<version>.*</version>" pom.xml | sed -E -e "s/\s*<[^>]+>//g")
echo "SPARK_EXTENSION_VERSION=$SPARK_EXTENSION_VERSION" | tee -a "$GITHUB_ENV"
shell: bash
- name: Restore Spark Binaries cache
if: github.event_name != 'schedule'
uses: actions/cache/restore@v4
with:
path: ~/spark
key: ${{ runner.os }}-spark-binaries-${{ inputs.spark-version }}-${{ inputs.scala-compat-version }}
restore-keys: |
${{ runner.os }}-spark-binaries-${{ inputs.spark-version }}-${{ inputs.scala-compat-version }}
- name: Setup Spark Binaries
env:
SPARK_PACKAGE: spark-${{ inputs.spark-version }}/spark-${{ inputs.spark-version }}-bin-hadoop${{ inputs.hadoop-version }}${{ startsWith(inputs.spark-version, '3.') && inputs.scala-compat-version == '2.13' && '-scala2.13' || '' }}.tgz
run: |
# Setup Spark Binaries
if [[ ! -e ~/spark ]]
then
url="${{ inputs.spark-archive-url }}"
wget --progress=dot:giga "${url:-https://www.apache.org/dyn/closer.lua/spark/${SPARK_PACKAGE}?action=download}" -O - | tar -xzC "${{ runner.temp }}"
archive=$(basename "${SPARK_PACKAGE}") bash -c "mv -v "${{ runner.temp }}/\${archive/%.tgz/}" ~/spark"
fi
echo "SPARK_BIN_HOME=$(cd ~/spark; pwd)" >> $GITHUB_ENV
shell: bash
- name: Restore Maven packages cache
if: github.event_name != 'schedule'
uses: actions/cache/restore@v4
with:
path: ~/.m2/repository
key: ${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-${{ hashFiles('pom.xml') }}
restore-keys: |
${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-${{ hashFiles('pom.xml') }}
${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-
- name: Setup JDK ${{ inputs.java-compat-version }}
uses: actions/setup-java@v4
with:
java-version: ${{ inputs.java-compat-version }}
distribution: 'zulu'
- name: Diff App test
env:
SPARK_HOME: ${{ env.SPARK_BIN_HOME }}
run: |
# Diff App test
echo "::group::spark-submit"
$SPARK_HOME/bin/spark-submit --packages com.github.scopt:scopt_${{ inputs.scala-compat-version }}:4.1.0 target/spark-extension_*.jar --format parquet --id id src/test/files/test.parquet/file1.parquet src/test/files/test.parquet/file2.parquet diff.parquet
echo
echo "::endgroup::"
echo "::group::spark-shell"
$SPARK_HOME/bin/spark-shell <<< 'val df = spark.read.parquet("diff.parquet").orderBy($"id").groupBy($"diff").count; df.show; if (df.count != 2) sys.exit(1)'
echo
echo "::endgroup::"
shell: bash
- name: Install Spark Extension
run: |
# Install Spark Extension
echo "::group::mvn install"
mvn --batch-mode --update-snapshots install -Dspotless.check.skip -DskipTests -Dmaven.test.skip=true -Dgpg.skip
echo "::endgroup::"
shell: bash
- name: Fetch Release Test Dependencies
run: |
# Fetch Release Test Dependencies
echo "::group::mvn dependency:get"
mvn dependency:get -Dtransitive=false -Dartifact=org.apache.parquet:parquet-hadoop:1.16.0:jar:tests
echo "::endgroup::"
shell: bash
- name: Scala Release Test
env:
SPARK_HOME: ${{ env.SPARK_BIN_HOME }}
run: |
# Scala Release Test
echo "::group::spark-shell"
$SPARK_BIN_HOME/bin/spark-shell --packages uk.co.gresearch.spark:spark-extension_${{ inputs.scala-compat-version }}:$SPARK_EXTENSION_VERSION --jars ~/.m2/repository/org/apache/parquet/parquet-hadoop/1.16.0/parquet-hadoop-1.16.0-tests.jar < test-release.scala
echo
echo "::endgroup::"
shell: bash
- name: Setup Python
uses: actions/setup-python@v5
if: inputs.python-version != ''
with:
python-version: ${{ inputs.python-version }}
- name: Python Release Test
if: inputs.python-version != ''
env:
SPARK_HOME: ${{ env.SPARK_BIN_HOME }}
run: |
# Python Release Test
echo "::group::spark-submit"
$SPARK_BIN_HOME/bin/spark-submit --packages uk.co.gresearch.spark:spark-extension_${{ inputs.scala-compat-version }}:$SPARK_EXTENSION_VERSION test-release.py
echo
echo "::endgroup::"
shell: bash
- name: Fetch Whl Artifact
if: inputs.python-version != ''
uses: actions/download-artifact@v4
with:
name: Whl (Spark ${{ inputs.spark-compat-version }} Scala ${{ inputs.scala-compat-version }})
path: .
- name: Install Python dependencies
if: inputs.python-version != ''
run: |
# Install Python dependencies
echo "::group::pip install"
python -m venv .pytest-venv
.pytest-venv/bin/python -m pip install --upgrade pip
.pytest-venv/bin/pip install pypandoc
.pytest-venv/bin/pip install $(ls pyspark_extension-*.whl)[test]
echo "::endgroup::"
PYSPARK_HOME=$(.pytest-venv/bin/python -c "import os; import pyspark; print(os.path.dirname(pyspark.__file__))")
PYSPARK_BIN_HOME="$(cd ".pytest-venv/"; pwd)"
PYSPARK_PYTHON="$PYSPARK_BIN_HOME/bin/python"
echo "PYSPARK_HOME=$PYSPARK_HOME" | tee -a "$GITHUB_ENV"
echo "PYSPARK_BIN_HOME=$PYSPARK_BIN_HOME" | tee -a "$GITHUB_ENV"
echo "PYSPARK_PYTHON=$PYSPARK_PYTHON" | tee -a "$GITHUB_ENV"
shell: bash
- name: PySpark Release Test
if: inputs.python-version != ''
run: |
.pytest-venv/bin/python3 test-release.py
shell: bash
- name: Python Integration Tests
if: inputs.python-version != ''
env:
SPARK_HOME: ${{ env.PYSPARK_HOME }}
PYTHONPATH: python:python/test
run: |
# Python Integration Tests
source .pytest-venv/bin/activate
find python/test -name 'test*.py' > tests
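# run each test module via spark-submit; record failures but keep running, fail the step at the end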
while read test
do
echo "::group::spark-submit $test"
if ! $PYSPARK_BIN_HOME/bin/spark-submit --master "local[2]" --packages uk.co.gresearch.spark:spark-extension_${{ inputs.scala-compat-version }}:$SPARK_EXTENSION_VERSION "$test" test-results-submit
then
state="fail"
fi
echo
echo "::endgroup::"
done < tests
if [[ "$state" == "fail" ]]; then exit 1; fi
shell: bash
- name: Upload Test Results
if: always() && inputs.python-version != ''
uses: actions/upload-artifact@v4
with:
name: Python Release Test Results (Spark ${{ inputs.spark-version }} Scala ${{ inputs.scala-version }} Python ${{ inputs.python-version }})
path: |
test-results-submit/*.xml
branding:
icon: 'check-circle'
color: 'green'
================================================
FILE: .github/dependabot.yml
================================================
version: 2
updates:
- package-ecosystem: "github-actions"
directory: "/"
schedule:
interval: "monthly"
- package-ecosystem: "maven"
directory: "/"
schedule:
interval: "daily"
================================================
FILE: .github/show-spark-versions.sh
================================================
#!/bin/bash
base=$(cd "$(dirname "$0")"; pwd)
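# derive full spark-version values (compat plus patch version, with -SNAPSHOT where flagged) from the matrix in prime-caches.yml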
grep -- "-version" "$base"/workflows/prime-caches.yml | sed -e "s/ -//g" -e "s/ //g" -e "s/'//g" | grep -v -e "matrix" -e "]" | while read line
do
IFS=":" read var compat_version <<< "$line"
if [[ "$var" == "spark-compat-version" ]]
then
while read line
do
IFS=":" read var patch_version <<< "$line"
if [[ "$var" == "spark-patch-version" ]]
then
echo -n "spark-version: $compat_version.$patch_version"
read line
if [[ "$line" == "spark-snapshot-version:true" ]]
then
echo "-SNAPSHOT"
else
echo
fi
break
fi
done
fi
done > "$base"/workflows/prime-caches.yml.tmp
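# combine the derived versions with spark-version entries from all workflow files and print the unique, sorted set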
grep spark-version "$base"/workflows/*.yml "$base"/workflows/prime-caches.yml.tmp | cut -d : -f 2- | sed -e "s/^[ -]*//" -e "s/'//g" -e 's/{"params": {"//g' -e 's/params: {//g' -e 's/"//g' -e "s/,.*//" | grep "^spark-version" | grep -v "matrix" | sort | uniq
================================================
FILE: .github/workflows/build-jvm.yml
================================================
name: Build JVM
on:
workflow_call:
jobs:
build:
name: Build (Spark ${{ matrix.spark-version }} Scala ${{ matrix.scala-version }})
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
include:
- spark-version: '3.2.4'
spark-compat-version: '3.2'
scala-compat-version: '2.12'
scala-version: '2.12.15'
java-compat-version: '8'
hadoop-version: '2.7'
- spark-version: '3.3.4'
spark-compat-version: '3.3'
scala-compat-version: '2.12'
scala-version: '2.12.15'
java-compat-version: '8'
hadoop-version: '3'
- spark-version: '3.4.4'
spark-compat-version: '3.4'
scala-compat-version: '2.12'
scala-version: '2.12.17'
java-compat-version: '8'
hadoop-version: '3'
- spark-version: '3.5.8'
spark-compat-version: '3.5'
scala-compat-version: '2.12'
scala-version: '2.12.18'
java-compat-version: '8'
hadoop-version: '3'
- spark-version: '3.2.4'
spark-compat-version: '3.2'
scala-compat-version: '2.13'
scala-version: '2.13.5'
java-compat-version: '8'
hadoop-version: '3.2'
- spark-version: '3.3.4'
spark-compat-version: '3.3'
scala-compat-version: '2.13'
scala-version: '2.13.8'
java-compat-version: '8'
hadoop-version: '3'
- spark-version: '3.4.4'
spark-compat-version: '3.4'
scala-compat-version: '2.13'
scala-version: '2.13.8'
java-compat-version: '8'
hadoop-version: '3'
- spark-version: '3.5.8'
spark-compat-version: '3.5'
scala-compat-version: '2.13'
scala-version: '2.13.8'
java-compat-version: '8'
hadoop-version: '3'
- spark-version: '4.0.2'
spark-compat-version: '4.0'
scala-compat-version: '2.13'
scala-version: '2.13.16'
java-compat-version: '17'
hadoop-version: '3'
- spark-version: '4.1.1'
spark-compat-version: '4.1'
scala-compat-version: '2.13'
scala-version: '2.13.17'
java-compat-version: '17'
hadoop-version: '3'
- spark-version: '4.2.0-preview3'
spark-compat-version: '4.2'
scala-compat-version: '2.13'
scala-version: '2.13.18'
java-compat-version: '17'
hadoop-version: '3'
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Build
uses: ./.github/actions/build
with:
spark-version: ${{ matrix.spark-version }}
scala-version: ${{ matrix.scala-version }}
spark-compat-version: ${{ matrix.spark-compat-version }}
scala-compat-version: ${{ matrix.scala-compat-version }}
java-compat-version: ${{ matrix.java-compat-version }}
hadoop-version: ${{ matrix.hadoop-version }}
================================================
FILE: .github/workflows/build-python.yml
================================================
name: Build Python
on:
workflow_call:
jobs:
# pyspark<4 is not available for snapshots or scala other than 2.12
whl:
name: Build whl (Spark ${{ matrix.spark-version }} Scala ${{ matrix.scala-version }})
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
include:
- spark-compat-version: '3.2'
spark-version: '3.2.4'
scala-compat-version: '2.12'
scala-version: '2.12.15'
java-compat-version: '8'
python-version: '3.9'
- spark-compat-version: '3.3'
spark-version: '3.3.4'
scala-compat-version: '2.12'
scala-version: '2.12.15'
java-compat-version: '8'
python-version: '3.9'
- spark-compat-version: '3.4'
spark-version: '3.4.4'
scala-compat-version: '2.12'
scala-version: '2.12.17'
java-compat-version: '8'
python-version: '3.9'
- spark-compat-version: '3.5'
spark-version: '3.5.8'
scala-compat-version: '2.12'
scala-version: '2.12.18'
java-compat-version: '8'
python-version: '3.9'
- spark-compat-version: '4.0'
spark-version: '4.0.2'
scala-compat-version: '2.13'
scala-version: '2.13.16'
java-compat-version: '17'
python-version: '3.9'
- spark-version: '4.1.1'
spark-compat-version: '4.1'
scala-compat-version: '2.13'
scala-version: '2.13.17'
java-compat-version: '17'
hadoop-version: '3'
python-version: '3.10'
- spark-version: '4.2.0-preview3'
spark-compat-version: '4.2'
scala-compat-version: '2.13'
scala-version: '2.13.18'
java-compat-version: '17'
hadoop-version: '3'
python-version: '3.10'
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Build
uses: ./.github/actions/build-whl
with:
spark-version: ${{ matrix.spark-version }}
scala-version: ${{ matrix.scala-version }}
spark-compat-version: ${{ matrix.spark-compat-version }}
scala-compat-version: ${{ matrix.scala-compat-version }}
java-compat-version: ${{ matrix.java-compat-version }}
python-version: ${{ matrix.python-version }}
================================================
FILE: .github/workflows/build-snapshots.yml
================================================
name: Build Snapshots
on:
workflow_call:
jobs:
build:
name: Build (Spark ${{ matrix.spark-version }} Scala ${{ matrix.scala-version }})
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
include:
- spark-compat-version: '3.2'
spark-version: '3.2.5-SNAPSHOT'
scala-compat-version: '2.12'
scala-version: '2.12.15'
java-compat-version: '8'
- spark-compat-version: '3.3'
spark-version: '3.3.5-SNAPSHOT'
scala-compat-version: '2.12'
scala-version: '2.12.15'
java-compat-version: '8'
- spark-compat-version: '3.4'
spark-version: '3.4.5-SNAPSHOT'
scala-compat-version: '2.12'
scala-version: '2.12.17'
java-compat-version: '8'
- spark-compat-version: '3.5'
spark-version: '3.5.9-SNAPSHOT'
scala-compat-version: '2.12'
scala-version: '2.12.18'
java-compat-version: '8'
- spark-compat-version: '3.2'
spark-version: '3.2.5-SNAPSHOT'
scala-compat-version: '2.13'
scala-version: '2.13.5'
java-compat-version: '8'
- spark-compat-version: '3.3'
spark-version: '3.3.5-SNAPSHOT'
scala-compat-version: '2.13'
scala-version: '2.13.8'
java-compat-version: '8'
- spark-compat-version: '3.4'
spark-version: '3.4.5-SNAPSHOT'
scala-compat-version: '2.13'
scala-version: '2.13.8'
java-compat-version: '8'
- spark-compat-version: '3.5'
spark-version: '3.5.9-SNAPSHOT'
scala-compat-version: '2.13'
scala-version: '2.13.8'
java-compat-version: '8'
- spark-compat-version: '4.0'
spark-version: '4.0.3-SNAPSHOT'
scala-compat-version: '2.13'
scala-version: '2.13.16'
java-compat-version: '17'
- spark-compat-version: '4.1'
spark-version: '4.1.2-SNAPSHOT'
scala-compat-version: '2.13'
scala-version: '2.13.17'
java-compat-version: '17'
- spark-compat-version: '4.2'
spark-version: '4.2.0-SNAPSHOT'
scala-compat-version: '2.13'
scala-version: '2.13.18'
java-compat-version: '17'
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Build
uses: ./.github/actions/build
with:
spark-version: ${{ matrix.spark-version }}
scala-version: ${{ matrix.scala-version }}
spark-compat-version: ${{ matrix.spark-compat-version }}-SNAPSHOT
scala-compat-version: ${{ matrix.scala-compat-version }}
java-compat-version: ${{ matrix.java-compat-version }}
================================================
FILE: .github/workflows/check.yml
================================================
name: Check
on:
workflow_call:
jobs:
lint:
name: Scala lint
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Setup JDK 11
uses: actions/setup-java@v4
with:
java-version: '11'
distribution: 'zulu'
- name: Check
id: check
run: |
mvn --batch-mode --update-snapshots spotless:check
shell: bash
- name: Changes
if: failure() && steps.check.outcome == 'failure'
run: |
mvn --batch-mode --update-snapshots spotless:apply
git diff
shell: bash
config:
name: Configure compat
runs-on: ubuntu-latest
outputs:
major-version: ${{ steps.versions.outputs.major-version }}
release-version: ${{ steps.versions.outputs.release-version }}
release-major-version: ${{ steps.versions.outputs.release-major-version }}
steps:
- name: Checkout
uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Get versions
id: versions
run: |
version=$(grep -m1 version pom.xml | sed -e "s/<[^>]*>//g" -e "s/ //g")
echo "version: $version"
echo "major-version: ${version/.*/}"
echo "version=$version" >> "$GITHUB_OUTPUT"
echo "major-version=${version/.*/}" >> "$GITHUB_OUTPUT"
release_version=$(git tag | grep "^v" | sort --version-sort | tail -n1 | sed "s/^v//")
echo "release-version: $release_version"
echo "release-major-version: ${release_version/.*/}"
echo "release-version=$release_version" >> "$GITHUB_OUTPUT"
echo "release-major-version=${release_version/.*/}" >> "$GITHUB_OUTPUT"
shell: bash
compat:
name: Compat (Spark ${{ matrix.spark-compat-version }} Scala ${{ matrix.scala-compat-version }})
needs: config
runs-on: ubuntu-latest
if: needs.config.outputs.major-version == needs.config.outputs.release-major-version
strategy:
fail-fast: false
matrix:
include:
- spark-compat-version: '3.2'
spark-version: '3.2.4'
scala-compat-version: '2.12'
scala-version: '2.12.15'
- spark-compat-version: '3.3'
spark-version: '3.3.4'
scala-compat-version: '2.12'
scala-version: '2.12.15'
- spark-compat-version: '3.4'
scala-compat-version: '2.12'
scala-version: '2.12.17'
spark-version: '3.4.4'
- spark-compat-version: '3.5'
scala-compat-version: '2.12'
scala-version: '2.12.18'
spark-version: '3.5.8'
- spark-compat-version: '4.0'
scala-compat-version: '2.13'
scala-version: '2.13.16'
spark-version: '4.0.2'
- spark-compat-version: '4.1'
scala-compat-version: '2.13'
scala-version: '2.13.17'
spark-version: '4.1.1'
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Check
uses: ./.github/actions/check-compat
with:
spark-version: ${{ matrix.spark-version }}
scala-version: ${{ matrix.scala-version }}
spark-compat-version: ${{ matrix.spark-compat-version }}
scala-compat-version: ${{ matrix.scala-compat-version }}
package-version: ${{ needs.config.outputs.release-version }}
================================================
FILE: .github/workflows/ci.yml
================================================
name: CI
on:
schedule:
- cron: '0 8 */10 * *'
push:
branches:
- 'master'
tags:
- '*'
merge_group:
pull_request:
workflow_dispatch:
jobs:
event_file:
name: "Event File"
runs-on: ubuntu-latest
steps:
- name: Upload
uses: actions/upload-artifact@v4
with:
name: Event File
path: ${{ github.event_path }}
build-jvm:
name: "Build JVM"
uses: "./.github/workflows/build-jvm.yml"
build-snapshots:
name: "Build Snapshots"
uses: "./.github/workflows/build-snapshots.yml"
build-python:
name: "Build Python"
needs: build-jvm
uses: "./.github/workflows/build-python.yml"
test-jvm:
name: "Test JVM"
needs: build-jvm
uses: "./.github/workflows/test-jvm.yml"
test-python:
name: "Test Python"
needs: build-jvm
uses: "./.github/workflows/test-python.yml"
test-snapshots-jvm:
name: "Test Snapshots"
needs: build-snapshots
uses: "./.github/workflows/test-snapshots.yml"
test-release:
name: "Test Release"
needs: build-jvm
uses: "./.github/workflows/test-release.yml"
check:
name: "Check"
needs: build-jvm
uses: "./.github/workflows/check.yml"
# A single job that succeeds if all jobs listed under 'needs' succeed.
# This allows configuring a single job as a required check.
# The 'needed' jobs can then be changed through pull requests.
test_success:
name: "Test success"
if: always()
runs-on: ubuntu-latest
# the if clauses below have to reflect the number of jobs listed here
needs: [build-jvm, build-python, test-jvm, test-python, test-release]
env:
RESULTS: ${{ join(needs.*.result, ',') }}
steps:
- name: "Success"
# we expect all required jobs to have success result
if: env.RESULTS == 'success,success,success,success,success'
run: true
shell: bash
- name: "Failure"
# we expect all required jobs to have success result, fail otherwise
if: env.RESULTS != 'success,success,success,success,success'
run: false
shell: bash
================================================
FILE: .github/workflows/clear-caches.yaml
================================================
name: Clear caches
on:
workflow_dispatch:
permissions:
actions: write
jobs:
clear-cache:
runs-on: ubuntu-latest
steps:
- name: Clear caches
uses: actions/github-script@v7
with:
script: |
const caches = await github.paginate(
github.rest.actions.getActionsCacheList.endpoint.merge({
owner: context.repo.owner,
repo: context.repo.repo,
})
)
for (const cache of caches) {
console.log(cache)
github.rest.actions.deleteActionsCacheById({
owner: context.repo.owner,
repo: context.repo.repo,
cache_id: cache.id,
})
}
================================================
FILE: .github/workflows/prepare-release.yml
================================================
name: Prepare release
on:
workflow_dispatch:
inputs:
github_release_latest:
description: 'Make the created GitHub release the latest'
required: false
default: true
type: boolean
jobs:
get-version:
name: Get version
runs-on: ubuntu-latest
outputs:
release-tag: ${{ steps.versions.outputs.release-tag }}
is-snapshot: ${{ steps.versions.outputs.is-snapshot }}
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Get versions
id: versions
run: |
# get release version
version=$(grep --max-count=1 "<version>.*</version>" pom.xml | sed -E -e "s/\s*<[^>]+>//g" -e "s/-SNAPSHOT//" -e "s/-[0-9.]+//g")
is_snapshot=$(if grep -q "<version>.*-SNAPSHOT</version>" pom.xml; then echo "true"; else echo "false"; fi)
# share versions
echo "release-tag=v${version}" >> "$GITHUB_OUTPUT"
echo "is-snapshot=$is_snapshot" >> "$GITHUB_OUTPUT"
prepare-release:
name: Prepare release
runs-on: ubuntu-latest
if: ( ! github.event.repository.fork )
needs: get-version
# secrets are provided by environment
environment:
name: tagged
url: 'https://github.com/G-Research/spark-extension?version=${{ needs.get-version.outputs.release-tag }}'
steps:
- name: Create GitHub App token
uses: actions/create-github-app-token@v2
id: app-token
with:
app-id: ${{ vars.APP_ID }}
private-key: ${{ secrets.PRIVATE_KEY }}
# required to push to a branch
permission-contents: write
- name: Get GitHub App User ID
id: get-user-id
run: echo "user-id=$(gh api "/users/${{ steps.app-token.outputs.app-slug }}[bot]" --jq .id)" >> "$GITHUB_OUTPUT"
env:
GH_TOKEN: ${{ steps.app-token.outputs.token }}
- name: Checkout code
uses: actions/checkout@v4
with:
token: ${{ steps.app-token.outputs.token }}
fetch-depth: 0
- name: Check branch setup
run: |
# Check branch setup
if [[ "$GITHUB_REF" != "refs/heads/master" ]] && [[ "$GITHUB_REF" != "refs/heads/master-"* ]]
then
echo "This workflow must be run on master or master-* branch, not $GITHUB_REF"
exit 1
fi
- name: Tag and bump version
if: needs.get-version.outputs.is-snapshot
run: |
# check for unreleased entry in CHANGELOG.md
readarray -t changes < <(grep -A 100 "^## \[UNRELEASED\] - YYYY-MM-DD" CHANGELOG.md | grep -B 100 --max-count=1 -E "^## \[[0-9.]+\]" | grep "^-")
if [ ${#changes[@]} -eq 0 ]
then
echo "Did not find any changes in CHANGELOG.md under '## [UNRELEASED] - YYYY-MM-DD'"
exit 1
fi
# get latest and release version
latest=$(grep --max-count=1 "<version>.*</version>" README.md | sed -E -e "s/\s*<[^>]+>//g" -e "s/-[0-9.]+//g")
version=$(grep --max-count=1 "<version>.*</version>" pom.xml | sed -E -e "s/\s*<[^>]+>//g" -e "s/-SNAPSHOT//" -e "s/-[0-9.]+//g")
# update changelog
echo "Releasing ${#changes[@]} changes as version $version:"
for (( i=0; i<${#changes[@]}; i++ )); do echo "${changes[$i]}" ; done
sed -i "s/## \[UNRELEASED\] - YYYY-MM-DD/## [$version] - $(date +%Y-%m-%d)/" CHANGELOG.md
sed -i -e "s/$latest-/$version-/g" -e "s/$latest\./$version./g" README.md PYSPARK-DEPS.md python/README.md
./set-version.sh $version
# configure git so we can commit changes
git config --global user.name '${{ steps.app-token.outputs.app-slug }}[bot]'
git config --global user.email '${{ steps.get-user-id.outputs.user-id }}+${{ steps.app-token.outputs.app-slug }}[bot]@users.noreply.github.com'
# commit changes to local repo
echo "Committing release to local git"
git add pom.xml python/setup.py CHANGELOG.md README.md PYSPARK-DEPS.md python/README.md
git commit -m "Releasing $version"
git tag -a "v${version}" -m "Release v${version}"
# bump version
# define function to bump version
function next_version {
local version=$1
local branch=$2
patch=${version/*./}
majmin=${version%.${patch}}
if [[ $branch == "master" ]]
then
# minor version bump
if [[ $version != *".0" ]]
then
echo "version is patch version, should be M.m.0: $version" >&2
exit 1
fi
maj=${version/.*/}
min=${majmin#${maj}.}
next=${maj}.$((min+1)).0
echo "$next"
else
# patch version bump
next=${majmin}.$((patch+1))
echo "$next"
fi
}
# get next version
pkg_version="${version/-*/}"
branch=$(git rev-parse --abbrev-ref HEAD)
next_pkg_version="$(next_version "$pkg_version" "$branch")"
# bump the version
echo "Bump version to $next_pkg_version"
./set-version.sh $next_pkg_version-SNAPSHOT
# commit the version bump to local repo
echo "Committing version bump to local git"
git commit -a -m "Post-release version bump to $next_pkg_version"
# push all commits and tag to origin
echo "Pushing release commit and tag to origin"
git push origin "$GITHUB_REF_NAME" "v${version}" --tags
# NOTE: This push will not trigger a CI run as we are using GITHUB_TOKEN to push
# More info on: https://docs.github.com/en/actions/using-workflows/triggering-a-workflow#triggering-a-workflow-from-a-workflow
github-release:
name: Create GitHub release
runs-on: ubuntu-latest
needs:
- get-version
- prepare-release
permissions:
contents: write # required to create release
steps:
- name: Checkout release tag
uses: actions/checkout@v4
with:
ref: ${{ needs.get-version.outputs.release-tag }}
- name: Extract release notes
id: release-notes
run: |
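# print everything from the first '## ' heading up to (but not including) the next one, i.e. the notes of the latest release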
awk '/^## /{if(seen==1)exit; seen++} seen' CHANGELOG.md > ./release-notes.txt
# Grab release name
name=$(grep -m 1 "^## " CHANGELOG.md | sed "s/^## //")
echo "release_name=$name" >> $GITHUB_OUTPUT
# provide release notes file path as output
echo "release_notes_path=release-notes.txt" >> $GITHUB_OUTPUT
- name: Publish GitHub release
uses: ncipollo/release-action@2c591bcc8ecdcd2db72b97d6147f871fcd833ba5
id: github-release
with:
name: ${{ steps.release-notes.outputs.release_name }}
bodyFile: ${{ steps.release-notes.outputs.release_notes_path }}
makeLatest: ${{ inputs.github_release_latest }}
tag: ${{ needs.get-version.outputs.release-tag }}
token: ${{ github.token }}
================================================
FILE: .github/workflows/prime-caches.yml
================================================
name: Prime caches
on:
workflow_dispatch:
jobs:
prime:
name: Spark ${{ matrix.spark-compat-version }}.${{ matrix.spark-patch-version }}${{ matrix.spark-snapshot-version && '-SNAPSHOT' }} Scala ${{ matrix.scala-version }}
runs-on: ubuntu-latest
strategy:
fail-fast: false
# keep in-sync with .github/workflows/test-jvm.yml
matrix:
include:
- spark-compat-version: '3.2'
scala-compat-version: '2.12'
scala-version: '2.12.15'
spark-patch-version: '4'
hadoop-version: '2.7'
- spark-compat-version: '3.3'
scala-compat-version: '2.12'
scala-version: '2.12.15'
spark-patch-version: '4'
hadoop-version: '3'
- spark-compat-version: '3.4'
scala-compat-version: '2.12'
scala-version: '2.12.17'
spark-patch-version: '4'
hadoop-version: '3'
- spark-compat-version: '3.5'
scala-compat-version: '2.12'
scala-version: '2.12.18'
spark-patch-version: '8'
hadoop-version: '3'
- spark-compat-version: '3.2'
scala-compat-version: '2.13'
scala-version: '2.13.5'
spark-patch-version: '4'
hadoop-version: '3.2'
- spark-compat-version: '3.3'
scala-compat-version: '2.13'
scala-version: '2.13.8'
spark-patch-version: '4'
hadoop-version: '3'
- spark-compat-version: '3.4'
scala-compat-version: '2.13'
scala-version: '2.13.8'
spark-patch-version: '4'
hadoop-version: '3'
- spark-compat-version: '3.5'
scala-compat-version: '2.13'
scala-version: '2.13.8'
spark-patch-version: '8'
hadoop-version: '3'
- spark-compat-version: '4.0'
scala-compat-version: '2.13'
scala-version: '2.13.16'
spark-patch-version: '2'
java-compat-version: '17'
hadoop-version: '3'
- spark-compat-version: '4.1'
scala-compat-version: '2.13'
scala-version: '2.13.17'
spark-patch-version: '1'
java-compat-version: '17'
hadoop-version: '3'
- spark-compat-version: '4.2'
scala-compat-version: '2.13'
scala-version: '2.13.18'
spark-patch-version: '0-preview3'
java-compat-version: '17'
hadoop-version: '3'
- spark-compat-version: '3.2'
scala-compat-version: '2.12'
scala-version: '2.12.15'
spark-patch-version: '5'
spark-snapshot-version: true
hadoop-version: '2.7'
- spark-compat-version: '3.3'
scala-compat-version: '2.12'
scala-version: '2.12.15'
spark-patch-version: '5'
spark-snapshot-version: true
hadoop-version: '3'
- spark-compat-version: '3.4'
scala-compat-version: '2.12'
scala-version: '2.12.17'
spark-patch-version: '5'
spark-snapshot-version: true
hadoop-version: '3'
- spark-compat-version: '3.5'
scala-compat-version: '2.12'
scala-version: '2.12.18'
spark-patch-version: '9'
spark-snapshot-version: true
hadoop-version: '3'
- spark-compat-version: '3.2'
scala-compat-version: '2.13'
scala-version: '2.13.5'
spark-patch-version: '5'
spark-snapshot-version: true
hadoop-version: '3.2'
- spark-compat-version: '3.3'
scala-compat-version: '2.13'
scala-version: '2.13.8'
spark-patch-version: '5'
spark-snapshot-version: true
hadoop-version: '3'
- spark-compat-version: '3.4'
scala-compat-version: '2.13'
scala-version: '2.13.8'
spark-patch-version: '5'
spark-snapshot-version: true
hadoop-version: '3'
- spark-compat-version: '3.5'
scala-compat-version: '2.13'
scala-version: '2.13.8'
spark-patch-version: '9'
spark-snapshot-version: true
hadoop-version: '3'
- spark-compat-version: '4.0'
scala-compat-version: '2.13'
scala-version: '2.13.16'
spark-patch-version: '3'
spark-snapshot-version: true
hadoop-version: '3'
- spark-compat-version: '4.1'
scala-compat-version: '2.13'
scala-version: '2.13.17'
spark-patch-version: '2'
spark-snapshot-version: true
hadoop-version: '3'
- spark-compat-version: '4.2'
scala-compat-version: '2.13'
scala-version: '2.13.18'
spark-patch-version: '0'
spark-snapshot-version: true
hadoop-version: '3'
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Prime caches
uses: ./.github/actions/prime-caches
with:
spark-version: ${{ matrix.spark-compat-version }}.${{ matrix.spark-patch-version }}${{ matrix.spark-snapshot-version && '-SNAPSHOT' }}
scala-version: ${{ matrix.scala-version }}
spark-compat-version: ${{ matrix.spark-compat-version }}
scala-compat-version: ${{ matrix.scala-compat-version }}
hadoop-version: ${{ matrix.hadoop-version }}
java-compat-version: ${{ matrix.java-compat-version || '8' }}
================================================
FILE: .github/workflows/publish-release.yml
================================================
name: Publish release
on:
workflow_dispatch:
inputs:
versions:
required: true
type: string
description: 'Example: {"include": [{"params": {"spark-version": "4.0.0","scala-version": "2.13.16"}}]}'
default: |
{
"include": [
{"params": {"spark-version": "3.2.4", "scala-version": "2.12.15", "java-compat-version": "8"}},
{"params": {"spark-version": "3.3.4", "scala-version": "2.12.15", "java-compat-version": "8"}},
{"params": {"spark-version": "3.4.4", "scala-version": "2.12.17", "java-compat-version": "8"}},
{"params": {"spark-version": "3.5.8", "scala-version": "2.12.18", "java-compat-version": "8"}},
{"params": {"spark-version": "3.2.4", "scala-version": "2.13.5", "java-compat-version": "8"}},
{"params": {"spark-version": "3.3.4", "scala-version": "2.13.8", "java-compat-version": "8"}},
{"params": {"spark-version": "3.4.4", "scala-version": "2.13.8", "java-compat-version": "8"}},
{"params": {"spark-version": "3.5.8", "scala-version": "2.13.8", "java-compat-version": "8"}},
{"params": {"spark-version": "4.0.2", "scala-version": "2.13.16", "java-compat-version": "17"}},
{"params": {"spark-version": "4.1.1", "scala-version": "2.13.17", "java-compat-version": "17"}}
]
}
env:
# PySpark 3 versions only work with Python 3.9
PYTHON_VERSION: "3.9"
jobs:
get-version:
name: Get version
runs-on: ubuntu-latest
outputs:
release-tag: ${{ steps.versions.outputs.release-tag }}
is-snapshot: ${{ steps.versions.outputs.is-snapshot }}
steps:
- name: Checkout release tag
uses: actions/checkout@v4
- name: Get versions
id: versions
run: |
# get release version
version=$(grep --max-count=1 "<version>.*</version>" pom.xml | sed -E -e "s/\s*<[^>]+>//g" -e "s/-SNAPSHOT//" -e "s/-[0-9.]+//g")
is_snapshot=$(if grep -q "<version>.*-SNAPSHOT</version>" pom.xml; then echo "true"; else echo "false"; fi)
# share versions
echo "release-tag=v${version}" >> "$GITHUB_OUTPUT"
echo "is-snapshot=$is_snapshot" >> "$GITHUB_OUTPUT"
- name: Check tag setup
run: |
# Check tag setup
if [[ "$GITHUB_REF" != "refs/tags/v"* ]]
then
echo "This workflow must be run on a tag, not $GITHUB_REF"
exit 1
fi
if [ "${{ steps.versions.outputs.is-snapshot }}" == "true" ]
then
echo "This is a tagged SNAPSHOT version. This is not allowed for release!"
exit 1
fi
if [ "${{ github.ref_name }}" != "${{ steps.versions.outputs.release-tag }}" ]
then
echo "The version in the pom.xml is ${{ steps.versions.outputs.release-tag }}"
echo "This tag is ${{ github.ref_name }}, which is different!"
exit 1
fi
- name: Show matrix
run: |
echo '${{ github.event.inputs.versions }}' | jq .
maven-release:
name: Publish maven release (Spark ${{ matrix.params.spark-version }}, Scala ${{ matrix.params.scala-version }})
runs-on: ubuntu-latest
needs: get-version
if: ( ! github.event.repository.fork )
# secrets are provided by environment
environment:
name: release
# a different URL for each point in the matrix, but the same URLs across commits
url: 'https://github.com/G-Research/spark-extension?version=${{ needs.get-version.outputs.release-tag }}&spark=${{ matrix.params.spark-version }}&scala=${{ matrix.params.scala-version }}&package=maven'
permissions: {}
strategy:
fail-fast: false
matrix: ${{ fromJson(github.event.inputs.versions) }}
steps:
- name: Checkout release tag
uses: actions/checkout@v4
- name: Set up JDK and publish to Maven Central
uses: actions/setup-java@3a4f6e1af504cf6a31855fa899c6aa5355ba6c12 # v4.7.0
with:
java-version: ${{ matrix.params.java-compat-version }}
distribution: 'corretto'
server-id: central
server-username: MAVEN_USERNAME
server-password: MAVEN_PASSWORD
gpg-private-key: ${{ secrets.MAVEN_GPG_PRIVATE_KEY }}
gpg-passphrase: MAVEN_GPG_PASSPHRASE
- name: Inspect GPG
run: gpg -k
- name: Restore Maven packages cache
id: cache-maven
uses: actions/cache/restore@v4
with:
path: ~/.m2/repository
key: ${{ runner.os }}-mvn-build-${{ matrix.params.spark-version }}-${{ matrix.params.scala-version }}-${{ hashFiles('pom.xml') }}
restore-keys: |
${{ runner.os }}-mvn-build-${{ matrix.params.spark-version }}-${{ matrix.params.scala-version }}-${{ hashFiles('pom.xml') }}
${{ runner.os }}-mvn-build-${{ matrix.params.spark-version }}-${{ matrix.params.scala-version }}-
- name: Publish maven artifacts
id: publish-maven
run: |
./set-version.sh ${{ matrix.params.spark-version }} ${{ matrix.params.scala-version }}
mvn clean deploy -Dsign -Dspotless.check.skip -DskipTests -Dmaven.test.skip=true
env:
MAVEN_USERNAME: ${{ secrets.MAVEN_USERNAME }}
MAVEN_PASSWORD: ${{ secrets.MAVEN_PASSWORD }}
MAVEN_GPG_PASSPHRASE: ${{ secrets.MAVEN_GPG_PASSPHRASE}}
pypi-release:
name: Publish PyPi release (Spark ${{ matrix.params.spark-version }}, Scala ${{ matrix.params.scala-version }})
runs-on: ubuntu-latest
needs: get-version
if: ( ! github.event.repository.fork )
# secrets are provided by environment
environment:
name: release
# a different URL for each point in the matrix, but the same URLs across commits
url: 'https://github.com/G-Research/spark-extension?version=${{ needs.get-version.outputs.release-tag }}&spark=${{ matrix.params.spark-version }}&scala=${{ matrix.params.scala-version }}&package=pypi'
permissions:
id-token: write # required for PyPI publish
strategy:
fail-fast: false
matrix: ${{ fromJson(github.event.inputs.versions) }}
steps:
- name: Checkout release tag
uses: actions/checkout@v4
- name: Set up JDK
uses: actions/setup-java@3a4f6e1af504cf6a31855fa899c6aa5355ba6c12 # v4.7.0
with:
java-version: ${{ matrix.params.java-compat-version }}
distribution: 'corretto'
- uses: actions/setup-python@v5
with:
python-version: ${{ env.PYTHON_VERSION }}
- name: Restore Maven packages cache
id: cache-maven
uses: actions/cache/restore@v4
with:
path: ~/.m2/repository
key: ${{ runner.os }}-mvn-build-${{ matrix.params.spark-version }}-${{ matrix.params.scala-version }}-${{ hashFiles('pom.xml') }}
restore-keys: |
${{ runner.os }}-mvn-build-${{ matrix.params.spark-version }}-${{ matrix.params.scala-version }}-${{ hashFiles('pom.xml') }}
${{ runner.os }}-mvn-build-${{ matrix.params.spark-version }}-${{ matrix.params.scala-version }}-
- name: Build maven artifacts
id: maven
if: startsWith(matrix.params.spark-version, '3.') && startsWith(matrix.params.scala-version, '2.12.') || startsWith(matrix.params.spark-version, '4.') && startsWith(matrix.params.scala-version, '2.13.')
run: |
./set-version.sh ${{ matrix.params.spark-version }} ${{ matrix.params.scala-version }}
mvn clean package -Dspotless.check.skip -DskipTests -Dmaven.test.skip=true
- name: Prepare PyPi package
id: prepare-pypi-package
if: steps.maven.outcome == 'success'
run: |
./build-whl.sh
- name: Publish package distributions to PyPI
uses: pypa/gh-action-pypi-publish@release/v1
if: steps.prepare-pypi-package.outcome == 'success'
with:
packages-dir: python/dist
skip-existing: true
verbose: true
================================================
FILE: .github/workflows/publish-snapshot.yml
================================================
name: Publish snapshot
on:
workflow_dispatch:
push:
branches: ["master"]
env:
PYTHON_VERSION: "3.10"
jobs:
check-version:
name: Check SNAPSHOT version
if: ( ! github.event.repository.fork )
runs-on: ubuntu-latest
permissions: {}
outputs:
is-snapshot: ${{ steps.check.outputs.is-snapshot }}
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Check if this is a SNAPSHOT version
id: check
run: |
# check is snapshot version
if grep -q "<version>.*-SNAPSHOT</version>" pom.xml
then
echo "Version in pom IS a SNAPSHOT version"
echo "is-snapshot=true" >> "$GITHUB_OUTPUT"
else
echo "Version in pom is NOT a SNAPSHOT version"
echo "is-snapshot=false" >> "$GITHUB_OUTPUT"
fi
snapshot:
name: Snapshot Spark ${{ matrix.params.spark-version }} Scala ${{ matrix.params.scala-version }}
needs: check-version
# when we release from master, this workflow will see a commit that does not have a SNAPSHOT version
# we want this workflow to skip over that commit
if: needs.check-version.outputs.is-snapshot == 'true'
runs-on: ubuntu-latest
# secrets are provided by environment
environment:
name: snapshot
# a different URL for each point in the matrix, but the same URLs across commits
url: 'https://github.com/G-Research/spark-extension?spark=${{ matrix.params.spark-version }}&scala=${{ matrix.params.scala-version }}&snapshot'
permissions: {}
strategy:
fail-fast: false
matrix:
include:
- params: {"spark-version": "3.2.4", "scala-version": "2.12.15", "scala-compat-version": "2.12", "java-compat-version": "8"}
- params: {"spark-version": "3.3.4", "scala-version": "2.12.15", "scala-compat-version": "2.12", "java-compat-version": "8"}
- params: {"spark-version": "3.4.4", "scala-version": "2.12.17", "scala-compat-version": "2.12", "java-compat-version": "8"}
- params: {"spark-version": "3.5.8", "scala-version": "2.12.18", "scala-compat-version": "2.12", "java-compat-version": "8"}
- params: {"spark-version": "3.2.4", "scala-version": "2.13.5", "scala-compat-version": "2.13", "java-compat-version": "8"}
- params: {"spark-version": "3.3.4", "scala-version": "2.13.8", "scala-compat-version": "2.13", "java-compat-version": "8"}
- params: {"spark-version": "3.4.4", "scala-version": "2.13.8", "scala-compat-version": "2.13", "java-compat-version": "8"}
- params: {"spark-version": "3.5.8", "scala-version": "2.13.8", "scala-compat-version": "2.13", "java-compat-version": "8"}
- params: {"spark-version": "4.0.2", "scala-version": "2.13.16", "scala-compat-version": "2.13", "java-compat-version": "17"}
- params: {"spark-version": "4.1.1", "scala-version": "2.13.17", "scala-compat-version": "2.13", "java-compat-version": "17"}
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up JDK and publish to Maven Central
uses: actions/setup-java@3a4f6e1af504cf6a31855fa899c6aa5355ba6c12 # v4.7.0
with:
java-version: ${{ matrix.params.java-compat-version }}
distribution: 'corretto'
server-id: central
server-username: MAVEN_USERNAME
server-password: MAVEN_PASSWORD
gpg-private-key: ${{ secrets.MAVEN_GPG_PRIVATE_KEY }}
gpg-passphrase: MAVEN_GPG_PASSPHRASE
- name: Inspect GPG
run: gpg -k
- uses: actions/setup-python@v5
with:
python-version: ${{ env.PYTHON_VERSION }}
- name: Restore Maven packages cache
id: cache-maven
uses: actions/cache/restore@v4
with:
path: ~/.m2/repository
key: ${{ runner.os }}-mvn-build-${{ matrix.params.spark-version }}-${{ matrix.params.scala-version }}-${{ hashFiles('pom.xml') }}
restore-keys: |
${{ runner.os }}-mvn-build-${{ matrix.params.spark-version }}-${{ matrix.params.scala-version }}-${{ hashFiles('pom.xml') }}
${{ runner.os }}-mvn-build-${{ matrix.params.spark-version }}-${{ matrix.params.scala-version }}-
- name: Publish snapshot
run: |
./set-version.sh ${{ matrix.params.spark-version }} ${{ matrix.params.scala-version }}
mvn clean deploy -Dsign -Dspotless.check.skip -DskipTests -Dmaven.test.skip=true
env:
MAVEN_USERNAME: ${{ secrets.MAVEN_USERNAME }}
MAVEN_PASSWORD: ${{ secrets.MAVEN_PASSWORD }}
MAVEN_GPG_PASSPHRASE: ${{ secrets.MAVEN_GPG_PASSPHRASE}}
- name: Prepare PyPi package to test snapshot
if: startsWith(matrix.params.scala-version, '2.12.')
run: |
# Build whl
./build-whl.sh
- name: Restore Spark Binaries cache
uses: actions/cache/restore@v4
with:
path: ~/spark
key: ${{ runner.os }}-spark-binaries-${{ matrix.params.spark-version }}-${{ matrix.params.scala-compat-version }}
restore-keys: |
${{ runner.os }}-spark-binaries-${{ matrix.params.spark-version }}-${{ matrix.params.scala-compat-version }}
- name: Rename Spark Binaries cache
run: |
mv ~/spark ./spark-${{ matrix.params.spark-version }}-${{ matrix.params.scala-compat-version }}
- name: Test snapshot
id: test-package
run: |
# Test the snapshot (needs whl)
./test-release.sh
================================================
FILE: .github/workflows/test-jvm.yml
================================================
name: Test JVM
on:
workflow_call:
jobs:
test:
name: Test (Spark ${{ matrix.spark-compat-version }}.${{ matrix.spark-patch-version }} Scala ${{ matrix.scala-version }})
runs-on: ubuntu-latest
strategy:
fail-fast: false
# keep in-sync with .github/workflows/prime-caches.yml
matrix:
include:
- spark-compat-version: '3.2'
scala-compat-version: '2.12'
scala-version: '2.12.15'
spark-patch-version: '4'
java-compat-version: '8'
hadoop-version: '2.7'
- spark-compat-version: '3.3'
scala-compat-version: '2.12'
scala-version: '2.12.15'
spark-patch-version: '4'
java-compat-version: '8'
hadoop-version: '3'
- spark-compat-version: '3.4'
scala-compat-version: '2.12'
scala-version: '2.12.17'
spark-patch-version: '4'
java-compat-version: '8'
hadoop-version: '3'
- spark-compat-version: '3.5'
scala-compat-version: '2.12'
scala-version: '2.12.18'
spark-patch-version: '7'
java-compat-version: '8'
hadoop-version: '3'
- spark-compat-version: '3.2'
scala-compat-version: '2.13'
scala-version: '2.13.5'
spark-patch-version: '4'
java-compat-version: '8'
hadoop-version: '3.2'
- spark-compat-version: '3.3'
scala-compat-version: '2.13'
scala-version: '2.13.8'
spark-patch-version: '4'
java-compat-version: '8'
hadoop-version: '3'
- spark-compat-version: '3.4'
scala-compat-version: '2.13'
scala-version: '2.13.8'
spark-patch-version: '4'
java-compat-version: '8'
hadoop-version: '3'
- spark-compat-version: '3.5'
scala-compat-version: '2.13'
scala-version: '2.13.8'
spark-patch-version: '7'
java-compat-version: '8'
hadoop-version: '3'
- spark-compat-version: '4.0'
scala-compat-version: '2.13'
scala-version: '2.13.16'
spark-patch-version: '2'
java-compat-version: '17'
hadoop-version: '3'
- spark-compat-version: '4.1'
scala-compat-version: '2.13'
scala-version: '2.13.17'
spark-patch-version: '1'
java-compat-version: '17'
hadoop-version: '3'
- spark-compat-version: '4.2'
scala-compat-version: '2.13'
scala-version: '2.13.18'
spark-patch-version: '0-preview3'
java-compat-version: '17'
hadoop-version: '3'
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Test
uses: ./.github/actions/test-jvm
env:
CI_SLOW_TESTS: 1
with:
spark-version: ${{ matrix.spark-compat-version }}.${{ matrix.spark-patch-version }}
scala-version: ${{ matrix.scala-version }}
spark-compat-version: ${{ matrix.spark-compat-version }}
spark-archive-url: ${{ matrix.spark-archive-url }}
scala-compat-version: ${{ matrix.scala-compat-version }}
java-compat-version: ${{ matrix.java-compat-version }}
hadoop-version: ${{ matrix.hadoop-version }}
================================================
FILE: .github/workflows/test-python.yml
================================================
name: Test Python
on:
workflow_call:
jobs:
# pyspark is not available for snapshots or scala other than 2.12
# we would have to compile spark from sources for this, not worth it
test:
name: Test (Spark ${{ matrix.spark-version }} Scala ${{ matrix.scala-version }} Python ${{ matrix.python-version }})
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
spark-compat-version: ['3.2', '3.3', '3.4', '3.5', '4.0']
python-version: ['3.9', '3.10', '3.11', '3.12', '3.13']
include:
- spark-compat-version: '3.2'
spark-version: '3.2.4'
scala-compat-version: '2.12'
scala-version: '2.12.15'
java-compat-version: '8'
hadoop-version: '2.7'
- spark-compat-version: '3.3'
spark-version: '3.3.4'
scala-compat-version: '2.12'
scala-version: '2.12.15'
java-compat-version: '8'
hadoop-version: '3'
- spark-compat-version: '3.4'
spark-version: '3.4.4'
scala-compat-version: '2.12'
scala-version: '2.12.17'
java-compat-version: '8'
hadoop-version: '3'
- spark-compat-version: '3.5'
spark-version: '3.5.8'
scala-compat-version: '2.12'
scala-version: '2.12.18'
java-compat-version: '8'
hadoop-version: '3'
- spark-compat-version: '4.0'
spark-version: '4.0.2'
scala-compat-version: '2.13'
scala-version: '2.13.16'
java-compat-version: '17'
hadoop-version: '3'
- spark-compat-version: '4.1'
spark-version: '4.1.1'
scala-compat-version: '2.13'
scala-version: '2.13.17'
java-compat-version: '17'
hadoop-version: '3'
python-version: '3.10'
- spark-compat-version: '4.2'
spark-version: '4.2.0-preview3'
scala-compat-version: '2.13'
scala-version: '2.13.18'
java-compat-version: '17'
hadoop-version: '3'
python-version: '3.10'
exclude:
- spark-compat-version: '3.2'
python-version: '3.10'
- spark-compat-version: '3.2'
python-version: '3.11'
- spark-compat-version: '3.2'
python-version: '3.12'
- spark-compat-version: '3.2'
python-version: '3.13'
- spark-compat-version: '3.3'
python-version: '3.11'
- spark-compat-version: '3.3'
python-version: '3.12'
- spark-compat-version: '3.3'
python-version: '3.13'
- spark-compat-version: '3.4'
python-version: '3.12'
- spark-compat-version: '3.4'
python-version: '3.13'
- spark-compat-version: '3.5'
python-version: '3.12'
- spark-compat-version: '3.5'
python-version: '3.13'
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Test
uses: ./.github/actions/test-python
with:
spark-version: ${{ matrix.spark-version }}
scala-version: ${{ matrix.scala-version }}
spark-compat-version: ${{ matrix.spark-compat-version }}
spark-archive-url: ${{ matrix.spark-archive-url }}
spark-package-repo: ${{ matrix.spark-package-repo }}
scala-compat-version: ${{ matrix.scala-compat-version }}
java-compat-version: ${{ matrix.java-compat-version }}
hadoop-version: ${{ matrix.hadoop-version }}
python-version: ${{ matrix.python-version }}
================================================
FILE: .github/workflows/test-release.yml
================================================
name: Test release
on:
workflow_call:
jobs:
test:
name: Test Release Spark ${{ matrix.spark-version }} Scala ${{ matrix.scala-version }}
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
include:
- spark-compat-version: '3.2'
spark-version: '3.2.4'
scala-compat-version: '2.12'
scala-version: '2.12.15'
java-compat-version: '8'
hadoop-version: '2.7'
python-version: '3.9'
- spark-compat-version: '3.3'
spark-version: '3.3.4'
scala-compat-version: '2.12'
scala-version: '2.12.15'
java-compat-version: '8'
hadoop-version: '3'
python-version: '3.10'
- spark-compat-version: '3.4'
spark-version: '3.4.4'
scala-compat-version: '2.12'
scala-version: '2.12.17'
java-compat-version: '8'
hadoop-version: '3'
python-version: '3.11'
- spark-compat-version: '3.5'
spark-version: '3.5.8'
scala-compat-version: '2.12'
scala-version: '2.12.18'
java-compat-version: '8'
hadoop-version: '3'
python-version: '3.11'
- spark-compat-version: '3.2'
spark-version: '3.2.4'
scala-compat-version: '2.13'
scala-version: '2.13.5'
java-compat-version: '8'
hadoop-version: '3.2'
- spark-compat-version: '3.3'
spark-version: '3.3.4'
scala-compat-version: '2.13'
scala-version: '2.13.8'
java-compat-version: '8'
hadoop-version: '3'
- spark-compat-version: '3.4'
spark-version: '3.4.4'
scala-compat-version: '2.13'
scala-version: '2.13.8'
java-compat-version: '8'
hadoop-version: '3'
- spark-compat-version: '3.5'
spark-version: '3.5.8'
scala-compat-version: '2.13'
scala-version: '2.13.8'
java-compat-version: '8'
hadoop-version: '3'
- spark-compat-version: '4.0'
spark-version: '4.0.2'
scala-compat-version: '2.13'
scala-version: '2.13.16'
java-compat-version: '17'
hadoop-version: '3'
python-version: '3.13'
- spark-compat-version: '4.1'
spark-version: '4.1.1'
scala-compat-version: '2.13'
scala-version: '2.13.17'
java-compat-version: '17'
hadoop-version: '3'
python-version: '3.13'
- spark-compat-version: '4.2'
spark-version: '4.2.0-preview3'
scala-compat-version: '2.13'
scala-version: '2.13.18'
java-compat-version: '17'
hadoop-version: '3'
python-version: '3.13'
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Test
uses: ./.github/actions/test-release
with:
spark-version: ${{ matrix.spark-version }}
scala-version: ${{ matrix.scala-version }}
spark-compat-version: ${{ matrix.spark-compat-version }}
spark-archive-url: ${{ matrix.spark-archive-url }}
scala-compat-version: ${{ matrix.scala-compat-version }}
java-compat-version: ${{ matrix.java-compat-version }}
hadoop-version: ${{ matrix.hadoop-version }}
python-version: ${{ matrix.python-version }}
================================================
FILE: .github/workflows/test-results.yml
================================================
name: Test Results
on:
workflow_run:
workflows: ["CI"]
types:
- completed
permissions: {}
jobs:
publish-test-results:
name: Publish Test Results
runs-on: ubuntu-latest
if: github.event.workflow_run.conclusion != 'skipped'
permissions:
checks: write
pull-requests: write
steps:
- name: Download and Extract Artifacts
uses: dawidd6/action-download-artifact@09f2f74827fd3a8607589e5ad7f9398816f540fe
with:
run_id: ${{ github.event.workflow_run.id }}
name: "^Event File$| Test Results "
name_is_regexp: true
path: artifacts
- name: Publish Test Results
uses: EnricoMi/publish-unit-test-result-action@v2
with:
commit: ${{ github.event.workflow_run.head_sha }}
event_file: artifacts/Event File/event.json
event_name: ${{ github.event.workflow_run.event }}
files: "artifacts/* Test Results*/**/*.xml"
================================================
FILE: .github/workflows/test-snapshots.yml
================================================
name: Test Snapshots
on:
workflow_call:
jobs:
test:
name: Test (Spark ${{ matrix.spark-version }} Scala ${{ matrix.scala-version }})
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
include:
- spark-compat-version: '3.2'
spark-version: '3.2.5-SNAPSHOT'
scala-compat-version: '2.12'
scala-version: '2.12.15'
java-compat-version: '8'
- spark-compat-version: '3.3'
spark-version: '3.3.5-SNAPSHOT'
scala-compat-version: '2.12'
scala-version: '2.12.15'
java-compat-version: '8'
- spark-compat-version: '3.4'
spark-version: '3.4.5-SNAPSHOT'
scala-compat-version: '2.12'
scala-version: '2.12.17'
java-compat-version: '8'
- spark-compat-version: '3.5'
spark-version: '3.5.9-SNAPSHOT'
scala-compat-version: '2.12'
scala-version: '2.12.18'
java-compat-version: '8'
- spark-compat-version: '3.2'
spark-version: '3.2.5-SNAPSHOT'
scala-compat-version: '2.13'
scala-version: '2.13.5'
java-compat-version: '8'
- spark-compat-version: '3.3'
spark-version: '3.3.5-SNAPSHOT'
scala-compat-version: '2.13'
scala-version: '2.13.8'
java-compat-version: '8'
- spark-compat-version: '3.4'
spark-version: '3.4.5-SNAPSHOT'
scala-compat-version: '2.13'
scala-version: '2.13.8'
java-compat-version: '8'
- spark-compat-version: '3.5'
spark-version: '3.5.9-SNAPSHOT'
scala-compat-version: '2.13'
scala-version: '2.13.8'
java-compat-version: '8'
- spark-compat-version: '4.0'
spark-version: '4.0.3-SNAPSHOT'
scala-compat-version: '2.13'
scala-version: '2.13.16'
java-compat-version: '17'
- spark-compat-version: '4.1'
spark-version: '4.1.2-SNAPSHOT'
scala-compat-version: '2.13'
scala-version: '2.13.17'
java-compat-version: '17'
- spark-compat-version: '4.2'
spark-version: '4.2.0-SNAPSHOT'
scala-compat-version: '2.13'
scala-version: '2.13.18'
java-compat-version: '17'
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Test
uses: ./.github/actions/test-jvm
env:
CI_SLOW_TESTS: 1
with:
spark-version: ${{ matrix.spark-version }}
scala-version: ${{ matrix.scala-version }}
spark-compat-version: ${{ matrix.spark-compat-version }}-SNAPSHOT
scala-compat-version: ${{ matrix.scala-compat-version }}
java-compat-version: ${{ matrix.java-compat-version }}
================================================
FILE: .gitignore
================================================
# use glob syntax.
syntax: glob
*.ser
*.class
*~
*.bak
#*.off
*.old
# eclipse conf file
.settings
.classpath
.project
.manager
.scala_dependencies
# idea
.idea
*.iml
# building
target
build
null
tmp*
temp*
dist
test-output
build.log
# other scm
.svn
.CVS
.hg*
# switch to regexp syntax.
# syntax: regexp
# ^\.pc/
# project specific
python/**/__pycache__
spark-*
.cache
================================================
FILE: .scalafmt.conf
================================================
version = 3.7.17
runner.dialect = scala213
rewrite.trailingCommas.style = keep
docstrings.style = Asterisk
maxColumn = 120
================================================
FILE: CHANGELOG.md
================================================
# Changelog
All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
## [2.15.0] - 2025-12-13
### Added
- Support encrypted parquet files (#324)
### Changed
- Remove support for Spark 3.0 and Spark 3.1 (#332)
- Make all undocumented, unintentionally public API parts private (#331)
- Reading Parquet metadata can use a Parquet Hadoop version different from the one shipped with Spark (#330)
## [2.14.2] - 2025-07-21
### Changed
- Fixed release process (#320)
## [2.14.1] - 2025-07-17
### Changed
- Fixed release process (#319)
## [2.14.0] - 2025-07-17
### Added
- Support for Spark 4.0 (#269, #272, #293)
### Changed
- Improve backticks (#265)
New: This escapes backticks that already exist in column names.
Change: This does not quote columns that only contain letters, numbers
and underscores, which were quoted before.
- Move Python dependencies into `setup.py`, build jar from `setup.py` (#301)
## [2.13.0] - 2024-11-04
### Fixed
- Support diff for Spark Connect implemented via PySpark Dataset API (#251)
### Added
- Add ignore columns to diff in Python API (#252)
- Check that the Java / Scala package is installed when needed by Python (#250)
## [2.12.0] - 2024-04-26
### Fixed
- Diff change column should respect comparators (#238)
### Changed
- Make create_temporary_dir work with pyspark-extension only (#222).
This allows [installing PIP packages and Poetry projects](PYSPARK-DEPS.md)
via pure Python spark-extension package (Maven package not required any more).
- Add map diff comparator to Python API (#226)
## [2.11.0] - 2024-01-04
### Added
- Add count_null aggregate function (#206)
- Support reading parquet schema (#208)
- Add more columns to reading parquet metadata (#209, #211)
- Provide groupByKey shortcuts for groupBy.as (#213)
- Allow to install PIP packages into PySpark job (#215)
- Allow to install Poetry projects into PySpark job (#216)
## [2.10.0] - 2023-09-27
### Fixed
- Update setup.py to include parquet methods in python package (#191)
### Added
- Add --statistics option to diff app (#189)
- Add --filter option to diff app (#190)
## [2.9.0] - 2023-08-23
### Added
- Add key order sensitive map comparator (#187)
### Changed
- Use dataset encoder rather than implicit value encoder for implicit dataset extension class (#183)
### Fixed
- Fix key-sensitivity in map comparator (#186)
## [2.8.0] - 2023-05-24
### Added
- Add method to set and automatically unset Spark job description. (#172)
- Add column function that converts between .Net (C#, F#, Visual Basic) `DateTime.Ticks` and Spark timestamp / Unix epoch timestamps. (#153)
## [2.7.0] - 2023-05-05
### Added
- Spark app to diff files or tables and write result back to file or table. (#160)
- Add null value count to `parquetBlockColumns` and `parquet_block_columns`. (#162)
- Add `parallelism` argument to Parquet metadata methods. (#164)
### Changed
- Change data type of column name in `parquetBlockColumns` and `parquet_block_columns` to array of strings.
Cast to string to get earlier behaviour (string column name). (#162)
## [2.6.0] - 2023-04-11
### Added
- Add reader for parquet metadata. (#154)
## [2.5.0] - 2023-03-23
### Added
- Add whitespace agnostic diff comparator. (#137)
- Add Python whl package build. (#151)
## [2.4.0] - 2022-12-08
### Added
- Allow for custom diff equality. (#127)
### Fixed
- Fix Python API calling into Scala code. (#132)
## [2.3.0] - 2022-10-26
### Added
- Add diffWith to Scala, Java and Python Diff API. (#109)
### Changed
- Diff similar Datasets with ignoreColumns. Before, only similar DataFrames could be diffed with ignoreColumns. (#111)
### Fixed
- Cache before writing via partitionedBy to work around SPARK-40588. Unpersist via UnpersistHandle. (#124)
## [2.2.0] - 2022-07-21
### Added
- Add (global) row numbers transformation to Scala, Java and Python API. (#97)
### Removed
- Removed support for Python 3.6
## [2.1.0] - 2022-04-07
### Added
- Add sorted group methods to Dataset. (#76)
## [2.0.0] - 2021-10-29
### Added
- Add support for Spark 3.2 and Scala 2.13.
- Support to ignore columns in diff API. (#63)
### Removed
- Removed support for Spark 2.4.
## [1.3.3] - 2020-12-17
### Added
- Add support for Spark 3.1.
## [1.3.2] - 2020-12-17
### Changed
- Refine conditional transformation helper methods.
## [1.3.1] - 2020-12-10
### Changed
- Refine conditional transformation helper methods.
## [1.3.0] - 2020-12-07
### Added
- Add transformation to compute histogram. (#26)
- Add conditional transformation helper methods. (#27)
- Add partitioned writing helpers that simplify writing optimally ordered partitioned data. (#29)
## [1.2.0] - 2020-10-06
### Added
- Add diff modes (#22): column-by-column, side-by-side, left and right side diff modes.
- Add sparse mode (#23): the diff DataFrame contains only changed values.
## [1.1.0] - 2020-08-24
### Added
- Add Python API for Diff transformation.
- Add change column to Diff transformation providing column names of all changed columns in a row.
- Add fluent methods to change immutable diff options.
- Add `backticks` method to handle column names that contain dots (`.`).
## [1.0.0] - 2020-03-12
### Added
- Add Diff transformation for Datasets.
================================================
FILE: CONDITIONAL.md
================================================
# DataFrame Transformations
The Spark `Dataset` API allows for chaining transformations as in the following example:
```scala
ds.where($"id" === 1)
.withColumn("state", lit("new"))
.orderBy($"timestamp")
```
When you define additional transformation functions, the `Dataset` API allows you to
also fluently call into those:
```scala
def transformation(df: DataFrame): DataFrame = df.distinct
ds.transform(transformation)
```
Here are some methods that extend this principle to conditional calls.
## Conditional Transformations
You can run a transformation after checking a condition with a chain of fluent transformation calls:
```scala
import uk.co.gresearch._
val condition = true
val result =
ds.where($"id" === 1)
.withColumn("state", lit("new"))
.when(condition).call(transformation)
.orderBy($"timestamp")
```
rather than
```scala
val condition = true
val filteredDf = ds.where($"id" === 1)
.withColumn("state", lit("new"))
val condDf = if (condition) transformation(filteredDf) else filteredDf
val result = condDf.orderBy($"timestamp")
```
In case you need an else transformation as well, try:
```scala
import uk.co.gresearch._
val condition = true
val result =
ds.where($"id" === 1)
.withColumn("state", lit("new"))
.on(condition).either(transformation).or(other)
.orderBy($"timestamp")
```
## Fluent and conditional functions elsewhere
The same fluent notation works for instances other than `Dataset` or `DataFrame`, e.g.
for the `DataFrameWriter`:
```scala
def writeData[T](writer: DataFrameWriter[T]): Unit = { ... }
ds.write
.when(compress).call(_.option("compression", "gzip"))
.call(writeData)
```
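The conditional either/or notation from above can be combined with this pattern too. A hedged sketch, assuming the generic extension also applies to `DataFrameWriter` and using a hypothetical `toParquet` flag introduced only for this example:
```scala
import uk.co.gresearch._
val toParquet = true
ds.write
  // choose the output format conditionally, then hand the writer to writeData
  .on(toParquet).either(_.format("parquet")).or(_.format("csv"))
  .call(writeData)
```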
================================================
FILE: DIFF.md
================================================
# Spark Diff
Add the following `import` to your Scala code:
```scala
import uk.co.gresearch.spark.diff._
```
or this `import` to your Python code:
```python
# noinspection PyUnresolvedReferences
from gresearch.spark.diff import *
```
This adds a `diff` transformation to `Dataset` and `DataFrame` that computes the differences between two datasets / dataframes,
i.e. which rows of one dataset / dataframe to _add_, _delete_ or _change_ to get to the other dataset / dataframe.
For example, in Scala
```scala
val left = Seq((1, "one"), (2, "two"), (3, "three")).toDF("id", "value")
val right = Seq((1, "one"), (2, "Two"), (4, "four")).toDF("id", "value")
```
or in Python:
```python
left = spark.createDataFrame([(1, "one"), (2, "two"), (3, "three")], ["id", "value"])
right = spark.createDataFrame([(1, "one"), (2, "Two"), (4, "four")], ["id", "value"])
```
diffing becomes as easy as:
```scala
left.diff(right).show()
```
|diff |id |value |
|:---:|:---:|:-----:|
| N| 1| one|
| D| 2| two|
| I| 2| Two|
| D| 3| three|
| I| 4| four|
With columns that provide unique identifiers per row (here `id`), the diff looks like:
```scala
left.diff(right, "id").show()
```
|diff |id |left_value|right_value|
|:---:|:---:|:--------:|:---------:|
| N| 1| one| one|
| C| 2| two| Two|
| D| 3| three| *null*|
| I| 4| *null*| four|
An equivalent alternative is this hand-crafted transformation (Scala):
```scala
left.withColumn("exists", lit(1)).as("l")
.join(right.withColumn("exists", lit(1)).as("r"),
$"l.id" <=> $"r.id",
"fullouter")
.withColumn("diff",
when($"l.exists".isNull, "I").
when($"r.exists".isNull, "D").
when(!($"l.value" <=> $"r.value"), "C").
otherwise("N"))
.show()
```
Statistics on the differences can be obtained by
```scala
left.diff(right, "id").groupBy("diff").count().show()
```
|diff |count |
|:----:|:-----:|
| N| 1|
| I| 1|
| D| 1|
| C| 1|
The `diff` transformation can optionally provide a *change column* that lists all non-id column names that have changed.
This column is an array of strings and only set for `"N"` and `"C"` action rows; it is *null* for `"I"` and `"D"` action rows.
|diff |changes|id |left_value|right_value|
|:---:|:-----:|:---:|:--------:|:---------:|
| N| []| 1| one| one|
| C|[value]| 2| two| Two|
| D| *null*| 3| three| *null*|
| I| *null*| 4| *null*| four|
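The change column is enabled through `DiffOptions`; a minimal sketch (assuming the `left` and `right` DataFrames from above) that yields the change column shown:
```scala
import uk.co.gresearch.spark.diff._
// name the change column "changes" on top of the default options
val optionsWithChanges = DiffOptions.default.withChangeColumn("changes")
left.diff(right, optionsWithChanges, "id").show()
```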
## Features
This `diff` transformation provides the following features:
* id columns are optional
* provides typed `diffAs` and `diffWith` transformations
* supports *null* values in id and non-id columns
* detects *null* value insertion / deletion
* [configurable](#configuring-diff) via `DiffOptions`:
* diff column name (default: `"diff"`), in case the default name already exists in the diff result schema
* diff action labels (defaults: `"N"`, `"I"`, `"D"`, `"C"`), allows custom diff notation,<br/> e.g. Unix diff left-right notation (<, >) or git before-after format (+, -, -+)
* [custom equality operators](#comparators-equality) (e.g. double comparison with epsilon threshold)
* [different diff result formats](#diffing-modes)
* [sparse diffing mode](#sparse-mode)
* optionally provides a *change column* that lists all non-id column names that have changed (only for `"C"` action rows)
* guarantees that no duplicate columns exist in the result, throws a readable exception otherwise
## Configuring Diff
Diffing can be configured via an optional `DiffOptions` instance (see [Methods](#methods) below).
|option |default |description|
|--------------------|:-------:|-----------|
|`diffColumn` |`"diff"` |The 'diff column' provides the action or diff value encoding if the respective row has been inserted, changed, deleted or has not been changed at all.|
|`leftColumnPrefix` |`"left"` |Non-id columns of the 'left' dataset are prefixed with this prefix.|
|`rightColumnPrefix` |`"right"`|Non-id columns of the 'right' dataset are prefixed with this prefix.|
|`insertDiffValue` |`"I"` |Inserted rows are marked with this string in the 'diff column'.|
|`changeDiffValue` |`"C"` |Changed rows are marked with this string in the 'diff column'.|
|`deleteDiffValue` |`"D"` |Deleted rows are marked with this string in the 'diff column'.|
|`nochangeDiffValue` |`"N"` |Unchanged rows are marked with this string in the 'diff column'.|
|`changeColumn` |*none* |An array with the names of all columns that have changed values is provided in this column (only for unchanged and changed rows, *null* otherwise).|
|`diffMode` |`DiffModes.Default`|Configures the diff output format. For details see [Diff Modes](#diff-modes) section below.|
|`sparseMode` |`false` |When `true`, only values that have changed are provided on left and right side, `null` is used for un-changed values.|
|`defaultComparator` |`DiffComparators.default()`|The default equality for all value columns.|
|`dataTypeComparators`|_empty_ |Map from data types to comparators.|
|`columnNameComparators`|_empty_|Map from column names to comparators.|
Either construct an instance via the constructor …
```scala
// Scala
import uk.co.gresearch.spark.diff.{DiffOptions, DiffMode}
val options = DiffOptions("d", "l", "r", "i", "c", "d", "n", Some("changes"), DiffMode.Default, false)
```
```python
# Python
from gresearch.spark.diff import DiffOptions, DiffMode
options = DiffOptions("d", "l", "r", "i", "c", "d", "n", "changes", DiffMode.Default, False)
```
… or via the `.with*` methods. The former requires most options to be specified, whereas the latter
only requires those that deviate from the defaults, and it is more readable.
Start from the default options `DiffOptions.default` and customize as follows:
```scala
// Scala
import uk.co.gresearch.spark.diff.{DiffOptions, DiffMode, DiffComparators}
val options = DiffOptions.default
.withDiffColumn("d")
.withLeftColumnPrefix("l")
.withRightColumnPrefix("r")
.withInsertDiffValue("i")
.withChangeDiffValue("c")
.withDeleteDiffValue("d")
.withNochangeDiffValue("n")
.withChangeColumn("changes")
.withDiffMode(DiffMode.Default)
.withSparseMode(true)
.withDefaultComparator(DiffComparators.epsilon(0.001))
.withComparator(DiffComparators.epsilon(0.001), DoubleType)
.withComparator(DiffComparators.epsilon(0.001), "float_column")
```
```python
# Python
from pyspark.sql.types import DoubleType
from gresearch.spark.diff import DiffOptions, DiffMode, DiffComparators
options = DiffOptions() \
.with_diff_column("d") \
.with_left_column_prefix("l") \
.with_right_column_prefix("r") \
.with_insert_diff_value("i") \
.with_change_diff_value("c") \
.with_delete_diff_value("d") \
.with_nochange_diff_value("n") \
.with_change_column("changes") \
.with_diff_mode(DiffMode.Default) \
.with_sparse_mode(True) \
.with_default_comparator(DiffComparators.epsilon(0.01)) \
.with_data_type_comparator(DiffComparators.epsilon(0.001), DoubleType()) \
.with_column_name_comparator(DiffComparators.epsilon(0.001), "float_column")
```
### Diffing Modes
The result of the diff transformation can have the following formats:
- *column by column*: The non-id columns are arranged column by column, i.e. for each non-id column
there are two columns next to each other in the diff result, one from the left
and one from the right dataset. This is useful to easily compare the values
for each column.
- *side by side*: The non-id columns from the left and right dataset are arranged side by side,
i.e. first there are all columns from the left dataset, then from the right one.
This is useful to visually compare the datasets as a whole, especially in conjunction
with the sparse mode.
- *left side*: Only the columns of the left dataset are present in the diff output. This mode
provides the left dataset as is, annotated with diff action and optional changed column names.
- *right side*: Only the columns of the right dataset are present in the diff output. This mode
provides the right dataset as given, as well as the diff action that has been applied to it.
This serves as a patch that, applied to the left dataset, results in the right dataset.
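The output format is selected via `DiffOptions`. For example, a minimal sketch that requests the side-by-side format, given two datasets `left` and `right` and an id column `id`:
```scala
import uk.co.gresearch.spark.diff._
// pick the side-by-side output format instead of the default column-by-column
val sideBySide = DiffOptions.default.withDiffMode(DiffMode.SideBySide)
left.diff(right, sideBySide, "id").show()
```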
With the following two datasets `left` and `right`:
```scala
case class Value(id: Int, value: Option[String], label: Option[String])
val left = Seq(
Value(1, Some("one"), None),
Value(2, Some("two"), Some("number two")),
Value(3, Some("three"), Some("number three")),
Value(4, Some("four"), Some("number four")),
Value(5, Some("five"), Some("number five")),
).toDS
val right = Seq(
Value(1, Some("one"), Some("one")),
Value(2, Some("Two"), Some("number two")),
Value(3, Some("Three"), Some("number Three")),
Value(4, Some("four"), Some("number four")),
Value(6, Some("six"), Some("number six")),
).toDS
```
the diff modes produce the following outputs:
#### Column by Column
|diff |id |left_value|right_value|left_label |right_label |
|:---:|:---:|:--------:|:---------:|:----------:|:----------:|
|C |1 |one |one |*null* |one |
|C |2 |two |Two |number two |number two |
|C |3 |three |Three |number three|number Three|
|N |4 |four |four |number four |number four |
|D |5 |five |null |number five |*null* |
|I |6 |*null* |six |*null* |number six |
#### Side by Side
|diff |id |left_value|left_label |right_value|right_label |
|:---:|:---:|:--------:|:----------:|:---------:|:----------:|
|C |1 |one |*null* |one |one |
|C |2 |two |number two |Two |number two |
|C |3 |three |number three|Three |number Three|
|N |4 |four |number four |four |number four |
|D |5 |five |number five |null |*null* |
|I |6 |*null* |*null* |six |number six |
#### Left Side
|diff |id |value|label |
|:---:|:---:|:---:|:----------:|
|C |1 |one |null |
|C |2 |two |number two |
|C |3 |three|number three|
|N |4 |four |number four |
|D |5 |five |number five |
|I |6 |null |null |
#### Right Side
|diff |id |value|label |
|:---:|:---:|:---:|:----------:|
|C |1 |one |one |
|C |2 |Two |number two |
|C |3 |Three|number Three|
|N |4 |four |number four |
|D |5 |null |null |
|I |6 |six |number six |
### Sparse Mode
The diff modes above can be combined with sparse mode. In sparse mode, only values that differ between
the two datasets are in the diff result, all other values are `null`.
Above [Column by Column](#column-by-column) example would look in sparse mode as follows:
|diff |id |left_value|right_value|left_label |right_label |
|:---:|:---:|:--------:|:---------:|:----------:|:----------:|
|C |1 |null |null |null |one |
|C |2 |two |Two |null |null |
|C |3 |three |Three |number three|number Three|
|N |4 |null |null |null |null |
|D |5 |five |null |number five |null |
|I |6 |null |six |null |number six |
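For reference, the sparse variant above can be produced by enabling sparse mode on the default options; a minimal sketch, assuming the `left` and `right` datasets defined above:
```scala
import uk.co.gresearch.spark.diff._
// keep the default column-by-column format, but null out unchanged values
val sparseOptions = DiffOptions.default.withSparseMode(true)
left.diff(right, sparseOptions, "id").show()
```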
### Comparators (Equality)
Values are compared for equality with the default `<=>` operator, which considers values
equal when both sides are `null`, or both sides are not `null` and equal.
The following alternative comparators are provided:
|Comparator|Description|
|:---------|:----------|
|`DiffComparators.epsilon(epsilon)`|Two values are equal when they are at most `epsilon` apart.<br/><br/>The comparator can be configured to use `epsilon` as an absolute (`.asAbsolute()`) threshold, or as relative (`.asRelative()`) to the larger value. Further, the threshold itself can be considered equal (`.asInclusive()`) or not equal (`.asExclusive()`):<ul><li>`DiffComparators.epsilon(epsilon).asAbsolute().asInclusive()`:<br/>`x` and `y` are equal iff `abs(x - y) ≤ epsilon`</li><li>`DiffComparators.epsilon(epsilon).asAbsolute().asExclusive()`:<br/>`x` and `y` are equal iff `abs(x - y) < epsilon`</li><li>`DiffComparators.epsilon(epsilon).asRelative().asInclusive()`:<br/>`x` and `y` are equal iff `abs(x - y) ≤ epsilon * max(abs(x), abs(y))`</li><li>`DiffComparators.epsilon(epsilon).asRelative().asExclusive()`:<br/>`x` and `y` are equal iff `abs(x - y) < epsilon * max(abs(x), abs(y))`</li></ul>|
|`DiffComparators.string()`|Two `StringType` values are compared while ignoring white space differences. For this comparison, sequences of whitespace are collapsed into a single whitespace, and leading and trailing whitespace is removed. With `DiffComparators.string(false)`, string values are compared with the default comparator.|
|`DiffComparators.duration(duration)`|Two `DateType` or `TimestampType` values are equal when they are at most `duration` apart. That duration is an instance of `java.time.Duration`.<br/><br/>The comparator can be configured to consider `duration` as equal (`.asInclusive()`) or not equal (`.asExclusive()`):<ul><li>`DiffComparators.duration(duration).asInclusive()`:<br/>`x` and `y` are equal iff `x - y ≤ duration`</li><li>`DiffComparators.duration(duration).asExclusive()`:<br/>`x` and `y` are equal iff `x - y < duration`</li></ul>|
|`DiffComparators.map[K,V](keyOrderSensitive)` (Scala only)<br/>`DiffComparators.map(keyType, valueType, keyOrderSensitive)`|Two `Map[K,V]` values are equal when they match in all their keys and values. With `keyOrderSensitive=true`, the order of the keys matters, with `keyOrderSensitive=false` (default), the order of keys is ignored.|
An example:
```scala
val left = Seq((1, 1.0), (2, 2.0), (3, 3.0)).toDF("id", "value")
val right = Seq((1, 1.0), (2, 2.02), (3, 3.05)).toDF("id", "value")
left.diff(right, "id").show()
```
|diff| id|left_value|right_value|
|----|---|----------|-----------|
| N| 1| 1.0| 1.0|
| C| 2| 2.0| 2.02|
| C| 3| 3.0| 3.05|
The second and third rows are considered `"C"`hanged because `2.0 != 2.02` and `3.0 != 3.05`, respectively.
With an inclusive relative epsilon of 1%, `2.0` and `2.02` are considered equal, while `3.0` and `3.05` are still not equal:
```scala
val options = DiffOptions.default
  .withComparator(DiffComparators.epsilon(0.01).asRelative().asInclusive(), DoubleType)
left.diff(right, options, "id").show()
```
|diff| id|left_value|right_value|
|----|---|----------|-----------|
| N| 1| 1.0| 1.0|
| N| 2| 2.0| 2.02|
| C| 3| 3.0| 3.05|
The user can provide custom comparator implementations by implementing `scala.math.Equiv[T]`
or `uk.co.gresearch.spark.diff.DiffComparator`:
```scala
val intEquiv: Equiv[Int] = (x: Int, y: Int) => x == null && y == null || x != null && y != null && x.equals(y)
val anyEquiv: Equiv[Any] = (x: Any, y: Any) => x == null && y == null || x != null && y != null && x.equals(y)
val comparator: DiffComparator = (left: Column, right: Column) => left <=> right
import spark.implicits._
val options = DiffOptions.default
  .withComparator(intEquiv)
  .withComparator(anyEquiv, LongType, DoubleType)
  .withComparator(anyEquiv, "column1", "column2")
  .withComparator(comparator, StringType, FloatType)
  .withComparator(comparator, "column3", "column4")
```
## Methods (Scala)
All Scala methods come in two variants, one without (as shown below) and one with an `options: DiffOptions` argument.
* `def diff(other: Dataset[T], idColumns: String*): DataFrame`
* `def diff[U](other: Dataset[U], idColumns: Seq[String], ignoreColumns: Seq[String]): DataFrame`
* `def diffAs[V](other: Dataset[T], idColumns: String*)(implicit diffEncoder: Encoder[V]): Dataset[V]`
* `def diffAs[U, V](other: Dataset[U], idColumns: Seq[String], ignoreColumns: Seq[String])(implicit diffEncoder: Encoder[V]): Dataset[V]`
* `def diffAs[V](other: Dataset[T], diffEncoder: Encoder[V], idColumns: String*): Dataset[V]`
* `def diffAs[U, V](other: Dataset[U], diffEncoder: Encoder[V], idColumns: Seq[String], ignoreColumns: Seq[String]): Dataset[V]`
* `def diffWith(other: Dataset[T], idColumns: String*): Dataset[(String, T, T)]`
* `def diffWith[U](other: Dataset[U], idColumns: Seq[String], ignoreColumns: Seq[String]): Dataset[(String, T, U)]`
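For illustration, a minimal, hypothetical sketch of the typed variants, assuming the `left`/`right` data from the introduction, `import spark.implicits._`, and case classes defined here purely for this example:
```scala
import org.apache.spark.sql.Dataset
import uk.co.gresearch.spark.diff._
import spark.implicits._
// hypothetical case classes for this example only
case class IdValue(id: Int, value: String)
case class DiffResult(diff: String, id: Int, left_value: String, right_value: String)
// diffWith keeps the full left and right rows next to the diff action
val withRows: Dataset[(String, IdValue, IdValue)] =
  left.as[IdValue].diffWith(right.as[IdValue], "id")
// diffAs encodes the flat diff result into a custom type
val typed: Dataset[DiffResult] =
  left.as[IdValue].diffAs[DiffResult](right.as[IdValue], "id")
```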
## Methods (Java)
* `Dataset<Row> Diff.of[T](Dataset<T> left, Dataset<T> right, String... idColumns)`
* `Dataset<Row> Diff.of[T, U](Dataset<T> left, Dataset<U> right, List<String> idColumns, List<String> ignoreColumns)`
* `Dataset<V> Diff.ofAs[T, V](Dataset<T> left, Dataset<T> right, Encoder<V> diffEncoder, String... idColumns)`
* `Dataset<V> Diff.ofAs[T, U, V](Dataset<T> left, Dataset<U> right, Encoder<V> diffEncoder, List<String> idColumns, List<String> ignoreColumns)`
* `Dataset<Tuple3<String, T, T>> Diff.ofWith[T](Dataset<T> left, Dataset<T> right, String... idColumns)`
* `Dataset<Tuple3<String, T, U>> Diff.ofWith[T, U](Dataset<T> left, Dataset<U> right, List<String> idColumns, List<String> ignoreColumns)`
Given a `DiffOptions`, a customized `Differ` can be instantiated as `Differ differ = new Differ(options)`:
* `Dataset<Row> Differ.diff[T](Dataset<T> left, Dataset<T> right, String... idColumns)`
* `Dataset<Row> Differ.diff[T, U](Dataset<T> left, Dataset<U> right, List<String> idColumns, List<String> ignoreColumns)`
* `Dataset<V> Differ.diffAs[T, V](Dataset<T> left, Dataset<T> right, Encoder<V> diffEncoder, String... idColumns)`
* `Dataset<V> Differ.diffAs[T, U, V](Dataset<T> left, Dataset<U> right, Encoder<V> diffEncoder, List<String> idColumns, List<String> ignoreColumns)`
* `Dataset<Row> Differ.diffWith[T](Dataset<T> left, Dataset<T> right, String... idColumns)`
* `Dataset<Row> Differ.diffWith[T, U](Dataset<T> left, Dataset<U> right, List<String> idColumns, List<String> ignoreColumns)`
## Methods (Python)
* `def diff(self: DataFrame, other: DataFrame, *id_columns: str) -> DataFrame`
* `def diff(self: DataFrame, other: DataFrame, id_columns: List[str], ignore_columns: List[str]) -> DataFrame`
* `def diff(self: DataFrame, other: DataFrame, options: DiffOptions, *id_columns: str) -> DataFrame`
* `def diff(self: DataFrame, other: DataFrame, options: DiffOptions, id_columns: List[str], ignore_columns: List[str]) -> DataFrame`
* `def diffwith(self: DataFrame, other: DataFrame, *id_columns: str) -> DataFrame:`
* `def diffwith(self: DataFrame, other: DataFrame, id_columns: List[str], ignore_columns: List[str]) -> DataFrame`
* `def diffwith(self: DataFrame, other: DataFrame, options: DiffOptions, *id_columns: str) -> DataFrame:`
* `def diffwith(self: DataFrame, other: DataFrame, options: DiffOptions, id_columns: List[str], ignore_columns: List[str]) -> DataFrame`
## Diff Spark application
There is also a Spark application that can be used to create a diff DataFrame. The application reads two DataFrames
`left` and `right` from files or tables, executes the diff transformation and writes the result DataFrame to a file or table.
The Diff app can be run via `spark-submit`:
```shell
# Scala 2.12
spark-submit --packages com.github.scopt:scopt_2.12:4.1.0 spark-extension_2.12-2.7.0-3.4.jar --help
# Scala 2.13
spark-submit --packages com.github.scopt:scopt_2.13:4.1.0 spark-extension_2.13-2.7.0-3.4.jar --help
```
```
Spark Diff app (2.10.0-3.4)
Usage: spark-extension_2.13-2.10.0-3.4.jar [options] left right diff
left file path (requires format option) or table name to read left dataframe
right file path (requires format option) or table name to read right dataframe
diff file path (requires format option) or table name to write diff dataframe
Examples:
- Diff CSV files 'left.csv' and 'right.csv' and write result into CSV file 'diff.csv':
spark-submit --packages com.github.scopt:scopt_2.13:4.1.0 spark-extension_2.13-2.10.0-3.4.jar --format csv left.csv right.csv diff.csv
- Diff CSV file 'left.csv' with Parquet file 'right.parquet' with id column 'id', and write result into Hive table 'diff':
spark-submit --packages com.github.scopt:scopt_2.13:4.1.0 spark-extension_2.13-2.10.0-3.4.jar --left-format csv --right-format parquet --hive --id id left.csv right.parquet diff
Spark session
--master <master> Spark master (local, yarn, ...), not needed with spark-submit
--app-name <app-name> Spark application name
--hive enable Hive support to read from and write to Hive tables
Input and output
-f, --format <format> input and output file format (csv, json, parquet, ...)
--left-format <format> left input file format (csv, json, parquet, ...)
--right-format <format> right input file format (csv, json, parquet, ...)
--output-format <format> output file format (csv, json, parquet, ...)
-s, --schema <schema> input schema
--left-schema <schema> left input schema
--right-schema <schema> right input schema
--left-option:key=val left input option
--right-option:key=val right input option
--output-option:key=val output option
--id <name> id column name
--ignore <name> ignore column name
--save-mode <save-mode> save mode for writing output (Append, Overwrite, ErrorIfExists, Ignore, default ErrorIfExists)
--filter <filter> Filters for rows with these diff actions, with default diffing options use 'N', 'I', 'D', or 'C' (see 'Diffing options' section)
--statistics Only output statistics on how many rows exist per diff action (see 'Diffing options' section)
Diffing options
--diff-column <name> column name for diff column (default 'diff')
--left-prefix <prefix> prefix for left column names (default 'left')
--right-prefix <prefix> prefix for right column names (default 'right')
--insert-value <value> value for insertion (default 'I')
--change-value <value> value for change (default 'C')
--delete-value <value> value for deletion (default 'D')
--no-change-value <val> value for no change (default 'N')
--change-column <name> column name for change column (default is no such column)
--diff-mode <mode> diff mode (ColumnByColumn, SideBySide, LeftSide, RightSide, default ColumnByColumn)
--sparse enable sparse diff
General
--help prints this usage text
```
### Examples
Diff CSV files `left.csv` and `right.csv` and write result into CSV file `diff.csv`:
```shell
spark-submit --packages com.github.scopt:scopt_2.13:4.1.0 spark-extension_2.13-2.7.0-3.4.jar --format csv left.csv right.csv diff.csv
```
Diff CSV file `left.csv` with Parquet file `right.parquet` with id column `id`, and write result into Hive table `diff`:
```shell
spark-submit --packages com.github.scopt:scopt_2.13:4.1.0 spark-extension_2.13-2.7.0-3.4.jar --left-format csv --right-format parquet --hive --id id left.csv right.parquet diff
```
================================================
FILE: GROUPS.md
================================================
# Sorted Groups
Spark provides the ability to group rows by an arbitrary key,
while providing an iterator over each of these groups.
This allows iterating over groups that are too large to fit into memory:
```scala
import org.apache.spark.sql.Dataset
import spark.implicits._
case class Val(id: Int, seq: Int, value: Double)
val ds: Dataset[Val] = Seq(
Val(1, 1, 1.1),
Val(1, 2, 1.2),
Val(1, 3, 1.3),
Val(2, 1, 2.1),
Val(2, 2, 2.2),
Val(2, 3, 2.3),
Val(3, 1, 3.1)
).reverse.toDS().repartition(3).cache()
// order of iterator IS NOT guaranteed
ds.groupByKey(v => v.id)
.flatMapGroups((key, it) => it.zipWithIndex.map(v => (key, v._2, v._1.seq, v._1.value)))
.toDF("key", "index", "seq", "value")
.show(false)
+---+-----+---+-----+
|key|index|seq|value|
+---+-----+---+-----+
|1 |0 |3 |1.3 |
|1 |1 |2 |1.2 |
|1 |2 |1 |1.1 |
|2 |0 |1 |2.1 |
|2 |1 |3 |2.3 |
|2 |2 |2 |2.2 |
|3 |0 |1 |3.1 |
+---+-----+---+-----+
```
However, we have no control over the order of the group iterators.
If we want the iterators to be ordered according to `seq`, we can do the following:
```scala
import uk.co.gresearch.spark._
// the group key $"id" needs an ordering
implicit val ordering: Ordering.Int.type = Ordering.Int
// order of iterator IS guaranteed
ds.groupBySorted($"id")($"seq")
.flatMapSortedGroups((key, it) => it.zipWithIndex.map(v => (key, v._2, v._1.seq, v._1.value)))
.toDF("key", "index", "seq", "value")
.show(false)
+---+-----+---+-----+
|key|index|seq|value|
+---+-----+---+-----+
|1 |0 |1 |1.1 |
|1 |1 |2 |1.2 |
|1 |2 |3 |1.3 |
|2 |0 |1 |2.1 |
|2 |1 |2 |2.2 |
|2 |2 |3 |2.3 |
|3 |0 |1 |3.1 |
+---+-----+---+-----+
```
Now, the iterators are ordered according to `seq`, as shown by the value of `index`
generated by `it.zipWithIndex`.
Instead of column expressions, we can also use lambdas to define group key and group order:
```scala
ds.groupByKeySorted(v => v.id)(v => v.seq)
.flatMapSortedGroups((key, it) => it.zipWithIndex.map(v => (key, v._2, v._1.seq, v._1.value)))
.toDF("key", "index", "seq", "value")
.show(false)
```
**Note:** Using lambdas here hides from Spark which columns we use for grouping and sorting.
Query optimization cannot improve partitioning and sorting in this case. Use column expressions when possible.
================================================
FILE: HISTOGRAM.md
================================================
# Histogram
For a table `df` like
|user |score|
|:-----:|:---:|
|Alice |101 |
|Alice |221 |
|Alice |211 |
|Alice |176 |
|Bob |276 |
|Bob |232 |
|Bob |258 |
|Charlie|221 |
you can compute the histogram for each user
|user |≤100 |≤200 |>200 |
|:-----:|:---:|:---:|:---:|
|Alice |0 |2 |2 |
|Bob |0 |0 |3 |
|Charlie|0 |0 |1 |
as follows:
df.withColumn("≤100", when($"score" <= 100, 1).otherwise(0))
.withColumn("≤200", when($"score" > 100 && $"score" <= 200, 1).otherwise(0))
.withColumn(">200", when($"score" > 200, 1).otherwise(0))
.groupBy($"user")
.agg(
sum($"≤100").as("≤100"),
sum($"≤200").as("≤200"),
sum($">200").as(">200")
)
.orderBy($"user")
Equivalent to that query is:
```scala
import uk.co.gresearch.spark._
df.histogram(Seq(100, 200), $"score", $"user").orderBy($"user")
```
The first argument is a sequence of thresholds, the second argument provides the value column.
The subsequent arguments refer to the aggregation columns (`groupBy`). Only aggregation columns
will be in the result DataFrame.
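For example, several aggregation columns can be given. A minimal sketch, assuming `df` also has a hypothetical grouping column `team` that is not part of the example table above:
```scala
// a sketch: histogram of "score" per ("user", "team"), assuming df has a "team" column
df.histogram(Seq(100, 200), $"score", $"user", $"team").orderBy($"user", $"team")
```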
In Java, call:
```java
import uk.co.gresearch.spark.Histogram;
Histogram.of(df, Arrays.asList(100, 200), new Column("score"), new Column("user")).orderBy("user");
```
In Python, call:
```python
import gresearch.spark
df.histogram([100, 200], 'score', 'user').orderBy('user')
```
Note that this feature is not supported in Python when connected with a [Spark Connect server](README.md#spark-connect-server).
================================================
FILE: LICENSE
================================================
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
================================================
FILE: MAINTAINERS.md
================================================
## Current maintainers of the project
| Maintainer | GitHub ID |
| ---------------------- | ------------------------------------------------------- |
| Enrico Minack | [EnricoMi](https://github.com/EnricoMi) |
================================================
FILE: PARQUET.md
================================================
# Parquet Metadata
The structure of Parquet files (the metadata, not the data stored in Parquet) can be inspected similarly to [parquet-tools](https://pypi.org/project/parquet-tools/)
or [parquet-cli](https://pypi.org/project/parquet-cli/)
by reading from a simple Spark data source.
Parquet metadata can be read on [file level](#parquet-file-metadata),
[schema level](#parquet-file-schema),
[row group level](#parquet-block--rowgroup-metadata),
[column chunk level](#parquet-block-column-metadata) and
[Spark Parquet partition level](#parquet-partition-metadata).
Multiple files can be inspected at once.
Any location that can be read by Spark (`spark.read.parquet(…)`) can be inspected.
This means the path can point to a single Parquet file, a directory with Parquet files,
or multiple paths separated by a comma (`,`). Paths can contain wildcards like `*`.
Multiple files will be inspected in parallel and distributed by Spark.
No actual rows or values will be read from the Parquet files, only metadata, which is very fast.
This allows inspecting Parquet files that have different schemata with one `spark.read` operation.
First, import the new Parquet metadata data sources:
```scala
// Scala
import uk.co.gresearch.spark.parquet._
```
```python
# Python
import gresearch.spark.parquet
```
Then, the following metadata become available:
## Parquet file metadata
Read the metadata of Parquet files into a Dataframe:
```scala
// Scala
spark.read.parquetMetadata("/path/to/parquet").show()
```
```python
# Python
spark.read.parquet_metadata("/path/to/parquet").show()
```
```
+-------------+------+---------------+-----------------+----+-------+------+-----+--------------------+--------------------+-----------+--------------------+
| filename|blocks|compressedBytes|uncompressedBytes|rows|columns|values|nulls| createdBy| schema| encryption| keyValues|
+-------------+------+---------------+-----------------+----+-------+------+-----+--------------------+--------------------+-----------+--------------------+
|file1.parquet| 1| 1268| 1652| 100| 2| 200| 0|parquet-mr versio...|message spark_sch...|UNENCRYPTED|{org.apache.spark...|
|file2.parquet| 2| 2539| 3302| 200| 2| 400| 0|parquet-mr versio...|message spark_sch...|UNENCRYPTED|{org.apache.spark...|
+-------------+------+---------------+-----------------+----+-------+------+-----+--------------------+--------------------+-----------+--------------------+
```
The Dataframe provides the following per-file information:
|column |type | description |
|:-----------------|:----:|:-------------------------------------------------------------------------------|
|filename |string| The Parquet file name |
|blocks |int | Number of blocks / RowGroups in the Parquet file |
|compressedBytes |long | Number of compressed bytes of all blocks |
|uncompressedBytes |long | Number of uncompressed bytes of all blocks |
|rows |long | Number of rows in the file |
|columns |int | Number of columns in the file |
|values |long | Number of values in the file |
|nulls |long | Number of null values in the file |
|createdBy |string| The createdBy string of the Parquet file, e.g. library used to write the file |
|schema |string| The schema |
|encryption |string| The encryption (requires `org.apache.parquet:parquet-hadoop:1.12.4` and above) |
|keyValues |string-to-string map| Key-value data of the file |
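As noted above, several locations can be inspected in one call; paths may be separated by a comma and may contain wildcards. A minimal sketch, with hypothetical paths:
```scala
// Scala – a sketch: inspect two hypothetical locations in one call; "*" matches multiple files
spark.read.parquetMetadata("/data/set1/*.parquet,/data/set2/").show()
```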
## Parquet file schema
Read the schema of Parquet files into a Dataframe:
```scala
// Scala
spark.read.parquetSchema("/path/to/parquet").show()
```
```python
# Python
spark.read.parquet_schema("/path/to/parquet").show()
```
```
+------------+----------+------------------+----------+------+------+----------------+--------------------+-----------+-------------+------------------+------------------+------------------+
| filename|columnName| columnPath|repetition| type|length| originalType| logicalType|isPrimitive|primitiveType| primitiveOrder|maxDefinitionLevel|maxRepetitionLevel|
+------------+----------+------------------+----------+------+------+----------------+--------------------+-----------+-------------+------------------+------------------+------------------+
|file.parquet| a| [a]| REQUIRED| INT64| 0| NULL| NULL| true| INT64|TYPE_DEFINED_ORDER| 0| 0|
|file.parquet| x| [b, x]| REQUIRED| INT32| 0| NULL| NULL| true| INT32|TYPE_DEFINED_ORDER| 1| 0|
|file.parquet| y| [b, y]| REQUIRED|DOUBLE| 0| NULL| NULL| true| DOUBLE|TYPE_DEFINED_ORDER| 1| 0|
|file.parquet| z| [b, z]| OPTIONAL| INT64| 0|TIMESTAMP_MICROS|TIMESTAMP(MICROS,...| true| INT64|TYPE_DEFINED_ORDER| 2| 0|
|file.parquet| element|[c, list, element]| OPTIONAL|BINARY| 0| UTF8| STRING| true| BINARY|TYPE_DEFINED_ORDER| 3| 1|
+------------+----------+------------------+----------+------+------+----------------+--------------------+-----------+-------------+------------------+------------------+------------------+
```
The Dataframe provides the following per-column information:
|column | type | description |
|:-----------------|:------------:|:----------------------------------------------------------------------------------|
|filename | string | The Parquet file name |
|columnName | string | The column name |
|columnPath | string array | The column path |
|repetition | string | The repetition |
|type | string | The data type |
|length | int | The length of the type |
|originalType      | string       | The original type (requires `org.apache.parquet:parquet-hadoop:1.11.0` and above)  |
|logicalType       | string       | The logical type                                                                    |
|isPrimitive | boolean | True if type is primitive |
|primitiveType | string | The primitive type |
|primitiveOrder | string | The order of the primitive type |
|maxDefinitionLevel| int | The max definition level |
|maxRepetitionLevel| int | The max repetition level |
## Parquet block / RowGroup metadata
Read the metadata of Parquet blocks / RowGroups into a Dataframe:
```scala
// Scala
spark.read.parquetBlocks("/path/to/parquet").show()
```
```python
# Python
spark.read.parquet_blocks("/path/to/parquet").show()
```
```
+-------------+-----+----------+---------------+-----------------+----+-------+------+-----+
| filename|block|blockStart|compressedBytes|uncompressedBytes|rows|columns|values|nulls|
+-------------+-----+----------+---------------+-----------------+----+-------+------+-----+
|file1.parquet| 1| 4| 1269| 1651| 100| 2| 200| 0|
|file2.parquet| 1| 4| 1268| 1652| 100| 2| 200| 0|
|file2.parquet| 2| 1273| 1270| 1651| 100| 2| 200| 0|
+-------------+-----+----------+---------------+-----------------+----+-------+------+-----+
```
|column |type |description |
|:-----------------|:----:|:----------------------------------------------|
|filename |string|The Parquet file name |
|block |int |Block / RowGroup number starting at 1 |
|blockStart |long |Start position of the block in the Parquet file|
|compressedBytes |long |Number of compressed bytes in block |
|uncompressedBytes |long |Number of uncompressed bytes in block |
|rows |long |Number of rows in block |
|columns |int |Number of columns in block |
|values |long |Number of values in block |
|nulls |long |Number of null values in block |
## Parquet block column metadata
Read the metadata of Parquet block columns into a Dataframe:
```scala
// Scala
spark.read.parquetBlockColumns("/path/to/parquet").show()
```
```python
# Python
spark.read.parquet_block_columns("/path/to/parquet").show()
```
```
+-------------+-----+------+------+-------------------+-------------------+--------------------+------------------+-----------+---------------+-----------------+------+-----+
| filename|block|column| codec| type| encodings| minValue| maxValue|columnStart|compressedBytes|uncompressedBytes|values|nulls|
+-------------+-----+------+------+-------------------+-------------------+--------------------+------------------+-----------+---------------+-----------------+------+-----+
|file1.parquet| 1| [id]|SNAPPY| required int64 id|[BIT_PACKED, PLAIN]| 0| 99| 4| 437| 826| 100| 0|
|file1.parquet| 1| [val]|SNAPPY|required double val|[BIT_PACKED, PLAIN]|0.005067503372006343|0.9973357672164814| 441| 831| 826| 100| 0|
|file2.parquet| 1| [id]|SNAPPY| required int64 id|[BIT_PACKED, PLAIN]| 100| 199| 4| 438| 825| 100| 0|
|file2.parquet| 1| [val]|SNAPPY|required double val|[BIT_PACKED, PLAIN]|0.010617521596503865| 0.999189783846449| 442| 831| 826| 100| 0|
|file2.parquet| 2| [id]|SNAPPY| required int64 id|[BIT_PACKED, PLAIN]| 200| 299| 1273| 440| 826| 100| 0|
|file2.parquet| 2| [val]|SNAPPY|required double val|[BIT_PACKED, PLAIN]|0.011277044401634018| 0.970525681750662| 1713| 830| 825| 100| 0|
+-------------+-----+------+------+-------------------+-------------------+--------------------+------------------+-----------+---------------+-----------------+------+-----+
```
| column | type | description |
|:------------------|:-------------:|:--------------------------------------------------------------------------------------------------|
| filename | string | The Parquet file name |
| block | int | Block / RowGroup number starting at 1 |
| column | array<string> | Block / RowGroup column name |
| codec              | string        | The codec used to compress the block column values                                                 |
| type | string | The data type of the block column |
| encodings | array<string> | Encodings of the block column |
| isEncrypted | boolean | Whether block column is encrypted (requires `org.apache.parquet:parquet-hadoop:1.12.3` and above) |
| minValue | string | Minimum value of this column in this block |
| maxValue | string | Maximum value of this column in this block |
| columnStart | long | Start position of the block column in the Parquet file |
| compressedBytes | long | Number of compressed bytes of this block column |
| uncompressedBytes | long | Number of uncompressed bytes of this block column |
| values | long | Number of values in this block column |
| nulls | long | Number of null values in this block column |
## Parquet partition metadata
Read the metadata of how Spark partitions Parquet files into a Dataframe:
```scala
// Scala
spark.read.parquetPartitions("/path/to/parquet").show()
```
```python
# Python
spark.read.parquet_partitions("/path/to/parquet").show()
```
```
+---------+-----+----+------+------+---------------+-----------------+----+-------+------+-----+-------------+----------+
|partition|start| end|length|blocks|compressedBytes|uncompressedBytes|rows|columns|values|nulls| filename|fileLength|
+---------+-----+----+------+------+---------------+-----------------+----+-------+------+-----+-------------+----------+
| 1| 0|1024| 1024| 1| 1268| 1652| 100| 2| 200| 0|file1.parquet| 1930|
| 2| 1024|1930| 906| 0| 0| 0| 0| 0| 0| 0|file1.parquet| 1930|
| 3| 0|1024| 1024| 1| 1269| 1651| 100| 2| 200| 0|file2.parquet| 3493|
| 4| 1024|2048| 1024| 1| 1270| 1651| 100| 2| 200| 0|file2.parquet| 3493|
| 5| 2048|3072| 1024| 0| 0| 0| 0| 0| 0| 0|file2.parquet| 3493|
| 6| 3072|3493| 421| 0| 0| 0| 0| 0| 0| 0|file2.parquet| 3493|
+---------+-----+----+------+------+---------------+-----------------+----+-------+------+-----+-------------+----------+
```
|column |type |description |
|:----------------|:----:|:---------------------------------------------------------|
|partition |int |The Spark partition id |
|start |long |The start position of the partition |
|end |long |The end position of the partition |
|length |long |The length of the partition |
|blocks |int |The number of Parquet blocks / RowGroups in this partition|
|compressedBytes |long |The number of compressed bytes in this partition |
|uncompressedBytes|long |The number of uncompressed bytes in this partition |
|rows |long |The number of rows in this partition |
|columns |int |The number of columns in this partition |
|values |long |The number of values in this partition |
|nulls |long |The number of null values in this partition |
|filename |string|The Parquet file name |
|fileLength |long |The length of the Parquet file |
## Performance
Retrieving Parquet metadata is parallelized and distributed by Spark. The result Dataframe
has as many partitions as there are Parquet files in the given `path`, but at most
`spark.sparkContext.defaultParallelism` partitions.
Each result partition reads Parquet metadata from its Parquet files sequentially,
while partitions are executed in parallel (depending on the number of Spark cores of your Spark job).
You can control the number of partitions via the `parallelism` parameter:
```scala
// Scala
spark.read.parquetMetadata(100, "/path/to/parquet")
spark.read.parquetSchema(100, "/path/to/parquet")
spark.read.parquetBlocks(100, "/path/to/parquet")
spark.read.parquetBlockColumns(100, "/path/to/parquet")
spark.read.parquetPartitions(100, "/path/to/parquet")
```
```python
# Python
spark.read.parquet_metadata("/path/to/parquet", parallelism=100)
spark.read.parquet_schema("/path/to/parquet", parallelism=100)
spark.read.parquet_blocks("/path/to/parquet", parallelism=100)
spark.read.parquet_block_columns("/path/to/parquet", parallelism=100)
spark.read.parquet_partitions("/path/to/parquet", parallelism=100)
```
## Encryption
Reading [encrypted Parquet is supported](https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#columnar-encryption).
Files encrypted with [plaintext footer](https://github.com/apache/parquet-format/blob/master/Encryption.md#55-plaintext-footer-mode)
can be read without any encryption keys, while encrypted Parquet metadata are then shown as `NULL` values in the result Dataframe.
Encrypted Parquet files with an encrypted footer require only the footer encryption key. No column encryption keys are needed.
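A minimal sketch for reading metadata of files with an encrypted footer, assuming the key-management configuration from the Spark columnar encryption documentation (the `InMemoryKMS` mock class and the key value are placeholders, not part of this library):
```scala
// Scala – a sketch: make only the footer key available, then read the metadata
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("parquet.crypto.factory.class",
  "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
hadoopConf.set("parquet.encryption.kms.client.class",
  "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
hadoopConf.set("parquet.encryption.key.list", "footerKey:AAECAwQFBgcICQoLDA0ODw==")
spark.read.parquetMetadata("/path/to/encrypted.parquet").show()
```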
## Known Issues
Note that this feature is not supported in Python when connected with a [Spark Connect server](README.md#spark-connect-server).
================================================
FILE: PARTITIONING.md
================================================
# Partitioned Writing
If you have ever used `Dataset[T].write.partitionBy`, here is how you can minimize the number of
written files and obtain same-size files.
Spark has two different concepts both referred to as partitioning. Central to Spark is the
concept of how a `Dataset[T]` is split into partitions where a Spark worker processes
a single partition at a time. This is the fundamental concept of how Spark scales with data.
When writing a `Dataset` `ds` to file-based storage, that output file is actually a directory:
<!--
import java.sql.Timestamp
import java.sql.Timestamp
case class Value(id: Int, ts: Timestamp, property: String, value: String)
val ds = Seq(
Value(1, Timestamp.valueOf("2020-07-01 12:00:00"), "label", "one"),
Value(1, Timestamp.valueOf("2020-07-02 12:00:00"), "descr", "number one"),
Value(1, Timestamp.valueOf("2020-07-03 12:00:00"), "label", "ONE"),
Value(2, Timestamp.valueOf("2020-07-01 12:00:00"), "label", "two"),
Value(2, Timestamp.valueOf("2020-07-03 12:00:00"), "label", "TWO"),
Value(2, Timestamp.valueOf("2020-07-04 12:00:00"), "descr", "number two"),
Value(3, Timestamp.valueOf("2020-07-03 12:00:00"), "label", "THREE"),
Value(3, Timestamp.valueOf("2020-07-03 12:00:00"), "descr", "number three"),
Value(4, Timestamp.valueOf("2020-07-01 12:00:00"), "label", "four"),
Value(4, Timestamp.valueOf("2020-07-03 12:00:00"), "descr", "number four"),
Value(5, Timestamp.valueOf("2020-07-01 12:00:00"), "label", "five"),
Value(5, Timestamp.valueOf("2020-07-03 12:00:00"), "descr", "number five"),
Value(6, Timestamp.valueOf("2020-07-01 12:00:00"), "label", "six"),
Value(6, Timestamp.valueOf("2020-07-01 12:00:00"), "descr", "number six"),
).toDS()
-->
```scala
ds.write.csv("file.csv")
```
The directory structure looks like:
```
file.csv
file.csv/part-00000-7d34816f-bb53-4f44-ab9d-a62d570e5de0-c000.csv
file.csv/part-00001-7d34816f-bb53-4f44-ab9d-a62d570e5de0-c000.csv
file.csv/part-00002-7d34816f-bb53-4f44-ab9d-a62d570e5de0-c000.csv
file.csv/part-00003-7d34816f-bb53-4f44-ab9d-a62d570e5de0-c000.csv
file.csv/part-00004-7d34816f-bb53-4f44-ab9d-a62d570e5de0-c000.csv
file.csv/_SUCCESS
```
When writing, the output can be partitioned by one or more columns of the `Dataset` via `partitionBy`.
For each distinct value `value` in that column `col`, an individual sub-directory is created in your output path.
The name is of the format `col=value`. Inside the sub-directory, multiple partition files exist,
all containing only data where column `col` has value `value`. To remove redundancy, those
files do not contain that column anymore.
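A minimal sketch of such a partitioned write, using the `property` column of the example `Dataset` from above:
```scala
// write the example Dataset partitioned by the "property" column
ds.write
  .partitionBy("property")
  .csv("file.csv")
```
The output path then contains one sub-directory per distinct value of `property`: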
```
file.csv/property=descr/part-00001-8eb44de1-2c33-4f95-a1be-8d1b4e35eb4a.c000.csv
file.csv/property=descr/part-00002-8eb44de1-2c33-4f95-a1be-8d1b4e35eb4a.c000.csv
file.csv/property=descr/part-00003-8eb44de1-2c33-4f95-a1be-8d1b4e35eb4a.c000.csv
file.csv/property=descr/part-00004-8eb44de1-2c33-4f95-a1be-8d1b4e35eb4a.c000.csv
file.csv/property=label/part-00001-8eb44de1-2c33-4f95-a1be-8d1b4e35eb4a.c000.csv
file.csv/property=label/part-00002-8eb44de1-2c33-4f95-a1be-8d1b4e35eb4a.c000.csv
file.csv/property=label/part-00003-8eb44de1-2c33-4f95-a1be-8d1b4e35eb4a.c000.csv
file.csv/property=label/part-00004-8eb44de1-2c33-4f95-a1be-8d1b4e35eb4a.c000.csv
file.csv/_SUCCESS
```
Data that is mis-organized when written ends up with the same number of files
in each of the sub-directories, even if some sub-directories contain only a fraction of
the number of rows of others. What you would like instead is fewer files in smaller
and more files in larger partition sub-directories. Further, all files should have
roughly the same number of rows.
For this, you have to first range partition the `Dataset` according to your partition columns.
```scala
ds.repartitionByRange($"property", $"id")
  .write
  .partitionBy("property")
  .csv("file.csv")
```
This organizes the data optimally for partitioned writing by column `property`:
```
file.csv/property=descr/part-00000-6317db5e-5161-41f1-8227-ffeaf06a3e41.c000.csv
file.csv/property=descr/part-00001-6317db5e-5161-41f1-8227-ffeaf06a3e41.c000.csv
file.csv/property=label/part-00002-6317db5e-5161-41f1-8227-ffeaf06a3e41.c000.csv
file.csv/property=label/part-00003-6317db5e-5161-41f1-8227-ffeaf06a3e41.c000.csv
file.csv/property=label/part-00004-6317db5e-5161-41f1-8227-ffeaf06a3e41.c000.csv
file.csv/_SUCCESS
```
This brings all rows with the same values in the `property` and `id` columns into the same file.
If you need each file to further be sorted by additional columns, e.g. `ts`, then you can do this with `sortWithinPartitions`.
```scala
ds.repartitionByRange($"property", $"id")
  .sortWithinPartitions($"property", $"id", $"ts")
  .cache // this is needed for Spark 3.0 to 3.3 with AQE enabled: SPARK-40588
  .write
  .partitionBy("property")
  .csv("file.csv")
```
Sometimes you want to write-partition by some expression that is not a column of your data,
e.g. the date-representation of the `ts` column.
ds.withColumn("date", $"ts".cast(DateType))
.repartitionByRange($"date", $"id")
.sortWithinPartitions($"date", $"id", $"ts")
.cache // this is needed for Spark 3.0 to 3.3 with AQE enabled: SPARK-40588
.write
.partitionBy("date")
.csv("file.csv")
All of the above constructs can be replaced with a single meaningful operation:
```scala
ds.writePartitionedBy(Seq($"ts".cast(DateType).as("date")), Seq($"id"), Seq($"ts"))
  .csv("file.csv")
```
For Spark 3.0 to 3.3 with AQE enabled (see [SPARK-40588](https://issues.apache.org/jira/browse/SPARK-40588)),
`writePartitionedBy` has to cache an internally created DataFrame. This can be unpersisted after writing
is finished. Provide an `UnpersistHandle` for this purpose:
```scala
val unpersist = UnpersistHandle()
ds.writePartitionedBy(…, unpersistHandle = Some(unpersist))
  .csv("file.csv")
unpersist()
```
More details about this issue can be found [here](https://www.gresearch.co.uk/blog/article/guaranteeing-in-partition-order-for-partitioned-writing-in-apache-spark/).
<!--
# Other Approaches
problems with `repartition()` instead of `repartitionByRange()`
problems with `repartitionByRange(cols).write.partitionBy(cols)`
-->
================================================
FILE: PYSPARK-DEPS.md
================================================
# PySpark dependencies
Using PySpark on a cluster requires all cluster nodes to have those Python packages installed that are required by the PySpark job.
Such a deployment can be cumbersome, especially when running in an interactive notebook.
The `spark-extension` package allows installing Python packages programmatically by the PySpark application itself (PySpark ≥ 3.1.0).
These packages are only accessible by that PySpark application, and they are removed on calling `spark.stop()`.
Either install the `spark-extension` Maven package, or the `pyspark-extension` PyPi package (on the driver only),
as described [here](README.md#using-spark-extension).
## Installing packages with `pip`
Python packages can be installed with `pip` as follows:
```python
# noinspection PyUnresolvedReferences
from gresearch.spark import *
spark.install_pip_package("pandas", "pyarrow")
```
The above example installs the PIP packages `pandas` and `pyarrow` via `pip`. The method `install_pip_package` takes any `pip` command line arguments:
```python
# install packages with version specs
spark.install_pip_package("pandas==1.4.3", "pyarrow~=8.0.0")
# install packages from package sources (e.g. git clone https://github.com/pandas-dev/pandas.git)
spark.install_pip_package("./pandas/")
# install packages from git repo
spark.install_pip_package("git+https://github.com/pandas-dev/pandas.git@main")
# use a pip cache directory to cache downloaded and built whl files
spark.install_pip_package("pandas", "pyarrow", "--cache-dir", "/home/user/.cache/pip")
# use an alternative index url (other than https://pypi.org/simple)
spark.install_pip_package("pandas", "pyarrow", "--index-url", "https://artifacts.company.com/pypi/simple")
# install pip packages quietly (only disables output of PIP)
spark.install_pip_package("pandas", "pyarrow", "--quiet")
```
## Installing Python projects with Poetry
Python projects can be installed from sources, including their dependencies, using [Poetry](https://python-poetry.org/):
```python
# noinspection PyUnresolvedReferences
from gresearch.spark import *
spark.install_poetry_project("../my-poetry-project/", poetry_python="../venv-poetry/bin/python")
```
## Example
This example uses `install_pip_package` in a Spark standalone cluster.
First, check out the example code:
```shell
git clone https://github.com/G-Research/spark-extension.git
cd spark-extension/examples/python-deps
```
Build a Docker image based on the official Spark release:
```shell
docker build -t spark-extension-example-docker .
```
Start the example Spark standalone cluster consisting of a Spark master and one worker:
```shell
docker compose -f docker-compose.yml up -d
```
Run the `example.py` Spark application on the example cluster:
```shell
docker exec spark-master spark-submit --master spark://master:7077 --packages uk.co.gresearch.spark:spark-extension_2.12:2.15.0-3.5 /example/example.py
```
The `--packages uk.co.gresearch.spark:spark-extension_2.12:2.15.0-3.5` argument
tells `spark-submit` to add the `spark-extension` Maven package to the Spark job.
Alternatively, install the `pyspark-extension` PyPi package via `pip install` and remove the `--packages` argument from `spark-submit`:
```shell
docker exec spark-master pip install --user pyspark_extension==2.15.0.3.5
docker exec spark-master spark-submit --master spark://master:7077 /example/example.py
```
This output proves that PySpark could call into the function `func`, which only works when Pandas and PyArrow are installed:
```
+---+
| id|
+---+
| 0|
| 1|
| 2|
+---+
```
Test that `spark.install_pip_package("pandas", "pyarrow")` is really required by this example by removing this line from `example.py` …
```diff
from pyspark.sql import SparkSession
def main():
spark = SparkSession.builder.appName("spark_app").getOrCreate()
def func(df):
return df
from gresearch.spark import install_pip_package
- spark.install_pip_package("pandas", "pyarrow")
spark.range(0, 3, 1, 5).mapInPandas(func, "id long").show()
if __name__ == "__main__":
main()
```
… and running the `spark-submit` command again. The example does not work anymore,
because the Pandas and PyArrow packages are missing from the driver:
```
Traceback (most recent call last):
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/pandas/utils.py", line 27, in require_minimum_pandas_version
ModuleNotFoundError: No module named 'pandas'
```
Finally, shutdown the example cluster:
```shell
docker compose -f docker-compose.yml down
```
## Known Issues
Note that this feature is not supported in Python when connected with a [Spark Connect server](README.md#spark-connect-server).
================================================
FILE: README.md
================================================
# Spark Extension
This project provides extensions to the [Apache Spark project](https://spark.apache.org/) in Scala and Python:
**[Diff](DIFF.md):** A `diff` transformation and application for `Dataset`s that computes the differences between
two datasets, i.e. which rows to _add_, _delete_ or _change_ to get from one dataset to the other.
**[SortedGroups](GROUPS.md):** A `groupByKey` transformation that groups rows by a key while providing
a **sorted** iterator for each group. Similar to `Dataset.groupByKey.flatMapGroups`, but with order guarantees
for the iterator.
**[Histogram](HISTOGRAM.md) [<sup>[*]</sup>](#spark-connect-server):** A `histogram` transformation that computes the histogram DataFrame for a value column.
**[Global Row Number](ROW_NUMBER.md) [<sup>[*]</sup>](#spark-connect-server):** A `withRowNumbers` transformation that provides the global row number w.r.t.
the current order of the Dataset, or any given order. In contrast to the existing SQL function `row_number`, which
requires a window spec, this transformation provides the row number across the entire Dataset without scaling problems.
**[Partitioned Writing](PARTITIONING.md):** The `writePartitionedBy` action writes your `Dataset` partitioned and
efficiently laid out with a single operation.
**[Inspect Parquet files](PARQUET.md) [<sup>[*]</sup>](#spark-connect-server):** The structure of Parquet files (the metadata, not the data stored in Parquet) can be inspected similarly to [parquet-tools](https://pypi.org/project/parquet-tools/)
or [parquet-cli](https://pypi.org/project/parquet-cli/) by reading from a simple Spark data source.
This simplifies identifying why some Parquet files cannot be split by Spark into scalable partitions.
**[Install Python packages into PySpark job](PYSPARK-DEPS.md) [<sup>[*]</sup>](#spark-connect-server):** Install Python dependencies via PIP or Poetry programmatically into your running PySpark job (PySpark ≥ 3.1.0):
```python
# noinspection PyUnresolvedReferences
from gresearch.spark import *
# using PIP
spark.install_pip_package("pandas==1.4.3", "pyarrow")
spark.install_pip_package("-r", "requirements.txt")
# using Poetry
spark.install_poetry_project("../my-poetry-project/", poetry_python="../venv-poetry/bin/python")
```
**[Fluent method call](CONDITIONAL.md):** `T.call(transformation: T => R): R`: Turns a transformation `T => R`
that is not part of `T` into a fluent method call on `T`. This allows writing fluent code like:
```scala
import uk.co.gresearch._
i.doThis()
.doThat()
.call(transformation)
.doMore()
```
**[Fluent conditional method call](CONDITIONAL.md):** `T.when(condition: Boolean).call(transformation: T => T): T`:
Perform a transformation fluently only if the given condition is true.
This allows writing fluent code like:
```scala
import uk.co.gresearch._
i.doThis()
.doThat()
.when(condition).call(transformation)
.doMore()
```
**[Shortcut for groupBy.as](https://github.com/G-Research/spark-extension/pull/213#issue-2032837105)**: Calling `Dataset.groupBy(Column*).as[K, T]`
should be preferred over calling `Dataset.groupByKey(V => K)` whenever possible. The former allows Catalyst to exploit
existing partitioning and ordering of the Dataset, while the latter hides from Catalyst which columns are used to create the keys.
This can have a significant performance penalty.
<details>
<summary>Details:</summary>
The new column-expression-based `groupByKey[K](Column*)` method makes it easier to group by a column expression key. Instead of
```scala
ds.groupBy($"id").as[Int, V]
```
use:
```scala
ds.groupByKey[Int]($"id")
```
</details>
**Backticks:** `backticks(string: String, strings: String*): String`: Encloses the given column name with backticks (`` ` ``) when needed.
This is a handy way to ensure column names with special characters like dots (`.`) work with `col()` or `select()`.
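A minimal sketch, assuming a DataFrame `df` with a column literally named `a.column`:
```scala
// Scala – a sketch: the dot in "a.column" requires backticks when used with col()
import org.apache.spark.sql.functions.col
import uk.co.gresearch.spark._
df.select(col(backticks("a.column")))
```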
**Count null values:** `count_null(e: Column)`: an aggregation function like `count` that counts null values in column `e`.
This is equivalent to calling `count(when(e.isNull, lit(1)))`.
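A minimal sketch, assuming a DataFrame `df` with columns `id` and `value`:
```scala
// Scala – a sketch: per id, count all values and the null values of column "value"
import org.apache.spark.sql.functions.{col, count}
import uk.co.gresearch.spark._
df.groupBy(col("id")).agg(count(col("value")), count_null(col("value")))
```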
**.Net DateTime.Ticks[<sup>[*]</sup>](#spark-connect-server):** Convert .Net (C#, F#, Visual Basic) `DateTime.Ticks` into Spark timestamps, seconds and nanoseconds.
<details>
<summary>Available methods:</summary>
```scala
// Scala
dotNetTicksToTimestamp(Column): Column // returns timestamp as TimestampType
dotNetTicksToUnixEpoch(Column): Column // returns Unix epoch seconds as DecimalType
dotNetTicksToUnixEpochNanos(Column): Column // returns Unix epoch nanoseconds as LongType
```
The reverse is provided by (all return `LongType` .Net ticks):
```scala
// Scala
timestampToDotNetTicks(Column): Column
unixEpochToDotNetTicks(Column): Column
unixEpochNanosToDotNetTicks(Column): Column
```
These methods are also available in Python:
```python
# Python
dotnet_ticks_to_timestamp(column_or_name) # returns timestamp as TimestampType
dotnet_ticks_to_unix_epoch(column_or_name) # returns Unix epoch seconds as DecimalType
dotnet_ticks_to_unix_epoch_nanos(column_or_name) # returns Unix epoch nanoseconds as LongType
timestamp_to_dotnet_ticks(column_or_name)
unix_epoch_to_dotnet_ticks(column_or_name)
unix_epoch_nanos_to_dotnet_ticks(column_or_name)
```
</details>
**Spark temporary directory[<sup>[*]</sup>](#spark-connect-server)**: Create a temporary directory that will be removed on Spark application shutdown.
<details>
<summary>Examples:</summary>
Scala:
```scala
import uk.co.gresearch.spark.createTemporaryDir
val dir = createTemporaryDir("prefix")
```
Python:
```python
# noinspection PyUnresolvedReferences
from gresearch.spark import *
dir = spark.create_temporary_dir("prefix")
```
</details>
**Spark job description[<sup>[*]</sup>](#spark-connect-server):** Set Spark job description for all Spark jobs within a context.
<details>
<summary>Examples:</summary>
```scala
import uk.co.gresearch.spark._
implicit val session: SparkSession = spark
withJobDescription("parquet file") {
val df = spark.read.parquet("data.parquet")
val count = appendJobDescription("count") {
df.count
}
appendJobDescription("write") {
df.write.csv("data.csv")
}
}
```
| Without job description | With job description |
|:---:|:---:|
|  |  |
Note that setting a description in one thread while calling the action (e.g. `.count`) in a different thread
does not work, unless the different thread is spawned from the current thread _after_ the description has been set.
Working example with parallel collections:
```scala
import java.util.concurrent.ForkJoinPool
import scala.collection.parallel.CollectionConverters.seqIsParallelizable
import scala.collection.parallel.ForkJoinTaskSupport
val files = Seq("data1.csv", "data2.csv").par
val counts = withJobDescription("Counting rows") {
// new thread pool required to spawn new threads from this thread
// so that the job description is actually used
files.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool())
files.map(filename => spark.read.csv(filename).count).sum
}(spark)
```
</details>
## Using Spark Extension
The `spark-extension` package is available for all Spark 3.2, 3.3, 3.4 and 3.5 versions.
The package version has the following semantics: `spark-extension_{SCALA_COMPAT_VERSION}-{VERSION}-{SPARK_COMPAT_VERSION}`:
- `SCALA_COMPAT_VERSION`: Scala binary compatibility (minor) version. Available are `2.12` and `2.13`.
- `SPARK_COMPAT_VERSION`: Apache Spark binary compatibility (minor) version. Available are `3.2`, `3.3`, `3.4`, `3.5` and `4.0`.
- `VERSION`: The package version, e.g. `2.14.0`.
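For example, `uk.co.gresearch.spark:spark-extension_2.13:2.14.0-3.5` refers to package version `2.14.0` built for Scala `2.13` and Spark `3.5`.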
### SBT
Add this line to your `build.sbt` file:
```sbt
libraryDependencies += "uk.co.gresearch.spark" %% "spark-extension" % "2.15.0-3.5"
```
### Maven
Add this dependency to your `pom.xml` file:
```xml
<dependency>
<groupId>uk.co.gresearch.spark</groupId>
<artifactId>spark-extension_2.12</artifactId>
<version>2.15.0-3.5</version>
</dependency>
```
### Gradle
Add this dependency to your `build.gradle` file:
```groovy
dependencies {
implementation "uk.co.gresearch.spark:spark-extension_2.12:2.15.0-3.5"
}
```
### Spark Submit
Submit your Spark app with the Spark Extension dependency (version ≥1.1.0) as follows:
```shell script
spark-submit --packages uk.co.gresearch.spark:spark-extension_2.12:2.15.0-3.5 [jar]
```
Note: Pick the right Scala version (here 2.12) and Spark version (here 3.5) depending on your Spark version.
### Spark Shell
Launch a Spark Shell with the Spark Extension dependency (version ≥1.1.0) as follows:
```shell script
spark-shell --packages uk.co.gresearch.spark:spark-extension_2.12:2.15.0-3.5
```
Note: Pick the right Scala version (here 2.12) and Spark version (here 3.5) depending on your Spark Shell version.
### Python
#### PySpark API
Start a PySpark session with the Spark Extension dependency (version ≥1.1.0) as follows:
```python
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.config("spark.jars.packages", "uk.co.gresearch.spark:spark-extension_2.12:2.15.0-3.5") \
.getOrCreate()
```
Note: Pick the right Scala version (here 2.12) and Spark version (here 3.5) depending on your PySpark version.
#### PySpark REPL
Launch the Python Spark REPL with the Spark Extension dependency (version ≥1.1.0) as follows:
```shell script
pyspark --packages uk.co.gresearch.spark:spark-extension_2.12:2.15.0-3.5
```
Note: Pick the right Scala version (here 2.12) and Spark version (here 3.5) depending on your PySpark version.
#### PySpark `spark-submit`
Run your Python scripts that use PySpark via `spark-submit`:
```shell script
spark-submit --packages uk.co.gresearch.spark:spark-extension_2.12:2.15.0-3.5 [script.py]
```
Note: Pick the right Scala version (here 2.12) and Spark version (here 3.5) depending on your Spark version.
#### PyPi package (local Spark cluster only)
You may want to install the `pyspark-extension` python package from PyPi into your development environment.
This provides code completion, typing and test capabilities during your development phase.
Running your Python application on a Spark cluster will still require one of the above ways
to add the Scala package to the Spark environment.
```shell script
pip install pyspark-extension==2.15.0.3.5
```
Note: Pick the right Spark version (here 3.5) depending on your PySpark version.
### Your favorite Data Science notebook
There are plenty of [Data Science notebooks](https://datasciencenotebook.org/) around. To use this library,
add **a jar dependency** to your notebook using these **Maven coordinates**:
```
uk.co.gresearch.spark:spark-extension_2.12:2.15.0-3.5
```
Or [download the jar](https://mvnrepository.com/artifact/uk.co.gresearch.spark/spark-extension) and place it
on a filesystem where it is accessible by the notebook, and reference that jar file directly.
Check the documentation of your favorite notebook to learn how to add jars to your Spark environment.
## Known issues
### Spark Connect Server
Most features are not supported **in Python** in conjunction with a [Spark Connect server](https://spark.apache.org/docs/latest/spark-connect-overview.html).
This also holds for Databricks Runtime environment 13.x and above. Details can be found [in this blog](https://semyonsinchenko.github.io/ssinchenko/post/how-databricks-14x-breaks-3dparty-compatibility/).
Calling any of those features when connected to a Spark Connect server will raise this error:
```
This feature is not supported for Spark Connect.
```
Use a classic connection to a Spark cluster instead.
## Build
You can build this project against different versions of Spark and Scala.
### Switch Spark and Scala version
If you want to build for a Spark or Scala version different to what is defined in the `pom.xml` file, then run
```shell script
sh set-version.sh [SPARK-VERSION] [SCALA-VERSION]
```
For example, switch to Spark 3.5.0 and Scala 2.13.8 by running `sh set-version.sh 3.5.0 2.13.8`.
### Build the Scala project
Then execute `mvn package` to create a jar from the sources. It can be found in `target/`.
## Testing
Run the Scala tests via `mvn test`.
### Setup Python environment
In order to run the Python tests, set up a Python environment as follows:
```shell script
virtualenv -p python3 venv
source venv/bin/activate
pip install python/[test]
```
### Run Python tests
Run the Python tests via `env PYTHONPATH=python/test python -m pytest python/test`.
### Build Python package
Run the following commands in the project root directory to create a whl from the sources:
```shell script
pip install build
python -m build python/
```
It can be found in `python/dist/`.
## Publications
- ***Guaranteeing in-partition order for partitioned-writing in Apache Spark**, Enrico Minack, 20/01/2023*:<br/>https://www.gresearch.com/blog/article/guaranteeing-in-partition-order-for-partitioned-writing-in-apache-spark/
- ***Un-pivot, sorted groups and many bug fixes: Celebrating the first Spark 3.4 release**, Enrico Minack, 21/03/2023*:<br/>https://www.gresearch.com/blog/article/un-pivot-sorted-groups-and-many-bug-fixes-celebrating-the-first-spark-3-4-release/
- ***A PySpark bug makes co-grouping with window function partition-key-order-sensitive**, Enrico Minack, 29/03/2023*:<br/>https://www.gresearch.com/blog/article/a-pyspark-bug-makes-co-grouping-with-window-function-partition-key-order-sensitive/
- ***Spark’s groupByKey should be avoided – and here’s why**, Enrico Minack, 13/06/2023*:<br/>https://www.gresearch.com/blog/article/sparks-groupbykey-should-be-avoided-and-heres-why/
- ***Inspecting Parquet files with Spark**, Enrico Minack, 28/07/2023*:<br/>https://www.gresearch.com/blog/article/parquet-files-know-your-scaling-limits/
- ***Enhancing Spark’s UI with Job Descriptions**, Enrico Minack, 12/12/2023*:<br/>https://www.gresearch.com/blog/article/enhancing-sparks-ui-with-job-descriptions/
- ***PySpark apps with dependencies: Managing Python dependencies in code**, Enrico Minack, 24/01/2024*:<br/>https://www.gresearch.com/news/pyspark-apps-with-dependencies-managing-python-dependencies-in-code/
- ***Observing Spark Aggregates: Cheap Metrics from Datasets**, Enrico Minack, 06/02/2024*:<br/>https://www.gresearch.com/news/observing-spark-aggregates-cheap-metrics-from-datasets-2/
## Security
Please see our [security policy](https://github.com/G-Research/spark-extension/blob/master/SECURITY.md) for details on reporting security vulnerabilities.
================================================
FILE: RELEASE.md
================================================
# Releasing Spark Extension
This document provides instructions on how to release a version of `spark-extension`. We release this library
for a number of Spark and Scala environments, all from the same git tag. First, release for the environment
that is set in the `pom.xml` and create a tag. On success, release from that tag for all other environments
as described below.
Use the `release.sh` script to test and release all versions. Or execute the following steps manually.
## Testing master for all environments
The following steps release a snapshot and test it. Test all versions listed [further down](#releasing-master-for-other-environments).
- Set the version with `./set-version.sh`, e.g. `./set-version.sh 3.4.0 2.12.17`
- Release a snapshot (make sure the version in the `pom.xml` file ends with `SNAPSHOT`): `mvn clean deploy`
- Test the released snapshot: `./test-release.sh`
## Releasing from master
Follow this procedure to release a new version:
- Add a new entry to `CHANGELOG.md` listing all notable changes of this release.
Use the heading `## [VERSION] - YYYY-MM-dd`, e.g. `## [1.1.0] - 2020-03-12`.
- Remove the `-SNAPSHOT` suffix from the version, e.g. `./set-version.sh 1.1.0`.
- Update the versions in the `README.md` and `python/README.md` file to the version of your `pom.xml` to reflect the latest version,
e.g. replace all `1.0.0-3.1` with `1.1.0-3.1` and `1.0.0.3.1` with `1.1.0.3.1`, respectively.
- Commit the change to your local git repository, use a commit message like `Releasing 1.1.0`. Do not push to github yet.
- Tag that commit with a version tag like `v1.1.0` and message like `Release v1.1.0`. Do not push to github yet.
- Release the version with `mvn clean deploy`. This will be put into a staging repository and not automatically released (due to `<autoReleaseAfterClose>false</autoReleaseAfterClose>` in your [`pom.xml`](pom.xml) file).
- Inspect and test the staged version. Use `./test-release.sh` or the `spark-examples` project for that. If you are happy with everything:
- Push the commit and tag to origin.
- Release the package with `mvn nexus-staging:release`.
- Bump the version to the next [minor version](https://semver.org/) and append the `-SNAPSHOT` suffix again: `./set-version.sh 1.2.0-SNAPSHOT`.
- Commit this change to your local git repository, use a commit message like `Post-release version bump to 1.2.0`.
- Push all local commits to origin.
- Otherwise drop it with `mvn nexus-staging:drop`. Remove the last two commits from your local history.
## Releasing master for other environments
Once you have released the new version, release from the same tag for all other Spark and Scala environments as well:
- Release for the following environments; one of these has already been released above from the tagged version:
|Spark|Scala|
|:----|:----|
|3.2 |2.12.15 and 2.13.5|
|3.3 |2.12.15 and 2.13.8|
|3.4 |2.12.17 and 2.13.8|
|3.5 |2.12.17 and 2.13.8|
- Always use the latest Spark version per Spark minor version
- Release process:
- Checkout the release tag, e.g. `git checkout v1.0.0`
- Set the version in the `pom.xml` file via `set-version.sh`, e.g. `./set-version.sh 3.4.0 2.12.17`
- Review the `pom.xml` file changes: `git diff pom.xml`
- Release the version with `mvn clean deploy`
- Inspect and test the staged version. Use `./test-release.sh` or the `spark-examples` project for that.
- If you are happy with everything, release the package with `mvn nexus-staging:release`.
- Otherwise drop it with `mvn nexus-staging:drop`.
- Revert the changes done to the `pom.xml` file: `git checkout pom.xml`
## Releasing a bug-fix version
A bug-fix version needs to be released from a [minor-version branch](https://semver.org/), e.g. `branch-1.1`.
### Create a bug-fix branch
If there is no bug-fix branch yet, create it:
- Create such a branch from the respective [minor-version tag](https://semver.org/), e.g. create minor version branch `branch-1.1` from tag `v1.1.0`.
- Bump the version to the next [patch version](https://semver.org/) in `pom.xml` and append the `-SNAPSHOT` suffix again, e.g. `1.1.0` → `1.1.1-SNAPSHOT`.
- Commit this change to your local git repository, use a commit message like `Post-release version bump to 1.1.1`.
- Push this commit to origin.
Merge your bug fixes into this branch as you would normally do for master, use PRs for that.
### Release from a bug-fix branch
This is very similar to [releasing from master](#releasing-from-master),
but the version increment occurs on [patch level](https://semver.org/):
- Add a new entry to `CHANGELOG.md` listing all notable changes of this release.
Use the heading `## [VERSION] - YYYY-MM-dd`, e.g. `## [1.1.1] - 2020-03-12`.
- Remove the `-SNAPSHOT` suffix from the version, e.g. `./set-version.sh 1.1.1`.
- Update the versions in the `README.md` and `python/README.md` file to the version of your `pom.xml` to reflect the latest version,
e.g. replace all `1.1.0-3.1` with `1.1.1-3.1` and `1.1.0.3.1` with `1.1.1.3.1`, respectively.
- Commit the change to your local git repository, use a commit message like `Releasing 1.1.1`. Do not push to github yet.
- Tag that commit with a version tag like `v1.1.1` and message like `Release v1.1.1`. Do not push to github yet.
- Release the version with `mvn clean deploy`. This will be put into a staging repository and not automatically released (due to `<autoReleaseAfterClose>false</autoReleaseAfterClose>` in your [`pom.xml`](pom.xml) file).
- Inspect and test the staged version. Use `./test-release.sh` or the `spark-examples` project for that. If you are happy with everything:
- Push the commit and tag to origin.
- Release the package with `mvn nexus-staging:release`.
- Bump the version to the next [patch version](https://semver.org/) and append the `-SNAPSHOT` suffix again: `./set-version.sh 1.1.2-SNAPSHOT`.
- Commit this change to your local git repository, use a commit message like `Post-release version bump to 1.1.2`.
- Push all local commits to origin.
- Otherwise drop it with `mvn nexus-staging:drop`. Remove the last two commits from your local history.
Consider releasing the bug-fix version for other environments as well. See [above](#releasing-master-for-other-environments) section for details.
================================================
FILE: ROW_NUMBER.md
================================================
# Global Row Number
Spark provides the [SQL function `row_number`](https://spark.apache.org/docs/latest/api/sql/index.html#row_number),
which assigns each row a consecutive number, starting from 1. This function works on a [Window](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/expressions/Window.html).
Assigning a row number over the entire Dataset this way moves all rows into a single partition / executor, which does not scale.
This project provides the `Dataset` transformation `withRowNumbers`, which assigns a global row number in a scalable way:
```scala
val df = Seq((1, "one"), (2, "TWO"), (2, "two"), (3, "three")).toDF("id", "value")
df.show()
// +---+-----+
// | id|value|
// +---+-----+
// | 1| one|
// | 2| TWO|
// | 2| two|
// | 3|three|
// +---+-----+
import uk.co.gresearch.spark._
df.withRowNumbers().show()
// +---+-----+----------+
// | id|value|row_number|
// +---+-----+----------+
// | 1| one| 1|
// | 2| two| 2|
// | 2| TWO| 3|
// | 3|three| 4|
// +---+-----+----------+
```
In Java:
```java
import uk.co.gresearch.spark.RowNumbers;
RowNumbers.of(df).show();
// +---+-----+----------+
// | id|value|row_number|
// +---+-----+----------+
// | 1| one| 1|
// | 2| two| 2|
// | 2| TWO| 3|
// | 3|three| 4|
// +---+-----+----------+
```
In Python:
```python
import gresearch.spark
df.with_row_numbers().show()
# +---+-----+----------+
# | id|value|row_number|
# +---+-----+----------+
# | 1| one| 1|
# | 2| two| 2|
# | 2| TWO| 3|
# | 3|three| 4|
# +---+-----+----------+
```
## Row number order
Row numbers are assigned in the current order of the Dataset. If you want a specific order, provide columns as follows:
```scala
df.withRowNumbers($"id".desc, $"value").show()
// +---+-----+----------+
// | id|value|row_number|
// +---+-----+----------+
// | 3|three| 1|
// | 2| TWO| 2|
// | 2| two| 3|
// | 1| one| 4|
// +---+-----+----------+
```
In Java:
```java
RowNumbers.withOrderColumns(df.col("id").desc(), df.col("value")).of(df).show();
// +---+-----+----------+
// | id|value|row_number|
// +---+-----+----------+
// | 3|three| 1|
// | 2| TWO| 2|
// | 2| two| 3|
// | 1| one| 4|
// +---+-----+----------+
```
In Python:
```python
df.with_row_numbers(order=[df.id.desc(), df.value]).show()
# +---+-----+----------+
# | id|value|row_number|
# +---+-----+----------+
# | 3|three| 1|
# | 2| TWO| 2|
# | 2| two| 3|
# | 1| one| 4|
# +---+-----+----------+
```
## Row number column name
The column name that contains the row number can be changed by providing the `rowNumberColumnName` argument:
```scala
df.withRowNumbers(rowNumberColumnName="row").show()
// +---+-----+---+
// | id|value|row|
// +---+-----+---+
// | 1| one| 1|
// | 2| TWO| 2|
// | 2| two| 3|
// | 3|three| 4|
// +---+-----+---+
```
In Java:
```java
RowNumbers.withRowNumberColumnName("row").of(df).show();
// +---+-----+---+
// | id|value|row|
// +---+-----+---+
// | 1| one| 1|
// | 2| TWO| 2|
// | 2| two| 3|
// | 3|three| 4|
// +---+-----+---+
```
In Python:
```python
df.with_row_numbers(row_number_column_name='row').show()
# +---+-----+---+
# | id|value|row|
# +---+-----+---+
# | 1| one| 1|
# | 2| TWO| 2|
# | 2| two| 3|
# | 3|three| 4|
# +---+-----+---+
```
## Cached / persisted intermediate Dataset
The `withRowNumbers` transformation
[caches](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html#cache():Dataset.this.type) /
[persists](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html#persist(newLevel:org.apache.spark.storage.StorageLevel):Dataset.this.type)
the input Dataset after adding an intermediate column. You can specify the persistence level through the `storageLevel` parameter:
```scala
import org.apache.spark.storage.StorageLevel
val dfWithRowNumbers = df.withRowNumbers(storageLevel=StorageLevel.DISK_ONLY)
```
In Java:
```java
import org.apache.spark.storage.StorageLevel;
Dataset<Row> dfWithRowNumbers = RowNumbers.withStorageLevel(StorageLevel.DISK_ONLY()).of(df);
```
In Python:
```python
from pyspark.storagelevel import StorageLevel
df_with_row_numbers = df.with_row_numbers(storage_level=StorageLevel.DISK_ONLY)
```
## Un-persist intermediate Dataset
If you want control over when to un-persist this intermediate Dataset, you can provide an `UnpersistHandle` and call it
when you are done with the result Dataset:
```scala
import uk.co.gresearch.spark.UnpersistHandle
val unpersist = UnpersistHandle()
val dfWithRowNumbers = df.withRowNumbers(unpersistHandle=unpersist)
// after you are done with dfWithRowNumbers you may want to call unpersist()
unpersist(blocking=false)
```
In Java:
```java
import uk.co.gresearch.spark.UnpersistHandle;
UnpersistHandle unpersist = new UnpersistHandle();
Dataset<Row> dfWithRowNumbers = RowNumbers.withUnpersistHandle(unpersist).of(df);
// after you are done with dfWithRowNumbers you may want to call unpersist()
unpersist.apply(true);
```
In Python:
```python
unpersist = spark.unpersist_handle()
df_with_row_numbers = df.with_row_numbers(unpersist_handle=unpersist)
# after you are done with df_with_row_numbers you may want to call unpersist()
unpersist(blocking=True)
```
## Spark warning
You will notice that Spark logs the following warning:
```
WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
```
This warning is unavoidable, because `withRowNumbers` has to pull information about the initial partitions into a single partition.
Fortunately, only 12 bytes per input partition are required, so even a million input partitions amount to roughly 12 MB. This easily fits into a single partition, and the warning can safely be ignored.
## Known issues
Note that this feature is not supported in Python when connected to a [Spark Connect server](README.md#spark-connect-server).
================================================
FILE: SECURITY.md
================================================
# Security and Coordinated Vulnerability Disclosure Policy
This project appreciates and encourages coordinated disclosure of security vulnerabilities. We prefer that you use the GitHub reporting mechanism to privately report vulnerabilities. Under the main repository's security tab, click "Report a vulnerability" to open the advisory form.
If you are unable to report it via GitHub, have received no response after repeated attempts, or have other security-related questions, please contact security@gr-oss.io and mention this project in the subject line.
================================================
FILE: build-whl.sh
================================================
#!/bin/bash
set -eo pipefail
base=$(cd "$(dirname "$0")"; pwd)
version=$(grep --max-count=1 "<version>.*</version>" "$base/pom.xml" | sed -E -e "s/\s*<[^>]+>//g")
artifact_id=$(grep --max-count=1 "<artifactId>.*</artifactId>" "$base/pom.xml" | sed -E -e "s/\s*<[^>]+>//g")
rm -rf "$base/python/pyspark/jars/$artifact_id-*.jar"
pip install build
python -m build "$base/python/"
# check for missing modules in whl file
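# derive the Python package version: SNAPSHOT becomes dev0 and dashes become dots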
pyversion=${version/SNAPSHOT/dev0}
pyversion=${pyversion//-/.}
missing="$(diff <(cd "$base/python"; find gresearch -type f | grep -v ".pyc$" | sort) <(unzip -l "$base/python/dist/pyspark_extension-${pyversion}-*.whl" | tail -n +4 | head -n -2 | sed -E -e "s/^ +//" -e "s/ +/ /g" | cut -d " " -f 4- | sort) | grep "^<" || true)"
if [ -n "$missing" ]
then
echo "These files are missing from the whl file:"
echo "$missing"
exit 1
fi
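# the wheel must bundle exactly one jar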
jars=$(unzip -l "$base/python/dist/pyspark_extension-${pyversion}-*.whl" | grep ".jar" | wc -l)
if [ $jars -ne 1 ]
then
echo "Expected exactly one jar in whl file, but $jars found!"
exit 1
fi
================================================
FILE: bump-version.sh
================================================
#!/bin/bash
#
# Copyright 2020 G-Research
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, s
SYMBOL INDEX (297 symbols across 18 files)
FILE: examples/python-deps/example.py
function main (line 3) | def main():
FILE: python/gresearch/spark/__init__.py
function _is_column (line 66) | def _is_column(obj: Any) -> bool:
function _is_column_or_str (line 70) | def _is_column_or_str(obj: Any) -> bool:
function _is_dataframe (line 74) | def _is_dataframe(obj: Any) -> bool:
function _check_java_pkg_is_installed (line 78) | def _check_java_pkg_is_installed(jvm: JVMView) -> bool:
function _get_jvm (line 91) | def _get_jvm(obj: Any) -> JVMView:
function _to_seq (line 127) | def _to_seq(jvm: JVMView, list: List[Any]) -> JavaObject:
function _to_map (line 132) | def _to_map(jvm: JVMView, map: Mapping[Any, Any]) -> JavaObject:
function backticks (line 136) | def backticks(*name_parts: str) -> str:
function distinct_prefix_for (line 145) | def distinct_prefix_for(existing: List[str]) -> str:
function handle_configured_case_sensitivity (line 158) | def handle_configured_case_sensitivity(column_name: str, case_sensitive:...
function list_contains_case_sensitivity (line 171) | def list_contains_case_sensitivity(column_names: Iterable[str], columnNa...
function list_filter_case_sensitivity (line 181) | def list_filter_case_sensitivity(column_names: Iterable[str], filter: It...
function list_diff_case_sensitivity (line 194) | def list_diff_case_sensitivity(column_names: Iterable[str], other: Itera...
function dotnet_ticks_to_timestamp (line 207) | def dotnet_ticks_to_timestamp(tick_column: Union[str, Column]) -> Column:
function dotnet_ticks_to_unix_epoch (line 237) | def dotnet_ticks_to_unix_epoch(tick_column: Union[str, Column]) -> Column:
function dotnet_ticks_to_unix_epoch_nanos (line 267) | def dotnet_ticks_to_unix_epoch_nanos(tick_column: Union[str, Column]) ->...
function timestamp_to_dotnet_ticks (line 297) | def timestamp_to_dotnet_ticks(timestamp_column: Union[str, Column]) -> C...
function unix_epoch_to_dotnet_ticks (line 326) | def unix_epoch_to_dotnet_ticks(unix_column: Union[str, Column]) -> Column:
function unix_epoch_nanos_to_dotnet_ticks (line 357) | def unix_epoch_nanos_to_dotnet_ticks(unix_column: Union[str, Column]) ->...
function count_null (line 389) | def count_null(e: "ColumnOrName") -> Column:
function histogram (line 410) | def histogram(self: DataFrame,
class UnpersistHandle (line 441) | class UnpersistHandle:
method __init__ (line 442) | def __init__(self, handle):
method __call__ (line 445) | def __call__(self, blocking: Optional[bool] = None):
function unpersist_handle (line 453) | def unpersist_handle(self: SparkSession) -> UnpersistHandle:
function _get_sort_cols (line 462) | def _get_sort_cols(df: DataFrame, order: Union[str, Column, List[Union[s...
function with_row_numbers (line 472) | def with_row_numbers(self: DataFrame,
function session (line 500) | def session(self: DataFrame) -> SparkSession:
function session_or_ctx (line 504) | def session_or_ctx(self: DataFrame) -> Union[SparkSession, SQLContext]:
function set_description (line 515) | def set_description(description: Optional[str], if_not_set: bool = False):
function job_description (line 527) | def job_description(description: str, if_not_set: bool = False):
function append_description (line 555) | def append_description(extra_description: str, separator: str = " - "):
function append_job_description (line 566) | def append_job_description(extra_description: str, separator: str = " - "):
function create_temporary_dir (line 593) | def create_temporary_dir(spark: Union[SparkSession, SparkContext], prefi...
function install_pip_package (line 612) | def install_pip_package(spark: Union[SparkSession, SparkContext], *packa...
function install_poetry_project (line 652) | def install_poetry_project(spark: Union[SparkSession, SparkContext],
FILE: python/gresearch/spark/diff/__init__.py
class deprecated (line 41) | class deprecated:
method __init__ (line 42) | def __init__(self, msg: str) -> None:
method __call__ (line 45) | def __call__(self, func: _T) -> _T:
class DiffMode (line 55) | class DiffMode(Enum):
method _to_java (line 64) | def _to_java(self, jvm: JVMView) -> JavaObject:
class DiffOptions (line 69) | class DiffOptions:
method with_diff_column (line 108) | def with_diff_column(self, diff_column: str) -> 'DiffOptions':
method with_left_column_prefix (line 121) | def with_left_column_prefix(self, left_column_prefix: str) -> 'DiffOpt...
method with_right_column_prefix (line 134) | def with_right_column_prefix(self, right_column_prefix: str) -> 'DiffO...
method with_insert_diff_value (line 147) | def with_insert_diff_value(self, insert_diff_value: str) -> 'DiffOptio...
method with_change_diff_value (line 160) | def with_change_diff_value(self, change_diff_value: str) -> 'DiffOptio...
method with_delete_diff_value (line 173) | def with_delete_diff_value(self, delete_diff_value: str) -> 'DiffOptio...
method with_nochange_diff_value (line 186) | def with_nochange_diff_value(self, nochange_diff_value: str) -> 'DiffO...
method with_change_column (line 199) | def with_change_column(self, change_column: str) -> 'DiffOptions':
method without_change_column (line 212) | def without_change_column(self) -> 'DiffOptions':
method with_diff_mode (line 222) | def with_diff_mode(self, diff_mode: DiffMode) -> 'DiffOptions':
method with_sparse_mode (line 235) | def with_sparse_mode(self, sparse_mode: bool) -> 'DiffOptions':
method with_default_comparator (line 248) | def with_default_comparator(self, comparator: DiffComparator) -> 'Diff...
method with_data_type_comparator (line 252) | def with_data_type_comparator(self, comparator: DiffComparator, *data_...
method with_column_name_comparator (line 267) | def with_column_name_comparator(self, comparator: DiffComparator, *col...
method comparator_for (line 282) | def comparator_for(self, column: StructField) -> DiffComparator:
class Differ (line 292) | class Differ:
method __init__ (line 299) | def __init__(self, options: DiffOptions = None):
method diff (line 303) | def diff(self, left: DataFrame, right: DataFrame, *id_columns: str) ->...
method diff (line 306) | def diff(self, left: DataFrame, right: DataFrame, id_columns: Iterable...
method diff (line 308) | def diff(self, left: DataFrame, right: DataFrame, *id_or_ignore_column...
method _columns_of_side (line 392) | def _columns_of_side(df: DataFrame, id_columns: List[str], side_prefix...
method diffwith (line 398) | def diffwith(self, left: DataFrame, right: DataFrame, *id_columns: str...
method diffwith (line 401) | def diffwith(self, left: DataFrame, right: DataFrame, id_columns: Iter...
method diffwith (line 403) | def diffwith(self, left: DataFrame, right: DataFrame, *id_or_ignore_co...
method _check_schema (line 448) | def _check_schema(self, left: DataFrame, right: DataFrame, id_columns:...
method _get_change_column (line 551) | def _get_change_column(self,
method _do_diff (line 566) | def _do_diff(self, left: DataFrame, right: DataFrame, id_columns: List...
method _get_diff_id_columns (line 604) | def _get_diff_id_columns(self, pk_columns: List[str],
method _get_diff_value_columns (line 609) | def _get_diff_value_columns(self, pk_columns: List[str],
method _get_diff_columns (line 667) | def _get_diff_columns(self, pk_columns: List[str],
function diff (line 678) | def diff(self: DataFrame, other: DataFrame, *id_columns: str) -> DataFra...
function diff (line 682) | def diff(self: DataFrame, other: DataFrame, id_columns: Iterable[str], i...
function diff (line 686) | def diff(self: DataFrame, other: DataFrame, *id_or_ignore_columns: Union...
function diff (line 690) | def diff(self: DataFrame, other: DataFrame, options: DiffOptions, *id_co...
function diff (line 694) | def diff(self: DataFrame, other: DataFrame, options: DiffOptions, id_col...
function diff (line 698) | def diff(self: DataFrame, other: DataFrame, options: DiffOptions, *id_or...
function diff (line 701) | def diff(self: DataFrame, other: DataFrame, *options_or_id_or_ignore_col...
function diffwith (line 784) | def diffwith(self: DataFrame, other: DataFrame, *id_columns: str) -> Dat...
function diffwith (line 788) | def diffwith(self: DataFrame, other: DataFrame, id_columns: Iterable[str...
function diffwith (line 792) | def diffwith(self: DataFrame, other: DataFrame, *id_or_ignore_columns: U...
function diffwith (line 796) | def diffwith(self: DataFrame, other: DataFrame, options: DiffOptions, *i...
function diffwith (line 800) | def diffwith(self: DataFrame, other: DataFrame, options: DiffOptions, id...
function diffwith (line 804) | def diffwith(self: DataFrame, other: DataFrame, options: DiffOptions, *i...
function diffwith (line 807) | def diffwith(self: DataFrame, other: DataFrame, *options_or_id_or_ignore...
function diff_with_options (line 842) | def diff_with_options(self: DataFrame, other: DataFrame, options: DiffOp...
function diff_with_options (line 846) | def diff_with_options(self: DataFrame, other: DataFrame, options: DiffOp...
function diff_with_options (line 850) | def diff_with_options(self: DataFrame, other: DataFrame, options: DiffOp...
function diffwith_with_options (line 872) | def diffwith_with_options(self: DataFrame, other: DataFrame, options: Di...
function diffwith_with_options (line 876) | def diffwith_with_options(self: DataFrame, other: DataFrame, options: Di...
function diffwith_with_options (line 880) | def diffwith_with_options(self: DataFrame, other: DataFrame, options: Di...
FILE: python/gresearch/spark/diff/comparator/__init__.py
class DiffComparator (line 27) | class DiffComparator(abc.ABC):
method equiv (line 29) | def equiv(self, left: Column, right: Column) -> Column:
class DiffComparators (line 33) | class DiffComparators:
method default (line 35) | def default() -> 'DefaultDiffComparator':
method nullSafeEqual (line 39) | def nullSafeEqual() -> 'NullSafeEqualDiffComparator':
method epsilon (line 43) | def epsilon(epsilon: float) -> 'EpsilonDiffComparator':
method string (line 48) | def string(whitespace_agnostic: bool = True) -> 'StringDiffComparator':
method duration (line 53) | def duration(duration: str) -> 'DurationDiffComparator':
method map (line 58) | def map(key_type: DataType, value_type: DataType, key_order_sensitive:...
class NullSafeEqualDiffComparator (line 65) | class NullSafeEqualDiffComparator(DiffComparator):
method equiv (line 66) | def equiv(self, left: Column, right: Column) -> Column:
class DefaultDiffComparator (line 72) | class DefaultDiffComparator(NullSafeEqualDiffComparator):
method _to_java (line 74) | def _to_java(self, jvm: JVMView) -> JavaObject:
class EpsilonDiffComparator (line 79) | class EpsilonDiffComparator(DiffComparator):
method as_relative (line 84) | def as_relative(self) -> 'EpsilonDiffComparator':
method as_absolute (line 87) | def as_absolute(self) -> 'EpsilonDiffComparator':
method as_inclusive (line 90) | def as_inclusive(self) -> 'EpsilonDiffComparator':
method as_exclusive (line 93) | def as_exclusive(self) -> 'EpsilonDiffComparator':
method equiv (line 96) | def equiv(self, left: Column, right: Column) -> Column:
class StringDiffComparator (line 113) | class StringDiffComparator(DiffComparator):
method equiv (line 116) | def equiv(self, left: Column, right: Column) -> Column:
class DurationDiffComparator (line 123) | class DurationDiffComparator(DiffComparator):
method as_inclusive (line 127) | def as_inclusive(self) -> 'DurationDiffComparator':
method as_exclusive (line 130) | def as_exclusive(self) -> 'DurationDiffComparator':
method equiv (line 133) | def equiv(self, left: Column, right: Column) -> Column:
class MapDiffComparator (line 140) | class MapDiffComparator(DiffComparator):
method equiv (line 145) | def equiv(self, left: Column, right: Column) -> Column:
FILE: python/gresearch/spark/parquet/__init__.py
function _jreader (line 29) | def _jreader(reader: DataFrameReader) -> JavaObject:
function parquet_metadata (line 34) | def parquet_metadata(self: DataFrameReader, *paths: str, parallelism: Op...
function parquet_schema (line 69) | def parquet_schema(self: DataFrameReader, *paths: str, parallelism: Opti...
function parquet_blocks (line 104) | def parquet_blocks(self: DataFrameReader, *paths: str, parallelism: Opti...
function parquet_block_columns (line 136) | def parquet_block_columns(self: DataFrameReader, *paths: str, parallelis...
function parquet_partitions (line 172) | def parquet_partitions(self: DataFrameReader, *paths: str, parallelism: ...
FILE: python/setup.py
class custom_sdist (line 36) | class custom_sdist(sdist):
method make_distribution (line 37) | def make_distribution(self):
FILE: python/test/spark_common.py
function spark_session (line 30) | def spark_session():
class SparkTest (line 38) | class SparkTest(unittest.TestCase):
method main (line 41) | def main(file: str):
method get_pom_path (line 58) | def get_pom_path() -> str:
method get_spark_config (line 66) | def get_spark_config(path) -> SparkConf:
method get_spark_session (line 80) | def get_spark_session(cls) -> SparkSession:
method setUpClass (line 101) | def setUpClass(cls):
method tearDownClass (line 107) | def tearDownClass(cls):
method sql_conf (line 113) | def sql_conf(self, pairs):
FILE: python/test/test_diff.py
class DiffTest (line 27) | class DiffTest(SparkTest):
method assert_requirement (line 32) | def assert_requirement(self, error_message: str):
method setUpClass (line 38) | def setUpClass(cls):
method test_check_schema (line 206) | def test_check_schema(self):
method test_dataframe_diff (line 493) | def test_dataframe_diff(self):
method test_dataframe_diff_with_ids_ignored (line 497) | def test_dataframe_diff_with_ids_ignored(self):
method test_dataframe_diff_with_wrong_argument_types (line 501) | def test_dataframe_diff_with_wrong_argument_types(self):
method test_dataframe_diffwith (line 545) | def test_dataframe_diffwith(self):
method test_dataframe_diffwith_with_default_options (line 550) | def test_dataframe_diffwith_with_default_options(self):
method test_dataframe_diffwith_with_options (line 555) | def test_dataframe_diffwith_with_options(self):
method test_dataframe_diffwith_with_ignored (line 561) | def test_dataframe_diffwith_with_ignored(self):
method test_dataframe_diffwith_with_wrong_argument_types (line 566) | def test_dataframe_diffwith_with_wrong_argument_types(self):
method test_dataframe_diff_with_default_options (line 610) | def test_dataframe_diff_with_default_options(self):
method test_dataframe_diff_with_options (line 616) | def test_dataframe_diff_with_options(self):
method test_dataframe_diff_with_options_and_ignored (line 623) | def test_dataframe_diff_with_options_and_ignored(self):
method test_dataframe_diff_with_changes (line 630) | def test_dataframe_diff_with_changes(self):
method test_dataframe_diff_with_diff_mode_column_by_column (line 637) | def test_dataframe_diff_with_diff_mode_column_by_column(self):
method test_dataframe_diff_with_diff_mode_side_by_side (line 644) | def test_dataframe_diff_with_diff_mode_side_by_side(self):
method test_dataframe_diff_with_diff_mode_left_side (line 651) | def test_dataframe_diff_with_diff_mode_left_side(self):
method test_dataframe_diff_with_diff_mode_right_side (line 658) | def test_dataframe_diff_with_diff_mode_right_side(self):
method test_dataframe_diff_with_sparse_mode (line 665) | def test_dataframe_diff_with_sparse_mode(self):
method test_differ_diff (line 672) | def test_differ_diff(self):
method test_differ_diffwith (line 676) | def test_differ_diffwith(self):
method test_differ_diff_with_default_options (line 681) | def test_differ_diff_with_default_options(self):
method test_differ_diff_with_options (line 686) | def test_differ_diff_with_options(self):
method test_differ_diff_with_changes (line 691) | def test_differ_diff_with_changes(self):
method test_differ_diff_in_diff_mode_column_by_column (line 696) | def test_differ_diff_in_diff_mode_column_by_column(self):
method test_differ_diff_in_diff_mode_side_by_side (line 701) | def test_differ_diff_in_diff_mode_side_by_side(self):
method test_differ_diff_in_diff_mode_left_side (line 706) | def test_differ_diff_in_diff_mode_left_side(self):
method test_differ_diff_in_diff_mode_right_side (line 711) | def test_differ_diff_in_diff_mode_right_side(self):
method test_differ_diff_with_sparse_mode (line 716) | def test_differ_diff_with_sparse_mode(self):
method test_diff_options_default (line 722) | def test_diff_options_default(self):
method test_diff_mode_consts (line 747) | def test_diff_mode_consts(self):
method test_diff_options_comparator_for (line 759) | def test_diff_options_comparator_for(self):
method test_diff_fluent_setters (line 774) | def test_diff_fluent_setters(self):
method test_diff_with_epsilon_comparator (line 834) | def test_diff_with_epsilon_comparator(self):
method test_diff_options_with_duplicate_comparators (line 862) | def test_diff_options_with_duplicate_comparators(self):
FILE: python/test/test_histogram.py
class HistogramTest (line 22) | class HistogramTest(SparkTest):
method setUpClass (line 25) | def setUpClass(cls):
method test_histogram_with_ints (line 37) | def test_histogram_with_ints(self):
method test_histogram_with_floats (line 45) | def test_histogram_with_floats(self):
FILE: python/test/test_job_description.py
class JobDescriptionTest (line 25) | class JobDescriptionTest(SparkTest):
method _assert_job_description (line 27) | def _assert_job_description(self, expected: Optional[str]):
method setUp (line 41) | def setUp(self) -> None:
method test_with_job_description (line 44) | def test_with_job_description(self):
method test_append_job_description (line 59) | def test_append_job_description(self):
FILE: python/test/test_jvm.py
class PackageTest (line 31) | class PackageTest(SparkTest):
method setUpClass (line 35) | def setUpClass(cls):
method test_get_jvm_classic (line 40) | def test_get_jvm_classic(self):
method test_get_jvm_connect (line 51) | def test_get_jvm_connect(self):
method test_get_jvm_check_java_pkg_is_installed (line 64) | def test_get_jvm_check_java_pkg_is_installed(self):
method test_dotnet_ticks (line 79) | def test_dotnet_ticks(self):
method test_histogram (line 94) | def test_histogram(self):
method test_with_row_numbers (line 100) | def test_with_row_numbers(self):
method test_job_description (line 106) | def test_job_description(self):
method test_create_temp_dir (line 118) | def test_create_temp_dir(self):
method test_install_pip_package (line 124) | def test_install_pip_package(self):
method test_install_poetry_project (line 130) | def test_install_poetry_project(self):
method test_parquet (line 136) | def test_parquet(self):
FILE: python/test/test_package.py
class PackageTest (line 40) | class PackageTest(SparkTest):
method setUpClass (line 43) | def setUpClass(cls):
method compare_dfs (line 106) | def compare_dfs(self, expected, actual):
method test_backticks (line 116) | def test_backticks(self):
method test_distinct_prefix_for (line 124) | def test_distinct_prefix_for(self):
method test_handle_configured_case_sensitivity (line 133) | def test_handle_configured_case_sensitivity(self):
method test_list_contains_case_sensitivity (line 146) | def test_list_contains_case_sensitivity(self):
method test_list_filter_case_sensitivity (line 158) | def test_list_filter_case_sensitivity(self):
method test_list_diff_case_sensitivity (line 170) | def test_list_diff_case_sensitivity(self):
method test_dotnet_ticks_to_timestamp (line 183) | def test_dotnet_ticks_to_timestamp(self):
method test_dotnet_ticks_to_unix_epoch (line 191) | def test_dotnet_ticks_to_unix_epoch(self):
method test_dotnet_ticks_to_unix_epoch_nanos (line 199) | def test_dotnet_ticks_to_unix_epoch_nanos(self):
method test_timestamp_to_dotnet_ticks (line 208) | def test_timestamp_to_dotnet_ticks(self):
method test_unix_epoch_dotnet_ticks (line 218) | def test_unix_epoch_dotnet_ticks(self):
method test_unix_epoch_nanos_to_dotnet_ticks (line 226) | def test_unix_epoch_nanos_to_dotnet_ticks(self):
method test_count_null (line 233) | def test_count_null(self):
method test_session (line 242) | def test_session(self):
method test_session_or_ctx (line 246) | def test_session_or_ctx(self):
method test_create_temp_dir (line 251) | def test_create_temp_dir(self):
method test_install_pip_package (line 259) | def test_install_pip_package(self):
method test_install_pip_package_unknown_argument (line 283) | def test_install_pip_package_unknown_argument(self):
method test_install_pip_package_package_not_found (line 289) | def test_install_pip_package_package_not_found(self):
method test_install_pip_package_not_supported (line 295) | def test_install_pip_package_not_supported(self):
method test_install_poetry_project (line 306) | def test_install_poetry_project(self):
method test_install_poetry_project_wrong_arguments (line 342) | def test_install_poetry_project_wrong_arguments(self):
method test_install_poetry_project_not_supported (line 353) | def test_install_poetry_project_not_supported(self):
FILE: python/test/test_parquet.py
class ParquetTest (line 23) | class ParquetTest(SparkTest):
method test_parquet_metadata (line 27) | def test_parquet_metadata(self):
method test_parquet_schema (line 33) | def test_parquet_schema(self):
method test_parquet_blocks (line 39) | def test_parquet_blocks(self):
method test_parquet_block_columns (line 45) | def test_parquet_block_columns(self):
method test_parquet_partitions (line 51) | def test_parquet_partitions(self):
FILE: python/test/test_row_number.py
class RowNumberTest (line 24) | class RowNumberTest(SparkTest):
method setUpClass (line 27) | def setUpClass(cls):
method test_row_numbers (line 68) | def test_row_numbers(self):
method test_row_numbers_order_one_column (line 72) | def test_row_numbers_order_one_column(self):
method test_row_numbers_order_two_columns (line 78) | def test_row_numbers_order_two_columns(self):
method test_row_numbers_order_not_asc_one_column (line 84) | def test_row_numbers_order_not_asc_one_column(self):
method test_row_numbers_order_not_asc_two_columns (line 90) | def test_row_numbers_order_not_asc_two_columns(self):
method test_row_numbers_order_desc_one_column (line 96) | def test_row_numbers_order_desc_one_column(self):
method test_row_numbers_order_desc_two_columns (line 102) | def test_row_numbers_order_desc_two_columns(self):
method test_row_numbers_unpersist (line 108) | def test_row_numbers_unpersist(self):
method test_row_numbers_row_number_col_name (line 130) | def test_row_numbers_row_number_col_name(self):
FILE: src/test/java/uk/co/gresearch/test/SparkJavaTests.java
class SparkJavaTests (line 37) | public class SparkJavaTests {
method beforeClass (line 41) | @BeforeClass
method testBackticks (line 57) | @Test
method testHistogram (line 68) | @Test
method testHistogramWithAggColumn (line 75) | @Test
method testRowNumbers (line 86) | @Test
method testRowNumbersOrderOneColumn (line 97) | @Test
method testRowNumbersOrderTwoColumns (line 108) | @Test
method testRowNumbersOrderDesc (line 119) | @Test
method testRowNumbersUnpersist (line 130) | @Test
method testRowNumbersStorageLevelAndUnpersist (line 150) | @Test
method testRowNumbersColumnName (line 164) | @Test
method afterClass (line 177) | @AfterClass
FILE: src/test/java/uk/co/gresearch/test/diff/DiffJavaTests.java
class DiffJavaTests (line 36) | public class DiffJavaTests {
method beforeClass (line 41) | @BeforeClass
method testDiff (line 61) | @Test
method testDiffNoKey (line 73) | @Test
method testDiffSingleKey (line 86) | @Test
method testDiffMultipleKeys (line 98) | @Test
method testDiffIgnoredColumn (line 110) | @Test
method testDiffAs (line 122) | @Test
method testDiffOfWith (line 135) | @Test
method testDiffer (line 147) | @Test
method testDifferWithIgnored (line 162) | @Test
method testDiffWithOptions (line 179) | @Test
method testDiffWithComparators (line 203) | @Test
method testDiffWithComparator (line 221) | private void testDiffWithComparator(DiffOptions options) {
method afterClass (line 235) | @AfterClass
FILE: src/test/java/uk/co/gresearch/test/diff/JavaValue.java
class JavaValue (line 22) | public class JavaValue implements Serializable {
method JavaValue (line 27) | public JavaValue() { }
method JavaValue (line 29) | public JavaValue(Integer id, String label, Double score) {
method getId (line 35) | public Integer getId() {
method setId (line 39) | public void setId(Integer id) {
method getLabel (line 43) | public String getLabel() {
method setLabel (line 47) | public void setLabel(String label) {
method getScore (line 51) | public Double getScore() {
method setScore (line 55) | public void setScore(Double score) {
method equals (line 59) | @Override
method hashCode (line 68) | @Override
method toString (line 73) | @Override
FILE: src/test/java/uk/co/gresearch/test/diff/JavaValueAs.java
class JavaValueAs (line 22) | public class JavaValueAs implements Serializable {
method JavaValueAs (line 30) | public JavaValueAs() { }
method JavaValueAs (line 32) | public JavaValueAs(String diff, Integer id, String left_label, String ...
method getDiff (line 41) | public String getDiff() {
method setDiff (line 45) | public void setDiff(String diff) {
method getId (line 49) | public Integer getId() {
method setId (line 53) | public void setId(Integer id) {
method getLeft_label (line 57) | public String getLeft_label() {
method setLeft_label (line 61) | public void setLeft_label(String left_label) {
method getRight_label (line 65) | public String getRight_label() {
method setRight_label (line 69) | public void setRight_label(String right_label) {
method getLeft_score (line 73) | public Double getLeft_score() {
method setLeft_score (line 77) | public void setLeft_score(Double left_score) {
method getRight_score (line 81) | public Double getRight_score() {
method setRight_score (line 85) | public void setRight_score(Double right_score) {
method equals (line 89) | @Override
method hashCode (line 98) | @Override
method toString (line 103) | @Override
Condensed preview — 128 files, each showing path, character count, and a content snippet.
[
{
"path": ".github/actions/build-whl/action.yml",
"chars": 3795,
"preview": "name: 'Build Whl'\nauthor: 'EnricoMi'\ndescription: 'A GitHub Action that builds pyspark-extension package'\n\ninputs:\n spa"
},
{
"path": ".github/actions/check-compat/action.yml",
"chars": 3810,
"preview": "name: 'Check'\nauthor: 'EnricoMi'\ndescription: 'A GitHub Action that checks compatibility of spark-extension'\n\ninputs:\n "
},
{
"path": ".github/actions/prime-caches/action.yml",
"chars": 3815,
"preview": "name: 'Prime caches'\nauthor: 'EnricoMi'\ndescription: 'A GitHub Action that primes caches'\n\ninputs:\n spark-version:\n "
},
{
"path": ".github/actions/test-jvm/action.yml",
"chars": 3755,
"preview": "name: 'Test JVM'\nauthor: 'EnricoMi'\ndescription: 'A GitHub Action that tests JVM spark-extension'\n\ninputs:\n spark-versi"
},
{
"path": ".github/actions/test-python/action.yml",
"chars": 9408,
"preview": "name: 'Test Python'\nauthor: 'EnricoMi'\ndescription: 'A GitHub Action that tests Python spark-extension'\n\n# pyspark is no"
},
{
"path": ".github/actions/test-release/action.yml",
"chars": 8094,
"preview": "name: 'Test Release'\nauthor: 'EnricoMi'\ndescription: 'A GitHub Action that tests spark-extension release'\n\n# pyspark is "
},
{
"path": ".github/dependabot.yml",
"chars": 208,
"preview": "version: 2\nupdates:\n - package-ecosystem: \"github-actions\"\n directory: \"/\"\n schedule:\n interval: \"monthly\"\n\n"
},
{
"path": ".github/show-spark-versions.sh",
"chars": 989,
"preview": "#!/bin/bash\n\nbase=$(cd \"$(dirname \"$0\")\"; pwd)\n\ngrep -- \"-version\" \"$base\"/workflows/prime-caches.yml | sed -e \"s/ -//g\""
},
{
"path": ".github/workflows/build-jvm.yml",
"chars": 3193,
"preview": "name: Build JVM\n\non:\n workflow_call:\n\njobs:\n build:\n name: Build (Spark ${{ matrix.spark-version }} Scala ${{ matri"
},
{
"path": ".github/workflows/build-python.yml",
"chars": 2463,
"preview": "name: Build Python\n\non:\n workflow_call:\n\njobs:\n # pyspark<4 is not available for snapshots or scala other than 2.12\n "
},
{
"path": ".github/workflows/build-snapshots.yml",
"chars": 2887,
"preview": "name: Build Snapshots\n\non:\n workflow_call:\n\njobs:\n build:\n name: Build (Spark ${{ matrix.spark-version }} Scala ${{"
},
{
"path": ".github/workflows/check.yml",
"chars": 3530,
"preview": "name: Check\n\non:\n workflow_call:\n\njobs:\n lint:\n name: Scala lint\n runs-on: ubuntu-latest\n\n steps:\n - nam"
},
{
"path": ".github/workflows/ci.yml",
"chars": 2126,
"preview": "name: CI\n\non:\n schedule:\n - cron: '0 8 */10 * *'\n push:\n branches:\n - 'master'\n tags:\n - '*'\n merg"
},
{
"path": ".github/workflows/clear-caches.yaml",
"chars": 758,
"preview": "name: Clear caches\n\non:\n workflow_dispatch:\n\npermissions:\n actions: write\n\njobs:\n clear-cache:\n runs-on: ubuntu-la"
},
{
"path": ".github/workflows/prepare-release.yml",
"chars": 7156,
"preview": "name: Prepare release\n\non:\n workflow_dispatch:\n inputs:\n github_release_latest:\n description: 'Make the "
},
{
"path": ".github/workflows/prime-caches.yml",
"chars": 5653,
"preview": "name: Prime caches\n\non:\n workflow_dispatch:\n\njobs:\n prime:\n name: Spark ${{ matrix.spark-compat-version }}.${{ matr"
},
{
"path": ".github/workflows/publish-release.yml",
"chars": 8117,
"preview": "name: Publish release\n\non:\n workflow_dispatch:\n inputs:\n versions:\n required: true\n type: string\n"
},
{
"path": ".github/workflows/publish-snapshot.yml",
"chars": 5561,
"preview": "name: Publish snapshot\n\non:\n workflow_dispatch:\n push:\n branches: [\"master\"]\n\nenv:\n PYTHON_VERSION: \"3.10\"\n\njobs:\n"
},
{
"path": ".github/workflows/test-jvm.yml",
"chars": 3458,
"preview": "name: Test JVM\n\non:\n workflow_call:\n\njobs:\n test:\n name: Test (Spark ${{ matrix.spark-compat-version }}.${{ matrix."
},
{
"path": ".github/workflows/test-python.yml",
"chars": 3712,
"preview": "name: Test Python\n\non:\n workflow_call:\n\njobs:\n # pyspark is not available for snapshots or scala other than 2.12\n # w"
},
{
"path": ".github/workflows/test-release.yml",
"chars": 3567,
"preview": "name: Test release\n\non:\n workflow_call:\n\njobs:\n test:\n name: Test Release Spark ${{ matrix.spark-version }} Scala $"
},
{
"path": ".github/workflows/test-results.yml",
"chars": 971,
"preview": "name: Test Results\n\non:\n workflow_run:\n workflows: [\"CI\"]\n types:\n - completed\npermissions: {}\n\njobs:\n publ"
},
{
"path": ".github/workflows/test-snapshots.yml",
"chars": 2926,
"preview": "name: Test Snapshots\n\non:\n workflow_call:\n\njobs:\n test:\n name: Test (Spark ${{ matrix.spark-version }} Scala ${{ ma"
},
{
"path": ".gitignore",
"chars": 427,
"preview": "# use glob syntax.\nsyntax: glob\n*.ser\n*.class\n*~\n*.bak\n#*.off\n*.old\n\n# eclipse conf file\n.settings\n.classpath\n.project\n."
},
{
"path": ".scalafmt.conf",
"chars": 124,
"preview": "version = 3.7.17\nrunner.dialect = scala213\nrewrite.trailingCommas.style = keep\ndocstrings.style = Asterisk\nmaxColumn = 1"
},
{
"path": "CHANGELOG.md",
"chars": 5328,
"preview": "# Changelog\nAll notable changes to this project will be documented in this file.\n\nThe format is based on [Keep a Changel"
},
{
"path": "CONDITIONAL.md",
"chars": 1675,
"preview": "# DataFrame Transformations\n\nThe Spark `Dataset` API allows for chaining transformations as in the following example:\n\n`"
},
{
"path": "DIFF.md",
"chars": 23271,
"preview": "# Spark Diff\n\nAdd the following `import` to your Scala code:\n\n```scala\nimport uk.co.gresearch.spark.diff._\n```\n\nor this "
},
{
"path": "GROUPS.md",
"chars": 2387,
"preview": "# Sorted Groups\n\nSpark provides the ability to group rows by an arbitrary key,\nwhile then providing an iterator for each"
},
{
"path": "HISTOGRAM.md",
"chars": 1530,
"preview": "# Histogram\n\nFor a table `df` like\n\n|user |score|\n|:-----:|:---:|\n|Alice |101 |\n|Alice |221 |\n|Alice |211 |\n|Ali"
},
{
"path": "LICENSE",
"chars": 11358,
"preview": "\n Apache License\n Version 2.0, January 2004\n "
},
{
"path": "MAINTAINERS.md",
"chars": 294,
"preview": "## Current maintainers of the project\n\n| Maintainer | GitHub ID "
},
{
"path": "PARQUET.md",
"chars": 18403,
"preview": "# Parquet Metadata\n\nThe structure of Parquet files (the metadata, not the data stored in Parquet) can be inspected simil"
},
{
"path": "PARTITIONING.md",
"chars": 6266,
"preview": "# Partitioned Writing\n\nIf you have ever used `Dataset[T].write.partitionBy`, here is how you can minimize the number of\n"
},
{
"path": "PYSPARK-DEPS.md",
"chars": 4681,
"preview": "# PySpark dependencies\n\nUsing PySpark on a cluster requires all cluster nodes to have those Python packages installed th"
},
{
"path": "README.md",
"chars": 14602,
"preview": "# Spark Extension\n\nThis project provides extensions to the [Apache Spark project](https://spark.apache.org/) in Scala an"
},
{
"path": "RELEASE.md",
"chars": 6253,
"preview": "# Releasing Spark Extension\n\nThis provides instructions on how to release a version of `spark-extension`. We release thi"
},
{
"path": "ROW_NUMBER.md",
"chars": 6125,
"preview": "# Global Row Number\n\nSpark provides the [SQL function `row_number`](https://spark.apache.org/docs/latest/api/sql/index.h"
},
{
"path": "SECURITY.md",
"chars": 559,
"preview": "# Security and Coordinated Vulnerability Disclosure Policy\n\nThis project appreciates and encourages coordinated disclosu"
},
{
"path": "build-whl.sh",
"chars": 1057,
"preview": "#!/bin/bash\n\nset -eo pipefail\n\nbase=$(cd \"$(dirname \"$0\")\"; pwd)\n\nversion=$(grep --max-count=1 \"<version>.*</version>\" \""
},
{
"path": "bump-version.sh",
"chars": 2106,
"preview": "#!/bin/bash\n#\n# Copyright 2020 G-Research\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may "
},
{
"path": "examples/python-deps/Dockerfile",
"chars": 137,
"preview": "FROM apache/spark:3.5.0\n\nENV PATH=\"${PATH}:/opt/spark/bin\"\n\nUSER root\nRUN mkdir -p /home/spark; chown spark:spark /home/"
},
{
"path": "examples/python-deps/docker-compose.yml",
"chars": 946,
"preview": "version: \"3\"\nservices:\n master:\n container_name: spark-master\n image: spark-extension-example-docker\n command:"
},
{
"path": "examples/python-deps/example.py",
"chars": 363,
"preview": "from pyspark.sql import SparkSession\n\ndef main():\n spark = SparkSession.builder.appName(\"spark_app\").getOrCreate()\n\n "
},
{
"path": "pom.xml",
"chars": 15231,
"preview": "<project xmlns=\"http://maven.apache.org/POM/4.0.0\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:schemaLocat"
},
{
"path": "python/README.md",
"chars": 5964,
"preview": "# Spark Extension\n\nThis project provides extensions to the [Apache Spark project](https://spark.apache.org/) in Scala an"
},
{
"path": "python/gresearch/__init__.py",
"chars": 586,
"preview": "# Copyright 2020 G-Research\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use thi"
},
{
"path": "python/gresearch/spark/__init__.py",
"chars": 27916,
"preview": "# Copyright 2020 G-Research\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use thi"
},
{
"path": "python/gresearch/spark/diff/__init__.py",
"chars": 44396,
"preview": "# Copyright 2020 G-Research\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use thi"
},
{
"path": "python/gresearch/spark/diff/comparator/__init__.py",
"chars": 5015,
"preview": "# Copyright 2022 G-Research\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use thi"
},
{
"path": "python/gresearch/spark/parquet/__init__.py",
"chars": 9968,
"preview": "# Copyright 2023 G-Research\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use thi"
},
{
"path": "python/pyproject.toml",
"chars": 81,
"preview": "[build-system]\nrequires = [\"setuptools\"]\nbuild-backend = \"setuptools.build_meta\"\n"
},
{
"path": "python/pyspark/jars/.gitignore",
"chars": 71,
"preview": "# Ignore everything in this directory\n*\n# Except this file\n!.gitignore\n"
},
{
"path": "python/setup.py",
"chars": 4685,
"preview": "#!/usr/bin/env python3\n\n# Copyright 2023 G-Research\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\")"
},
{
"path": "python/test/__init__.py",
"chars": 586,
"preview": "# Copyright 2020 G-Research\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use thi"
},
{
"path": "python/test/spark_common.py",
"chars": 4871,
"preview": "# Copyright 2020 G-Research\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use thi"
},
{
"path": "python/test/test_diff.py",
"chars": 51605,
"preview": "# Copyright 2020 G-Research\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use thi"
},
{
"path": "python/test/test_histogram.py",
"chars": 1944,
"preview": "# Copyright 2020 G-Research\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use thi"
},
{
"path": "python/test/test_job_description.py",
"chars": 3029,
"preview": "# Copyright 2023 G-Research\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use thi"
},
{
"path": "python/test/test_jvm.py",
"chars": 7073,
"preview": "# Copyright 2024 G-Research\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use thi"
},
{
"path": "python/test/test_package.py",
"chars": 19364,
"preview": "# Copyright 2023 G-Research\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use thi"
},
{
"path": "python/test/test_parquet.py",
"chars": 3274,
"preview": "# Copyright 2023 G-Research\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use thi"
},
{
"path": "python/test/test_row_number.py",
"chars": 6073,
"preview": "# Copyright 2022 G-Research\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use thi"
},
{
"path": "release.sh",
"chars": 5347,
"preview": "#!/bin/bash\n#\n# Copyright 2020 G-Research\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may "
},
{
"path": "set-version.sh",
"chars": 1901,
"preview": "#!/bin/bash\n\nif [ $# -eq 1 ]\nthen\n IFS=-\n read version flavour <<< \"$1\"\n\n echo \"setting version=$version${flavo"
},
{
"path": "src/main/scala/uk/co/gresearch/package.scala",
"chars": 3620,
"preview": "/*\n * Copyright 2020 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/BuildVersion.scala",
"chars": 2365,
"preview": "/*\n * Copyright 2022 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/Histogram.scala",
"chars": 4062,
"preview": "/*\n * Copyright 2020 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/RowNumbers.scala",
"chars": 4915,
"preview": "/*\n * Copyright 2023 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/SparkVersion.scala",
"chars": 1253,
"preview": "/*\n * Copyright 2023 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/UnpersistHandle.scala",
"chars": 2344,
"preview": "/*\n * Copyright 2022 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/diff/App.scala",
"chars": 12225,
"preview": "/*\n * Copyright 2023 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/diff/Diff.scala",
"chars": 39451,
"preview": "/*\n * Copyright 2020 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/diff/DiffComparators.scala",
"chars": 5079,
"preview": "/*\n * Copyright 2022 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/diff/DiffOptions.scala",
"chars": 17896,
"preview": "/*\n * Copyright 2020 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/diff/comparator/DefaultDiffComparator.scala",
"chars": 847,
"preview": "/*\n * Copyright 2022 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/diff/comparator/DiffComparator.scala",
"chars": 753,
"preview": "/*\n * Copyright 2022 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/diff/comparator/DurationDiffComparator.scala",
"chars": 2175,
"preview": "/*\n * Copyright 2022 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/diff/comparator/EpsilonDiffComparator.scala",
"chars": 1662,
"preview": "/*\n * Copyright 2022 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/diff/comparator/EquivDiffComparator.scala",
"chars": 4767,
"preview": "/*\n * Copyright 2022 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/diff/comparator/MapDiffComparator.scala",
"chars": 3572,
"preview": "/*\n * Copyright 2022 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/diff/comparator/NullSafeEqualDiffComparator.scala",
"chars": 821,
"preview": "/*\n * Copyright 2022 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/diff/comparator/TypedDiffComparator.scala",
"chars": 1091,
"preview": "/*\n * Copyright 2022 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/diff/comparator/WhitespaceDiffComparator.scala",
"chars": 1034,
"preview": "/*\n * Copyright 2023 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/diff/package.scala",
"chars": 16219,
"preview": "/*\n * Copyright 2020 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/group/package.scala",
"chars": 7997,
"preview": "/*\n * Copyright 2022 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/package.scala",
"chars": 40809,
"preview": "/*\n * Copyright 2020 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/parquet/ParquetMetaDataUtil.scala",
"chars": 4267,
"preview": "/*\n * Copyright 2023 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala/uk/co/gresearch/spark/parquet/package.scala",
"chars": 24395,
"preview": "/*\n * Copyright 2023 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala-spark-3.2/uk/co/gresearch/spark/parquet/SplitFile.scala",
"chars": 948,
"preview": "/*\n * Copyright 2023 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala-spark-3.3/uk/co/gresearch/spark/parquet/SplitFile.scala",
"chars": 963,
"preview": "/*\n * Copyright 2023 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala-spark-3.5/org/apache/spark/sql/extension/package.scala",
"chars": 967,
"preview": "/*\n * Copyright 2024 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala-spark-3.5/uk/co/gresearch/spark/Backticks.scala",
"chars": 2628,
"preview": "/*\n * Copyright 2021 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala-spark-4.0/org/apache/spark/sql/extension/package.scala",
"chars": 1033,
"preview": "/*\n * Copyright 2024 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala-spark-4.0/uk/co/gresearch/spark/Backticks.scala",
"chars": 2085,
"preview": "/*\n * Copyright 2021 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/main/scala-spark-4.0/uk/co/gresearch/spark/parquet/SplitFile.scala",
"chars": 972,
"preview": "/*\n * Copyright 2023 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/test/java/uk/co/gresearch/test/SparkJavaTests.java",
"chars": 7408,
"preview": "/*\n * Copyright 2021 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/test/java/uk/co/gresearch/test/diff/DiffJavaTests.java",
"chars": 10479,
"preview": "/*\n * Copyright 2021 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/test/java/uk/co/gresearch/test/diff/JavaValue.java",
"chars": 1960,
"preview": "/*\n * Copyright 2021 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/test/java/uk/co/gresearch/test/diff/JavaValueAs.java",
"chars": 3220,
"preview": "/*\n * Copyright 2022 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/test/resources/log4j.properties",
"chars": 1900,
"preview": "#\n# Copyright 2020 G-Research\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this"
},
{
"path": "src/test/resources/log4j2.properties",
"chars": 3259,
"preview": "#\n# Copyright 2020 G-Research\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this"
},
{
"path": "src/test/scala/uk/co/gresearch/spark/GroupBySuite.scala",
"chars": 10120,
"preview": "/*\n * Copyright 2022 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/test/scala/uk/co/gresearch/spark/HistogramSuite.scala",
"chars": 9241,
"preview": "/*\n * Copyright 2020 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/test/scala/uk/co/gresearch/spark/SparkSuite.scala",
"chars": 26482,
"preview": "/*\n * Copyright 2020 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/test/scala/uk/co/gresearch/spark/SparkTestSession.scala",
"chars": 1188,
"preview": "/*\n * Copyright 2020 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/test/scala/uk/co/gresearch/spark/WritePartitionedSuite.scala",
"chars": 9869,
"preview": "/*\n * Copyright 2020 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/test/scala/uk/co/gresearch/spark/diff/AppSuite.scala",
"chars": 4712,
"preview": "/*\n * Copyright 2023 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/test/scala/uk/co/gresearch/spark/diff/DiffComparatorSuite.scala",
"chars": 25605,
"preview": "/*\n * Copyright 2020 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/test/scala/uk/co/gresearch/spark/diff/DiffOptionsSuite.scala",
"chars": 9802,
"preview": "/*\n * Copyright 2020 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/test/scala/uk/co/gresearch/spark/diff/DiffSuite.scala",
"chars": 73772,
"preview": "/*\n * Copyright 2020 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/test/scala/uk/co/gresearch/spark/diff/examples/Examples.scala",
"chars": 2456,
"preview": "/*\n * Copyright 2020 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/test/scala/uk/co/gresearch/spark/group/GroupSuite.scala",
"chars": 8998,
"preview": "/*\n * Copyright 2022 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/test/scala/uk/co/gresearch/spark/parquet/ParquetSuite.scala",
"chars": 24186,
"preview": "/*\n * Copyright 2023 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/test/scala/uk/co/gresearch/spark/test/package.scala",
"chars": 1022,
"preview": "/*\n * Copyright 2020 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/test/scala/uk/co/gresearch/test/ClasspathSuite.scala",
"chars": 2035,
"preview": "/*\n * Copyright 2025 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/test/scala/uk/co/gresearch/test/Spec.scala",
"chars": 806,
"preview": "/*\n * Copyright 2025 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/test/scala/uk/co/gresearch/test/Suite.scala",
"chars": 810,
"preview": "/*\n * Copyright 2025 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/test/scala-spark-3/uk/co/gresearch/spark/SparkSuiteHelper.scala",
"chars": 840,
"preview": "/*\n * Copyright 2024 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "src/test/scala-spark-4/uk/co/gresearch/spark/SparkSuiteHelper.scala",
"chars": 917,
"preview": "/*\n * Copyright 2024 G-Research\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use"
},
{
"path": "test-release.py",
"chars": 2166,
"preview": "# this requires parquet-hadoop-*-tests.jar\n# fetch with mvn dependency:get -Dtransitive=false -Dartifact=org.apache.parq"
},
{
"path": "test-release.scala",
"chars": 2483,
"preview": "// this requires parquet-hadoop-*-tests.jar\n// fetch with mvn dependency:get -Dtransitive=false -Dartifact=org.apache.pa"
},
{
"path": "test-release.sh",
"chars": 2565,
"preview": "#!/bin/bash\n\nset -eo pipefail\n\nversion=$(grep --max-count=1 \"<version>.*</version>\" pom.xml | sed -E -e \"s/\\s*<[^>]+>//g"
}
]
// ... and 5 more files (truncated)