Repository: G-Research/spark-extension
Branch: master
Commit: 65c3dda4a96b
Files: 128
Total size: 857.8 KB

Directory structure:
gitextract_5zjc6xfa/
├── .github/
│   ├── actions/
│   │   ├── build-whl/
│   │   │   └── action.yml
│   │   ├── check-compat/
│   │   │   └── action.yml
│   │   ├── prime-caches/
│   │   │   └── action.yml
│   │   ├── test-jvm/
│   │   │   └── action.yml
│   │   ├── test-python/
│   │   │   └── action.yml
│   │   └── test-release/
│   │       └── action.yml
│   ├── dependabot.yml
│   ├── show-spark-versions.sh
│   └── workflows/
│       ├── build-jvm.yml
│       ├── build-python.yml
│       ├── build-snapshots.yml
│       ├── check.yml
│       ├── ci.yml
│       ├── clear-caches.yaml
│       ├── prepare-release.yml
│       ├── prime-caches.yml
│       ├── publish-release.yml
│       ├── publish-snapshot.yml
│       ├── test-jvm.yml
│       ├── test-python.yml
│       ├── test-release.yml
│       ├── test-results.yml
│       └── test-snapshots.yml
├── .gitignore
├── .scalafmt.conf
├── CHANGELOG.md
├── CONDITIONAL.md
├── DIFF.md
├── GROUPS.md
├── HISTOGRAM.md
├── LICENSE
├── MAINTAINERS.md
├── PARQUET.md
├── PARTITIONING.md
├── PYSPARK-DEPS.md
├── README.md
├── RELEASE.md
├── ROW_NUMBER.md
├── SECURITY.md
├── build-whl.sh
├── bump-version.sh
├── examples/
│   └── python-deps/
│       ├── Dockerfile
│       ├── docker-compose.yml
│       └── example.py
├── pom.xml
├── python/
│   ├── README.md
│   ├── gresearch/
│   │   ├── __init__.py
│   │   └── spark/
│   │       ├── __init__.py
│   │       ├── diff/
│   │       │   ├── __init__.py
│   │       │   └── comparator/
│   │       │       └── __init__.py
│   │       └── parquet/
│   │           └── __init__.py
│   ├── pyproject.toml
│   ├── pyspark/
│   │   └── jars/
│   │       └── .gitignore
│   ├── setup.py
│   └── test/
│       ├── __init__.py
│       ├── spark_common.py
│       ├── test_diff.py
│       ├── test_histogram.py
│       ├── test_job_description.py
│       ├── test_jvm.py
│       ├── test_package.py
│       ├── test_parquet.py
│       └── test_row_number.py
├── release.sh
├── set-version.sh
├── src/
│   ├── main/
│   │   ├── scala/
│   │   │   └── uk/
│   │   │       └── co/
│   │   │           └── gresearch/
│   │   │               ├── package.scala
│   │   │               └── spark/
│   │   │                   ├── BuildVersion.scala
│   │   │                   ├── Histogram.scala
│   │   │                   ├── RowNumbers.scala
│   │   │                   ├── SparkVersion.scala
│   │   │                   ├── UnpersistHandle.scala
│   │   │                   ├── diff/
│   │   │                   │   ├── App.scala
│   │   │                   │   ├── Diff.scala
│   │   │                   │   ├── DiffComparators.scala
│   │   │                   │   ├── DiffOptions.scala
│   │   │                   │   ├── comparator/
│   │   │                   │   │   ├── DefaultDiffComparator.scala
│   │   │                   │   │   ├── DiffComparator.scala
│   │   │                   │   │   ├── DurationDiffComparator.scala
│   │   │                   │   │   ├── EpsilonDiffComparator.scala
│   │   │                   │   │   ├── EquivDiffComparator.scala
│   │   │                   │   │   ├── MapDiffComparator.scala
│   │   │                   │   │   ├── NullSafeEqualDiffComparator.scala
│   │   │                   │   │   ├── TypedDiffComparator.scala
│   │   │                   │   │   └── WhitespaceDiffComparator.scala
│   │   │                   │   └── package.scala
│   │   │                   ├── group/
│   │   │                   │   └── package.scala
│   │   │                   ├── package.scala
│   │   │                   └── parquet/
│   │   │                       ├── ParquetMetaDataUtil.scala
│   │   │                       └── package.scala
│   │   ├── scala-spark-3.2/
│   │   │   └── uk/
│   │   │       └── co/
│   │   │           └── gresearch/
│   │   │               └── spark/
│   │   │                   └── parquet/
│   │   │                       └── SplitFile.scala
│   │   ├── scala-spark-3.3/
│   │   │   └── uk/
│   │   │       └── co/
│   │   │           └── gresearch/
│   │   │               └── spark/
│   │   │                   └── parquet/
│   │   │                       └── SplitFile.scala
│   │   ├── scala-spark-3.5/
│   │   │   ├── org/
│   │   │   │   └── apache/
│   │   │   │       └── spark/
│   │   │   │           └── sql/
│   │   │   │               └── extension/
│   │   │   │                   └── package.scala
│   │   │   └── uk/
│   │   │       └── co/
│   │   │           └── gresearch/
│   │   │               └── spark/
│   │   │                   └── Backticks.scala
│   │   └── scala-spark-4.0/
│   │       ├── org/
│   │       │   └── apache/
│   │       │       └── spark/
│   │       │           └── sql/
│   │       │               └── extension/
│   │       │                   └── package.scala
│   │       └── uk/
│   │           └── co/
│   │               └── gresearch/
│   │                   └── spark/
│   │                       ├── Backticks.scala
│   │                       └── parquet/
│   │                           └── SplitFile.scala
│   └── test/
│       ├── files/
│       │   ├── encrypted1.parquet
│       │   ├── encrypted2.parquet
│       │   ├── nested.parquet
│       │   └── test.parquet/
│       │       ├── file1.parquet
│       │       └── file2.parquet
│       ├── java/
│       │   └── uk/
│       │       └── co/
│       │           └── gresearch/
│       │               └── test/
│       │                   ├── SparkJavaTests.java
│       │                   └── diff/
│       │                       ├── DiffJavaTests.java
│       │                       ├── JavaValue.java
│       │                       └── JavaValueAs.java
│       ├── resources/
│       │   ├── log4j.properties
│       │   └── log4j2.properties
│       ├── scala/
│       │   └── uk/
│       │       └── co/
│       │           └── gresearch/
│       │               ├── spark/
│       │               │   ├── GroupBySuite.scala
│       │               │   ├── HistogramSuite.scala
│       │               │   ├── SparkSuite.scala
│       │               │   ├── SparkTestSession.scala
│       │               │   ├── WritePartitionedSuite.scala
│       │               │   ├── diff/
│       │               │   │   ├── AppSuite.scala
│       │               │   │   ├── DiffComparatorSuite.scala
│       │               │   │   ├── DiffOptionsSuite.scala
│       │               │   │   ├── DiffSuite.scala
│       │               │   │   └── examples/
│       │               │   │       └── Examples.scala
│       │               │   ├── group/
│       │               │   │   └── GroupSuite.scala
│       │               │   ├── parquet/
│       │               │   │   └── ParquetSuite.scala
│       │               │   └── test/
│       │               │       └── package.scala
│       │               └── test/
│       │                   ├── ClasspathSuite.scala
│       │                   ├── Spec.scala
│       │                   └── Suite.scala
│       ├── scala-spark-3/
│       │   └── uk/
│       │       └── co/
│       │           └── gresearch/
│       │               └── spark/
│       │                   └── SparkSuiteHelper.scala
│       └── scala-spark-4/
│           └── uk/
│               └── co/
│                   └── gresearch/
│                       └── spark/
│                           └── SparkSuiteHelper.scala
├── test-release.py
├── test-release.scala
└── test-release.sh

================================================
FILE CONTENTS
================================================

================================================
FILE: .github/actions/build-whl/action.yml
================================================
name: 'Build Whl'
author: 'EnricoMi'
description: 'A GitHub Action that builds pyspark-extension package'

inputs:
  spark-version:
    description: Spark version, e.g. 3.4.0, 3.4.0-SNAPSHOT, or 4.0.0-preview1
    required: true
  scala-version:
    description: Scala version, e.g. 2.12.15
    required: true
  spark-compat-version:
    description: Spark compatibility version, e.g. 3.4
    required: true
  scala-compat-version:
    description: Scala compatibility version, e.g. 2.12
    required: true
  java-compat-version:
    description: Java compatibility version, e.g. 8
    required: true
  python-version:
    description: Python version, e.g. 3.8
    required: true

runs:
  using: 'composite'
  steps:
  - name: Fetch Binaries Artifact
    uses: actions/download-artifact@v4
    with:
      name: Binaries-${{ inputs.spark-compat-version }}-${{ inputs.scala-compat-version }}
      path: .
  - name: Set versions in pom.xml
    run: |
      ./set-version.sh ${{ inputs.spark-version }} ${{ inputs.scala-version }}
      git diff
    shell: bash
  - name: Make this work with PySpark preview versions
    if: contains(inputs.spark-version, 'preview')
    run: |
      sed -i -e 's/f"\(pyspark~=.*\)"/f"\1.dev1"/' -e 's/f"\({spark_compat_version}.0\)"/"${{ inputs.spark-version }}"/g' python/setup.py
      git diff python/setup.py
    shell: bash
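  # A sketch of what the sed above rewrites, assuming python/setup.py pins pyspark
  # with f-strings like these (the exact lines are illustrative, not verbatim):
  #   f"pyspark~={spark_compat_version}.0"  ->  f"pyspark~={spark_compat_version}.0.dev1"
  #   f"{spark_compat_version}.0"           ->  "4.2.0-preview3"  (the literal spark-version input)
  # Appending .dev1 lets pip accept pre-release (preview) builds of pyspark.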
  - name: Restore Maven packages cache
    if: github.event_name != 'schedule'
    uses: actions/cache/restore@v4
    with:
      path: ~/.m2/repository
      key: ${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-${{ hashFiles('pom.xml') }}
      restore-keys: |
        ${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-${{ hashFiles('pom.xml') }}
        ${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-
  - name: Setup JDK ${{ inputs.java-compat-version }}
    uses: actions/setup-java@v4
    with:
      java-version: ${{ inputs.java-compat-version }}
      distribution: 'zulu'
  - name: Fetch Release Test Dependencies
    run: |
      # Fetch Release Test Dependencies
      echo "::group::mvn dependency:get"
      mvn dependency:get -Dtransitive=false -Dartifact=org.apache.parquet:parquet-hadoop:1.16.0:jar:tests
      echo "::endgroup::"
    shell: bash
  - name: Setup Python
    uses: actions/setup-python@v5
    with:
      python-version: ${{ inputs.python-version }}
  - name: Install Python dependencies
    run: |
      # Install Python dependencies
      echo "::group::pip install"
      python -m pip install --upgrade pip build twine
      echo "::endgroup::"
    shell: bash
  - name: Build whl
    run: |
      # Build whl
      echo "::group::build-whl.sh"
      ./build-whl.sh
      echo "::endgroup::"
    shell: bash
  - name: Test whl
    run: |
      # Test whl
      echo "::group::test-release.py"
      twine check python/dist/*
      # .dev1 allows this to work with preview versions
      pip install python/dist/*.whl "pyspark~=${{ inputs.spark-compat-version }}.0.dev1"
      python test-release.py
      echo "::endgroup::"
    shell: bash
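  # Why "~=X.Y.0.dev1" works (PEP 440): the specifier still means "compatible with
  # X.Y.*", but naming a pre-release explicitly also makes pip consider pre-releases,
  # so preview builds of pyspark (published as X.Y.0.devN) can satisfy it while
  # regular releases keep matching as before.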
  - name: Upload whl
    uses: actions/upload-artifact@v4
    with:
      name: Whl (Spark ${{ inputs.spark-compat-version }} Scala ${{ inputs.scala-compat-version }})
      path: |
        python/dist/*.whl
  - name: Build whl with mvn
    env:
      JDK_JAVA_OPTIONS: --add-exports java.base/sun.nio.ch=ALL-UNNAMED --add-exports java.base/sun.util.calendar=ALL-UNNAMED
    run: |
      # Build whl with mvn
      rm -rf target python/dist python/pyspark_extension.egg-info pyspark/jars/*.jar
      echo "::group::build-whl.sh"
      ./build-whl.sh
      echo "::endgroup::"
    shell: bash

branding:
  icon: 'check-circle'
  color: 'green'

================================================
FILE: .github/actions/check-compat/action.yml
================================================
name: 'Check'
author: 'EnricoMi'
description: 'A GitHub Action that checks compatibility of spark-extension'

inputs:
  spark-version:
    description: Spark version, e.g. 3.4.0 or 3.4.0-SNAPSHOT
    required: true
  scala-version:
    description: Scala version, e.g. 2.12.15
    required: true
  spark-compat-version:
    description: Spark compatibility version, e.g. 3.4
    required: true
  scala-compat-version:
    description: Scala compatibility version, e.g. 2.12
    required: true
  package-version:
    description: Spark-Extension version to check against
    required: true

runs:
  using: 'composite'
  steps:
  - name: Fetch Binaries Artifact
    uses: actions/download-artifact@v4
    with:
      name: Binaries-${{ inputs.spark-compat-version }}-${{ inputs.scala-compat-version }}
      path: .
  - name: Set versions in pom.xml
    run: |
      ./set-version.sh ${{ inputs.spark-version }} ${{ inputs.scala-version }}
      git diff
    shell: bash
  - name: Restore Maven packages cache
    if: github.event_name != 'schedule'
    uses: actions/cache/restore@v4
    with:
      path: ~/.m2/repository
      key: ${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-${{ hashFiles('pom.xml') }}
      restore-keys: |
        ${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-${{ hashFiles('pom.xml') }}
        ${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-
  - name: Setup JDK 1.8
    uses: actions/setup-java@v4
    with:
      java-version: '8'
      distribution: 'zulu'
  - name: Install Checker
    run: |
      # Install Checker
      echo "::group::apt update install"
      sudo apt update
      sudo apt install japi-compliance-checker
      echo "::endgroup::"
    shell: bash
  - name: Release exists
    id: exists
    continue-on-error: true
    run: |
      # Release exists
      curl --head --fail https://repo1.maven.org/maven2/uk/co/gresearch/spark/spark-extension_${{ inputs.scala-compat-version }}/${{ inputs.package-version }}-${{ inputs.spark-compat-version }}/spark-extension_${{ inputs.scala-compat-version }}-${{ inputs.package-version }}-${{ inputs.spark-compat-version }}.jar
    shell: bash
  - name: Fetch package
    if: steps.exists.outcome == 'success'
    run: |
      # Fetch package
      echo "::group::mvn dependency:get"
      mvn dependency:get -Dtransitive=false -DremoteRepositories -Dartifact=uk.co.gresearch.spark:spark-extension_${{ inputs.scala-compat-version }}:${{ inputs.package-version }}-${{ inputs.spark-compat-version }}
      echo "::endgroup::"
    shell: bash
  - name: Check
    if: steps.exists.outcome == 'success'
    continue-on-error: ${{ github.ref == 'refs/heads/master' }}
    run: |
      # Check
      echo "::group::japi-compliance-checker"
      ls -lah ~/.m2/repository/uk/co/gresearch/spark/spark-extension_${{ inputs.scala-compat-version }}/${{ inputs.package-version }}-${{ inputs.spark-compat-version }}/spark-extension_${{ inputs.scala-compat-version }}-${{ inputs.package-version }}-${{ inputs.spark-compat-version }}.jar target/spark-extension*.jar
      japi-compliance-checker ~/.m2/repository/uk/co/gresearch/spark/spark-extension_${{ inputs.scala-compat-version }}/${{ inputs.package-version }}-${{ inputs.spark-compat-version }}/spark-extension_${{ inputs.scala-compat-version }}-${{ inputs.package-version }}-${{ inputs.spark-compat-version }}.jar target/spark-extension*.jar
      echo "::endgroup::"
    shell: bash
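  # The check above feeds japi-compliance-checker the released jar from ~/.m2 as the
  # "old" library and the freshly built target/spark-extension*.jar as the "new" one;
  # it reports binary and source incompatibilities and writes its HTML reports under
  # compat_reports/, which the next step uploads.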
  - name: Upload Report
    uses: actions/upload-artifact@v4
    if: always() && steps.exists.outcome == 'success'
    with:
      name: Compat-Report-${{ inputs.spark-compat-version }}
      path: compat_reports/spark-extension/*

branding:
  icon: 'check-circle'
  color: 'green'

================================================
FILE: .github/actions/prime-caches/action.yml
================================================
name: 'Prime caches'
author: 'EnricoMi'
description: 'A GitHub Action that primes caches'

inputs:
  spark-version:
    description: Spark version, e.g. 3.4.0 or 3.4.0-SNAPSHOT
    required: true
  scala-version:
    description: Scala version, e.g. 2.12.15
    required: true
  spark-compat-version:
    description: Spark compatibility version, e.g. 3.4
    required: true
  scala-compat-version:
    description: Scala compatibility version, e.g. 2.12
    required: true
  java-compat-version:
    description: Java compatibility version, e.g. 8
    required: true
  hadoop-version:
    description: Hadoop version, e.g. 2.7 or 2
    required: true

runs:
  using: 'composite'
  steps:
  - name: Set versions in pom.xml
    run: |
      ./set-version.sh ${{ inputs.spark-version }} ${{ inputs.scala-version }}
      git diff
    shell: bash
  - name: Check Maven packages cache
    id: mvn-build-cache
    uses: actions/cache/restore@v4
    with:
      lookup-only: true
      path: ~/.m2/repository
      key: ${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-${{ hashFiles('pom.xml') }}
  - name: Check Spark Binaries cache
    id: spark-binaries-cache
    uses: actions/cache/restore@v4
    with:
      lookup-only: true
      path: ~/spark
      key: ${{ runner.os }}-spark-binaries-${{ inputs.spark-version }}-${{ inputs.scala-compat-version }}
  - name: Prepare priming caches
    id: setup
    run: |
      # Prepare priming caches
      if [[ "${{ inputs.spark-version }}" == *"-SNAPSHOT" ]] || [[ -z "${{ steps.mvn-build-cache.outputs.cache-hit }}" ]]
      then
        echo "prime-mvn-cache=true" >> "$GITHUB_ENV"
        echo "prime-some-cache=true" >> "$GITHUB_ENV"
      fi;
      if [[ "${{ inputs.spark-version }}" == *"-SNAPSHOT" ]] || [[ -z "${{ steps.spark-binaries-cache.outputs.cache-hit }}" ]]
      then
        echo "prime-spark-cache=true" >> "$GITHUB_ENV"
        echo "prime-some-cache=true" >> "$GITHUB_ENV"
      fi;
    shell: bash
  - name: Setup JDK ${{ inputs.java-compat-version }}
    if: env.prime-some-cache
    uses: actions/setup-java@v4
    with:
      java-version: ${{ inputs.java-compat-version }}
      distribution: 'zulu'
  - name: Build
    if: env.prime-mvn-cache
    env:
      JDK_JAVA_OPTIONS: --add-exports java.base/sun.nio.ch=ALL-UNNAMED --add-exports java.base/sun.util.calendar=ALL-UNNAMED
    run: |
      # Build
      echo "::group::mvn dependency:go-offline"
      mvn --batch-mode dependency:go-offline
      echo "::endgroup::"
    shell: bash
  - name: Save Maven packages cache
    if: env.prime-mvn-cache
    uses: actions/cache/save@v4
    with:
      path: ~/.m2/repository
      key: ${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-${{ hashFiles('pom.xml') }}-${{ github.run_id }}
  - name: Setup Spark Binaries
    if: env.prime-spark-cache && ! contains(inputs.spark-version, '-SNAPSHOT')
    env:
      SPARK_PACKAGE: spark-${{ inputs.spark-version }}/spark-${{ inputs.spark-version }}-bin-hadoop${{ inputs.hadoop-version }}${{ startsWith(inputs.spark-version, '3.') && inputs.scala-compat-version == '2.13' && '-scala2.13' || '' }}.tgz
    run: |
      wget --progress=dot:giga "https://www.apache.org/dyn/closer.lua/spark/${SPARK_PACKAGE}?action=download" -O - | tar -xzC "${{ runner.temp }}"
      archive=$(basename "${SPARK_PACKAGE}")
      bash -c "mv -v "${{ runner.temp }}/\${archive/%.tgz/}" ~/spark"
    shell: bash
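  # Worked example for the step above: with spark-version=3.5.8, hadoop-version=3 and
  # scala-compat-version=2.13, SPARK_PACKAGE resolves to
  #   spark-3.5.8/spark-3.5.8-bin-hadoop3-scala2.13.tgz
  # wget streams the mirror download straight into tar, and ${archive/%.tgz/} strips
  # the .tgz suffix, so spark-3.5.8-bin-hadoop3-scala2.13 is moved to ~/spark.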
  - name: Save Spark Binaries cache
    if: env.prime-spark-cache && ! contains(inputs.spark-version, '-SNAPSHOT')
    uses: actions/cache/save@v4
    with:
      path: ~/spark
      key: ${{ runner.os }}-spark-binaries-${{ inputs.spark-version }}-${{ inputs.scala-compat-version }}-${{ github.run_id }}

branding:
  icon: 'check-circle'
  color: 'green'

================================================
FILE: .github/actions/test-jvm/action.yml
================================================
name: 'Test JVM'
author: 'EnricoMi'
description: 'A GitHub Action that tests JVM spark-extension'

inputs:
  spark-version:
    description: Spark version, e.g. 3.4.0, 3.4.0-SNAPSHOT or 4.0.0-preview1
    required: true
  spark-compat-version:
    description: Spark compatibility version, e.g. 3.4
    required: true
  spark-archive-url:
    description: The URL to download the Spark binary distribution
    required: false
  scala-version:
    description: Scala version, e.g. 2.12.15
    required: true
  scala-compat-version:
    description: Scala compatibility version, e.g. 2.12
    required: true
  hadoop-version:
    description: Hadoop version, e.g. 2.7 or 2
    required: true
  java-compat-version:
    description: Java compatibility version, e.g. 8
    required: true

runs:
  using: 'composite'
  steps:
  - name: Fetch Binaries Artifact
    uses: actions/download-artifact@v4
    with:
      name: Binaries-${{ inputs.spark-compat-version }}-${{ inputs.scala-compat-version }}
      path: .
  - name: Set versions in pom.xml
    run: |
      ./set-version.sh ${{ inputs.spark-version }} ${{ inputs.scala-version }}
      git diff
    shell: bash
  - name: Restore Spark Binaries cache
    if: github.event_name != 'schedule' && ! contains(inputs.spark-version, '-SNAPSHOT')
    uses: actions/cache/restore@v4
    with:
      path: ~/spark
      key: ${{ runner.os }}-spark-binaries-${{ inputs.spark-version }}-${{ inputs.scala-compat-version }}
      restore-keys: |
        ${{ runner.os }}-spark-binaries-${{ inputs.spark-version }}-${{ inputs.scala-compat-version }}
  - name: Setup Spark Binaries
    if: ( ! contains(inputs.spark-version, '-SNAPSHOT') )
    env:
      SPARK_PACKAGE: spark-${{ inputs.spark-version }}/spark-${{ inputs.spark-version }}-bin-hadoop${{ inputs.hadoop-version }}${{ startsWith(inputs.spark-version, '3.') && inputs.scala-compat-version == '2.13' && '-scala2.13' || '' }}.tgz
    run: |
      # Setup Spark Binaries
      if [[ ! -e ~/spark ]]
      then
        url="${{ inputs.spark-archive-url }}"
        wget --progress=dot:giga "${url:-https://www.apache.org/dyn/closer.lua/spark/${SPARK_PACKAGE}?action=download}" -O - | tar -xzC "${{ runner.temp }}"
        archive=$(basename "${SPARK_PACKAGE}")
        bash -c "mv -v "${{ runner.temp }}/\${archive/%.tgz/}" ~/spark"
      fi
      echo "SPARK_HOME=$(cd ~/spark; pwd)" >> $GITHUB_ENV
    shell: bash
  - name: Restore Maven packages cache
    if: github.event_name != 'schedule'
    uses: actions/cache/restore@v4
    with:
      path: ~/.m2/repository
      key: ${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-${{ hashFiles('pom.xml') }}
      restore-keys: |
        ${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-${{ hashFiles('pom.xml') }}
        ${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-
  - name: Setup JDK ${{ inputs.java-compat-version }}
    uses: actions/setup-java@v4
    with:
      java-version: ${{ inputs.java-compat-version }}
      distribution: 'zulu'
  - name: Scala and Java Tests
    env:
      JDK_JAVA_OPTIONS: --add-exports java.base/sun.nio.ch=ALL-UNNAMED --add-exports java.base/sun.util.calendar=ALL-UNNAMED
    run: |
      # Scala and Java Tests
      echo "::group::mvn test"
      mvn --batch-mode --update-snapshots -Dspotless.check.skip test integration-test
      echo "::endgroup::"
    shell: bash
  - name: Upload Test Results
    if: always()
    uses: actions/upload-artifact@v4
    with:
      name: JVM Test Results (Spark ${{ inputs.spark-version }} Scala ${{ inputs.scala-version }})
      path: |
        target/surefire-*reports/*.xml

branding:
  icon: 'check-circle'
  color: 'green'

================================================
FILE: .github/actions/test-python/action.yml
================================================
name: 'Test Python'
author: 'EnricoMi'
description: 'A GitHub Action that tests Python spark-extension'

# pyspark is not available for snapshots or scala other than 2.12
# we would have to compile spark from sources for this, not worth it
# so this action only works with scala 2.12 and non-snapshot spark versions
inputs:
  spark-version:
    description: Spark version, e.g. 3.4.0 or 4.0.0-preview1
    required: true
  scala-version:
    description: Scala version, e.g. 2.12.15
    required: true
  spark-compat-version:
    description: Spark compatibility version, e.g. 3.4
    required: true
  spark-archive-url:
    description: The URL to download the Spark binary distribution
    required: false
  spark-package-repo:
    description: The URL of an alternate maven repository to fetch Spark packages
    required: false
  scala-compat-version:
    description: Scala compatibility version, e.g. 2.12
    required: true
  java-compat-version:
    description: Java compatibility version, e.g. 8
    required: true
  hadoop-version:
    description: Hadoop version, e.g. 2.7 or 2
    required: true
  python-version:
    description: Python version, e.g. 3.8
    required: true

runs:
  using: 'composite'
  steps:
  - name: Fetch Binaries Artifact
    uses: actions/download-artifact@v4
    with:
      name: Binaries-${{ inputs.spark-compat-version }}-${{ inputs.scala-compat-version }}
      path: .
  - name: Set versions in pom.xml
    run: |
      ./set-version.sh ${{ inputs.spark-version }} ${{ inputs.scala-version }}
      git diff
      SPARK_EXTENSION_VERSION=$(grep --max-count=1 "<version>.*</version>" pom.xml | sed -E -e "s/\s*<[^>]+>//g")
      echo "SPARK_EXTENSION_VERSION=$SPARK_EXTENSION_VERSION" | tee -a "$GITHUB_ENV"
    shell: bash
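  # Illustrative example of the grep/sed pair above: with pom.xml containing e.g.
  #   <version>2.14.0-3.5-SNAPSHOT</version>
  # the sed strips the XML tags and surrounding whitespace, leaving
  #   SPARK_EXTENSION_VERSION=2.14.0-3.5-SNAPSHOT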
  - name: Make this work with PySpark preview versions
    if: contains(inputs.spark-version, 'preview')
    run: |
      sed -i -e 's/\({spark_compat_version}.0\)"/\1.dev1"/' python/setup.py
      git diff python/setup.py
    shell: bash
  - name: Restore Spark Binaries cache
    if: github.event_name != 'schedule' && ( startsWith(inputs.spark-version, '3.') && inputs.scala-compat-version == '2.12' || startsWith(inputs.spark-version, '4.') ) && ! contains(inputs.spark-version, '-SNAPSHOT')
    uses: actions/cache/restore@v4
    with:
      path: ~/spark
      key: ${{ runner.os }}-spark-binaries-${{ inputs.spark-version }}-${{ inputs.scala-compat-version }}
      restore-keys: |
        ${{ runner.os }}-spark-binaries-${{ inputs.spark-version }}-${{ inputs.scala-compat-version }}
  - name: Setup Spark Binaries
    if: ( startsWith(inputs.spark-version, '3.') && inputs.scala-compat-version == '2.12' || startsWith(inputs.spark-version, '4.') ) && ! contains(inputs.spark-version, '-SNAPSHOT')
    env:
      SPARK_PACKAGE: spark-${{ inputs.spark-version }}/spark-${{ inputs.spark-version }}-bin-hadoop${{ inputs.hadoop-version }}${{ startsWith(inputs.spark-version, '3.') && inputs.scala-compat-version == '2.13' && '-scala2.13' || '' }}.tgz
    run: |
      # Setup Spark Binaries
      if [[ ! -e ~/spark ]]
      then
        url="${{ inputs.spark-archive-url }}"
        wget --progress=dot:giga "${url:-https://www.apache.org/dyn/closer.lua/spark/${SPARK_PACKAGE}?action=download}" -O - | tar -xzC "${{ runner.temp }}"
        archive=$(basename "${SPARK_PACKAGE}")
        bash -c "mv -v "${{ runner.temp }}/\${archive/%.tgz/}" ~/spark"
      fi
      echo "SPARK_BIN_HOME=$(cd ~/spark; pwd)" >> $GITHUB_ENV
    shell: bash
  - name: Restore Maven packages cache
    if: github.event_name != 'schedule'
    uses: actions/cache/restore@v4
    with:
      path: ~/.m2/repository
      key: ${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-${{ hashFiles('pom.xml') }}
      restore-keys: |
        ${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-${{ hashFiles('pom.xml') }}
        ${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-
  - name: Setup JDK ${{ inputs.java-compat-version }}
    uses: actions/setup-java@v4
    with:
      java-version: ${{ inputs.java-compat-version }}
      distribution: 'zulu'
  - name: Setup Python
    uses: actions/setup-python@v5
    with:
      python-version: ${{ inputs.python-version }}
  - name: Install Python dependencies
    run: |
      # Install Python dependencies
      echo "::group::pip install"
      python -m venv .pytest-venv
      .pytest-venv/bin/python -m pip install --upgrade pip
      .pytest-venv/bin/pip install pypandoc
      .pytest-venv/bin/pip install -e python/[test]
      echo "::endgroup::"
      PYSPARK_HOME=$(.pytest-venv/bin/python -c "import os; import pyspark; print(os.path.dirname(pyspark.__file__))")
      PYSPARK_BIN_HOME="$(cd ".pytest-venv/"; pwd)"
      PYSPARK_PYTHON="$PYSPARK_BIN_HOME/bin/python"
      echo "PYSPARK_HOME=$PYSPARK_HOME" | tee -a "$GITHUB_ENV"
      echo "PYSPARK_BIN_HOME=$PYSPARK_BIN_HOME" | tee -a "$GITHUB_ENV"
      echo "PYSPARK_PYTHON=$PYSPARK_PYTHON" | tee -a "$GITHUB_ENV"
    shell: bash
  - name: Prepare Poetry tests
    run: |
      # Prepare Poetry tests
      echo "::group::Prepare poetry tests"
      # install poetry in venv
      python -m venv .poetry-venv
      .poetry-venv/bin/python -m pip install poetry
      # env var needed by poetry tests
      echo "POETRY_PYTHON=$PWD/.poetry-venv/bin/python" | tee -a "$GITHUB_ENV"
      # clone example poetry project
      git clone https://github.com/Textualize/rich.git .rich
      cd .rich
      git reset --hard 20024635c06c22879fd2fd1e380ec4cccd9935dd
      # env var needed by poetry tests
      echo "RICH_SOURCES=$PWD" | tee -a "$GITHUB_ENV"
      echo "::endgroup::"
    shell: bash
  - name: Python Unit Tests
    env:
      SPARK_HOME: ${{ env.PYSPARK_HOME }}
      PYTHONPATH: python/test
    run: |
      .pytest-venv/bin/python -m pytest python/test --junit-xml test-results/pytest-$(date +%s.%N)-$RANDOM.xml
    shell: bash
  - name: Install Spark Extension
    run: |
      # Install Spark Extension
      echo "::group::mvn install"
      mvn --batch-mode --update-snapshots install -Dspotless.check.skip -DskipTests -Dmaven.test.skip=true -Dgpg.skip
      echo "::endgroup::"
    shell: bash
  - name: Start Spark Connect
    id: spark-connect
    if: ( contains('3.4,3.5', inputs.spark-compat-version) && inputs.scala-compat-version == '2.12' || startsWith(inputs.spark-version, '4.') ) && ! contains(inputs.spark-version, '-SNAPSHOT')
    env:
      SPARK_HOME: ${{ env.SPARK_BIN_HOME }}
      CONNECT_GRPC_BINDING_ADDRESS: 127.0.0.1
      CONNECT_GRPC_BINDING_PORT: 15002
    run: |
      # Start Spark Connect
      for attempt in {1..10}; do
        $SPARK_HOME/sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_${{ inputs.scala-compat-version }}:${{ inputs.spark-version }} --repositories "${{ inputs.spark-package-repo }}"
        sleep 10
        for log in $SPARK_HOME/logs/spark-*-org.apache.spark.sql.connect.service.SparkConnectServer-*.out; do
          echo "::group::Spark Connect server log: $log"
          eoc="EOC-$RANDOM"
          echo "::stop-commands::$eoc"
          cat "$log" || true
          echo "::$eoc::"
          echo "::endgroup::"
        done
        if netstat -an | grep 15002; then break; fi
        echo "::warning title=Starting Spark Connect server failed::Attempt #$attempt to start Spark Connect server failed"
        $SPARK_HOME/sbin/stop-connect-server.sh --packages org.apache.spark:spark-connect_${{ inputs.scala-compat-version }}:${{ inputs.spark-version }}
        sleep 5
      done
      if ! netstat -an | grep 15002; then
        echo "::error title=Starting Spark Connect server failed::All attempts to start Spark Connect server failed"
        exit 1
      fi
    shell: bash
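  # Note on the ::stop-commands:: lines in the step above: raw server logs could
  # contain text that GitHub Actions would interpret as workflow commands, so
  # command processing is suspended with a random token and only resumed via
  # ::$eoc:: once the log has been printed.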
  - name: Python Unit Tests (Spark Connect)
    if: steps.spark-connect.outcome == 'success'
    env:
      SPARK_HOME: ${{ env.PYSPARK_HOME }}
      PYTHONPATH: python/test
      TEST_SPARK_CONNECT_SERVER: sc://127.0.0.1:15002
    run: |
      # Python Unit Tests (Spark Connect)
      echo "::group::pip install"
      # .dev1 allows this to work with preview versions
      .pytest-venv/bin/pip install "pyspark[connect]~=${{ inputs.spark-compat-version }}.0.dev1"
      echo "::endgroup::"
      .pytest-venv/bin/python -m pytest python/test --junit-xml test-results-connect/pytest-$(date +%s.%N)-$RANDOM.xml
    shell: bash
  - name: Stop Spark Connect
    if: always() && steps.spark-connect.outcome == 'success'
    env:
      SPARK_HOME: ${{ env.SPARK_BIN_HOME }}
    run: |
      # Stop Spark Connect
      $SPARK_HOME/sbin/stop-connect-server.sh
      for log in $SPARK_HOME/logs/spark-*-org.apache.spark.sql.connect.service.SparkConnectServer-*.out; do
        echo "::group::Spark Connect server log: $log"
        eoc="EOC-$RANDOM"
        echo "::stop-commands::$eoc"
        cat "$log" || true
        echo "::$eoc::"
        echo "::endgroup::"
      done
    shell: bash
  - name: Upload Test Results
    if: always()
    uses: actions/upload-artifact@v4
    with:
      name: Python Test Results (Spark ${{ inputs.spark-version }} Scala ${{ inputs.scala-version }} Python ${{ inputs.python-version }})
      path: |
        test-results/*.xml
        test-results-connect/*.xml

branding:
  icon: 'check-circle'
  color: 'green'

================================================
FILE: .github/actions/test-release/action.yml
================================================
name: 'Test Release'
author: 'EnricoMi'
description: 'A GitHub Action that tests spark-extension release'

# pyspark is not available for snapshots or scala other than 2.12
# we would have to compile spark from sources for this, not worth it
# so this action only works with scala 2.12 and non-snapshot spark versions

inputs:
  spark-version:
    description: Spark version, e.g. 3.4.0 or 4.0.0-preview1
    required: true
  scala-version:
    description: Scala version, e.g. 2.12.15
    required: true
  spark-compat-version:
    description: Spark compatibility version, e.g. 3.4
    required: true
  spark-archive-url:
    description: The URL to download the Spark binary distribution
    required: false
  scala-compat-version:
    description: Scala compatibility version, e.g. 2.12
    required: true
  java-compat-version:
    description: Java compatibility version, e.g. 8
    required: true
  hadoop-version:
    description: Hadoop version, e.g. 2.7 or 2
    required: true
  python-version:
    description: Python version, e.g. 3.8
    default: ''
    required: false

runs:
  using: 'composite'
  steps:
  - name: Fetch Binaries Artifact
    uses: actions/download-artifact@v4
    with:
      name: Binaries-${{ inputs.spark-compat-version }}-${{ inputs.scala-compat-version }}
      path: .
  - name: Set versions in pom.xml
    run: |
      ./set-version.sh ${{ inputs.spark-version }} ${{ inputs.scala-version }}
      git diff
      SPARK_EXTENSION_VERSION=$(grep --max-count=1 "<version>.*</version>" pom.xml | sed -E -e "s/\s*<[^>]+>//g")
      echo "SPARK_EXTENSION_VERSION=$SPARK_EXTENSION_VERSION" | tee -a "$GITHUB_ENV"
    shell: bash
  - name: Restore Spark Binaries cache
    if: github.event_name != 'schedule'
    uses: actions/cache/restore@v4
    with:
      path: ~/spark
      key: ${{ runner.os }}-spark-binaries-${{ inputs.spark-version }}-${{ inputs.scala-compat-version }}
      restore-keys: |
        ${{ runner.os }}-spark-binaries-${{ inputs.spark-version }}-${{ inputs.scala-compat-version }}
  - name: Setup Spark Binaries
    env:
      SPARK_PACKAGE: spark-${{ inputs.spark-version }}/spark-${{ inputs.spark-version }}-bin-hadoop${{ inputs.hadoop-version }}${{ startsWith(inputs.spark-version, '3.') && inputs.scala-compat-version == '2.13' && '-scala2.13' || '' }}.tgz
    run: |
      # Setup Spark Binaries
      if [[ ! -e ~/spark ]]
      then
        url="${{ inputs.spark-archive-url }}"
        wget --progress=dot:giga "${url:-https://www.apache.org/dyn/closer.lua/spark/${SPARK_PACKAGE}?action=download}" -O - | tar -xzC "${{ runner.temp }}"
        archive=$(basename "${SPARK_PACKAGE}")
        bash -c "mv -v "${{ runner.temp }}/\${archive/%.tgz/}" ~/spark"
      fi
      echo "SPARK_BIN_HOME=$(cd ~/spark; pwd)" >> $GITHUB_ENV
    shell: bash
  - name: Restore Maven packages cache
    if: github.event_name != 'schedule'
    uses: actions/cache/restore@v4
    with:
      path: ~/.m2/repository
      key: ${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-${{ hashFiles('pom.xml') }}
      restore-keys: |
        ${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-${{ hashFiles('pom.xml') }}
        ${{ runner.os }}-mvn-build-${{ inputs.spark-version }}-${{ inputs.scala-version }}-
  - name: Setup JDK ${{ inputs.java-compat-version }}
    uses: actions/setup-java@v4
    with:
      java-version: ${{ inputs.java-compat-version }}
      distribution: 'zulu'
  - name: Diff App test
    env:
      SPARK_HOME: ${{ env.SPARK_BIN_HOME }}
    run: |
      # Diff App test
      echo "::group::spark-submit"
      $SPARK_HOME/bin/spark-submit --packages com.github.scopt:scopt_${{ inputs.scala-compat-version }}:4.1.0 target/spark-extension_*.jar --format parquet --id id src/test/files/test.parquet/file1.parquet src/test/files/test.parquet/file2.parquet diff.parquet
      echo
      echo "::endgroup::"
      echo "::group::spark-shell"
      $SPARK_HOME/bin/spark-shell <<< 'val df = spark.read.parquet("diff.parquet").orderBy($"id").groupBy($"diff").count; df.show; if (df.count != 2) sys.exit(1)'
      echo
      echo "::endgroup::"
    shell: bash
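  # What this asserts (a sketch, following the project's DIFF.md semantics): the Diff
  # app compares file1.parquet and file2.parquet on key column "id" and writes a
  # "diff" column with values N (no change), I (inserted), D (deleted) or C (changed);
  # grouping by "diff" must then yield exactly two distinct values for this test
  # data, otherwise the spark-shell snippet exits non-zero.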
  - name: Install Spark Extension
    run: |
      # Install Spark Extension
      echo "::group::mvn install"
      mvn --batch-mode --update-snapshots install -Dspotless.check.skip -DskipTests -Dmaven.test.skip=true -Dgpg.skip
      echo "::endgroup::"
    shell: bash
  - name: Fetch Release Test Dependencies
    run: |
      # Fetch Release Test Dependencies
      echo "::group::mvn dependency:get"
      mvn dependency:get -Dtransitive=false -Dartifact=org.apache.parquet:parquet-hadoop:1.16.0:jar:tests
      echo "::endgroup::"
    shell: bash
  - name: Scala Release Test
    env:
      SPARK_HOME: ${{ env.SPARK_BIN_HOME }}
    run: |
      # Scala Release Test
      echo "::group::spark-shell"
      $SPARK_BIN_HOME/bin/spark-shell --packages uk.co.gresearch.spark:spark-extension_${{ inputs.scala-compat-version }}:$SPARK_EXTENSION_VERSION --jars ~/.m2/repository/org/apache/parquet/parquet-hadoop/1.16.0/parquet-hadoop-1.16.0-tests.jar < test-release.scala
      echo
      echo "::endgroup::"
    shell: bash
  - name: Setup Python
    uses: actions/setup-python@v5
    if: inputs.python-version != ''
    with:
      python-version: ${{ inputs.python-version }}
  - name: Python Release Test
    if: inputs.python-version != ''
    env:
      SPARK_HOME: ${{ env.SPARK_BIN_HOME }}
    run: |
      # Python Release Test
      echo "::group::spark-submit"
      $SPARK_BIN_HOME/bin/spark-submit --packages uk.co.gresearch.spark:spark-extension_${{ inputs.scala-compat-version }}:$SPARK_EXTENSION_VERSION test-release.py
      echo
      echo "::endgroup::"
    shell: bash
  - name: Fetch Whl Artifact
    if: inputs.python-version != ''
    uses: actions/download-artifact@v4
    with:
      name: Whl (Spark ${{ inputs.spark-compat-version }} Scala ${{ inputs.scala-compat-version }})
      path: .
  - name: Install Python dependencies
    if: inputs.python-version != ''
    run: |
      # Install Python dependencies
      echo "::group::pip install"
      python -m venv .pytest-venv
      .pytest-venv/bin/python -m pip install --upgrade pip
      .pytest-venv/bin/pip install pypandoc
      .pytest-venv/bin/pip install $(ls pyspark_extension-*.whl)[test]
      echo "::endgroup::"
      PYSPARK_HOME=$(.pytest-venv/bin/python -c "import os; import pyspark; print(os.path.dirname(pyspark.__file__))")
      PYSPARK_BIN_HOME="$(cd ".pytest-venv/"; pwd)"
      PYSPARK_PYTHON="$PYSPARK_BIN_HOME/bin/python"
      echo "PYSPARK_HOME=$PYSPARK_HOME" | tee -a "$GITHUB_ENV"
      echo "PYSPARK_BIN_HOME=$PYSPARK_BIN_HOME" | tee -a "$GITHUB_ENV"
      echo "PYSPARK_PYTHON=$PYSPARK_PYTHON" | tee -a "$GITHUB_ENV"
    shell: bash
  - name: PySpark Release Test
    if: inputs.python-version != ''
    run: |
      .pytest-venv/bin/python3 test-release.py
    shell: bash
  - name: Python Integration Tests
    if: inputs.python-version != ''
    env:
      SPARK_HOME: ${{ env.PYSPARK_HOME }}
      PYTHONPATH: python:python/test
    run: |
      # Python Integration Tests
      source .pytest-venv/bin/activate
      find python/test -name 'test*.py' > tests
      while read test
      do
        echo "::group::spark-submit $test"
        if ! $PYSPARK_BIN_HOME/bin/spark-submit --master "local[2]" --packages uk.co.gresearch.spark:spark-extension_${{ inputs.scala-compat-version }}:$SPARK_EXTENSION_VERSION "$test" test-results-submit
        then
          state="fail"
        fi
        echo
        echo "::endgroup::"
      done < tests
      if [[ "$state" == "fail" ]]; then exit 1; fi
    shell: bash
  - name: Upload Test Results
    if: always() && inputs.python-version != ''
    uses: actions/upload-artifact@v4
    with:
      name: Python Release Test Results (Spark ${{ inputs.spark-version }} Scala ${{ inputs.scala-version }} Python ${{ inputs.python-version }})
      path: |
        test-results-submit/*.xml

branding:
  icon: 'check-circle'
  color: 'green'

================================================
FILE: .github/dependabot.yml
================================================
version: 2
updates:
- package-ecosystem: "github-actions"
  directory: "/"
  schedule:
    interval: "monthly"
- package-ecosystem: "maven"
  directory: "/"
  schedule:
    interval: "daily"

================================================
FILE: .github/show-spark-versions.sh
================================================
#!/bin/bash

base=$(cd "$(dirname "$0")"; pwd)

grep -- "-version" "$base"/workflows/prime-caches.yml | sed -e "s/ -//g" -e "s/ //g" -e "s/'//g" | grep -v -e "matrix" -e "]" | while read line
do
  IFS=":" read var compat_version <<< "$line"
  if [[ "$var" == "spark-compat-version" ]]
  then
    while read line
    do
      IFS=":" read var patch_version <<< "$line"
      if [[ "$var" == "spark-patch-version" ]]
      then
        echo -n "spark-version: $compat_version.$patch_version"
        read line
        if [[ "$line" == "spark-snapshot-version:true" ]]
        then
          echo "-SNAPSHOT"
        else
          echo
        fi
        break
      fi
    done
  fi
done > "$base"/workflows/prime-caches.yml.tmp
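# Collect every spark-version value used across the workflow files (plus the
# versions synthesized above from prime-caches.yml), strip the surrounding
# YAML/JSON decoration, and print a sorted, de-duplicated list of all Spark
# versions covered by CI.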
grep spark-version "$base"/workflows/*.yml "$base"/workflows/prime-caches.yml.tmp | cut -d : -f 2- | sed -e "s/^[ -]*//" -e "s/'//g" -e 's/{"params": {"//g' -e 's/params: {//g' -e 's/"//g' -e "s/,.*//" | grep "^spark-version" | grep -v "matrix" | sort | uniq

================================================
FILE: .github/workflows/build-jvm.yml
================================================
name: Build JVM

on:
  workflow_call:

jobs:
  build:
    name: Build (Spark ${{ matrix.spark-version }} Scala ${{ matrix.scala-version }})
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        include:
        - spark-version: '3.2.4'
          spark-compat-version: '3.2'
          scala-compat-version: '2.12'
          scala-version: '2.12.15'
          java-compat-version: '8'
          hadoop-version: '2.7'
        - spark-version: '3.3.4'
          spark-compat-version: '3.3'
          scala-compat-version: '2.12'
          scala-version: '2.12.15'
          java-compat-version: '8'
          hadoop-version: '3'
        - spark-version: '3.4.4'
          spark-compat-version: '3.4'
          scala-compat-version: '2.12'
          scala-version: '2.12.17'
          java-compat-version: '8'
          hadoop-version: '3'
        - spark-version: '3.5.8'
          spark-compat-version: '3.5'
          scala-compat-version: '2.12'
          scala-version: '2.12.18'
          java-compat-version: '8'
          hadoop-version: '3'
        - spark-version: '3.2.4'
          spark-compat-version: '3.2'
          scala-compat-version: '2.13'
          scala-version: '2.13.5'
          java-compat-version: '8'
          hadoop-version: '3.2'
        - spark-version: '3.3.4'
          spark-compat-version: '3.3'
          scala-compat-version: '2.13'
          scala-version: '2.13.8'
          java-compat-version: '8'
          hadoop-version: '3'
        - spark-version: '3.4.4'
          spark-compat-version: '3.4'
          scala-compat-version: '2.13'
          scala-version: '2.13.8'
          java-compat-version: '8'
          hadoop-version: '3'
        - spark-version: '3.5.8'
          spark-compat-version: '3.5'
          scala-compat-version: '2.13'
          scala-version: '2.13.8'
          java-compat-version: '8'
          hadoop-version: '3'
        - spark-version: '4.0.2'
          spark-compat-version: '4.0'
          scala-compat-version: '2.13'
          scala-version: '2.13.16'
          java-compat-version: '17'
          hadoop-version: '3'
        - spark-version: '4.1.1'
          spark-compat-version: '4.1'
          scala-compat-version: '2.13'
          scala-version: '2.13.17'
          java-compat-version: '17'
          hadoop-version: '3'
        - spark-version: '4.2.0-preview3'
          spark-compat-version: '4.2'
          scala-compat-version: '2.13'
          scala-version: '2.13.18'
          java-compat-version: '17'
          hadoop-version: '3'
    steps:
    - name: Checkout
      uses: actions/checkout@v4
    - name: Build
      uses: ./.github/actions/build
      with:
        spark-version: ${{ matrix.spark-version }}
        scala-version: ${{ matrix.scala-version }}
        spark-compat-version: ${{ matrix.spark-compat-version }}
        scala-compat-version: ${{ matrix.scala-compat-version }}
        java-compat-version: ${{ matrix.java-compat-version }}
        hadoop-version: ${{ matrix.hadoop-version }}

================================================
FILE: .github/workflows/build-python.yml
================================================
name: Build Python

on:
  workflow_call:

jobs:
  # pyspark<4 is not available for snapshots or scala other than 2.12
  whl:
    name: Build whl (Spark ${{ matrix.spark-version }} Scala ${{ matrix.scala-version }})
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        include:
        - spark-compat-version: '3.2'
          spark-version: '3.2.4'
          scala-compat-version: '2.12'
          scala-version: '2.12.15'
          java-compat-version: '8'
          python-version: '3.9'
        - spark-compat-version: '3.3'
          spark-version: '3.3.4'
          scala-compat-version: '2.12'
          scala-version: '2.12.15'
          java-compat-version: '8'
          python-version: '3.9'
        - spark-compat-version: '3.4'
          spark-version: '3.4.4'
          scala-compat-version: '2.12'
          scala-version: '2.12.17'
          java-compat-version: '8'
          python-version: '3.9'
        - spark-compat-version: '3.5'
          spark-version: '3.5.8'
          scala-compat-version: '2.12'
          scala-version: '2.12.18'
          java-compat-version: '8'
          python-version: '3.9'
        - spark-compat-version: '4.0'
          spark-version: '4.0.2'
          scala-compat-version: '2.13'
          scala-version: '2.13.16'
          java-compat-version: '17'
          python-version: '3.9'
        - spark-version: '4.1.1'
          spark-compat-version: '4.1'
          scala-compat-version: '2.13'
          scala-version: '2.13.17'
          java-compat-version: '17'
          hadoop-version: '3'
          python-version: '3.10'
        - spark-version: '4.2.0-preview3'
          spark-compat-version: '4.2'
          scala-compat-version: '2.13'
          scala-version: '2.13.18'
          java-compat-version: '17'
          hadoop-version: '3'
          python-version: '3.10'
    steps:
    - name: Checkout
      uses: actions/checkout@v4
    - name: Build
      uses: ./.github/actions/build-whl
      with:
        spark-version: ${{ matrix.spark-version }}
        scala-version: ${{ matrix.scala-version }}
        spark-compat-version: ${{ matrix.spark-compat-version }}
        scala-compat-version: ${{ matrix.scala-compat-version }}
        java-compat-version: ${{ matrix.java-compat-version }}
        python-version: ${{ matrix.python-version }}

================================================
FILE: .github/workflows/build-snapshots.yml
================================================
name: Build Snapshots

on:
  workflow_call:

jobs:
  build:
    name: Build (Spark ${{ matrix.spark-version }} Scala ${{ matrix.scala-version }})
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        include:
        - spark-compat-version: '3.2'
          spark-version: '3.2.5-SNAPSHOT'
          scala-compat-version: '2.12'
          scala-version: '2.12.15'
          java-compat-version: '8'
        - spark-compat-version: '3.3'
          spark-version: '3.3.5-SNAPSHOT'
          scala-compat-version: '2.12'
          scala-version: '2.12.15'
          java-compat-version: '8'
        - spark-compat-version: '3.4'
          spark-version: '3.4.5-SNAPSHOT'
          scala-compat-version: '2.12'
          scala-version: '2.12.17'
          java-compat-version: '8'
        - spark-compat-version: '3.5'
          spark-version: '3.5.9-SNAPSHOT'
          scala-compat-version: '2.12'
          scala-version: '2.12.18'
          java-compat-version: '8'
        - spark-compat-version: '3.2'
          spark-version: '3.2.5-SNAPSHOT'
          scala-compat-version: '2.13'
          scala-version: '2.13.5'
          java-compat-version: '8'
        - spark-compat-version: '3.3'
          spark-version: '3.3.5-SNAPSHOT'
          scala-compat-version: '2.13'
          scala-version: '2.13.8'
          java-compat-version: '8'
        - spark-compat-version: '3.4'
          spark-version: '3.4.5-SNAPSHOT'
          scala-compat-version: '2.13'
          scala-version: '2.13.8'
          java-compat-version: '8'
        - spark-compat-version: '3.5'
          spark-version: '3.5.9-SNAPSHOT'
          scala-compat-version: '2.13'
          scala-version: '2.13.8'
          java-compat-version: '8'
        - spark-compat-version: '4.0'
          spark-version: '4.0.3-SNAPSHOT'
          scala-compat-version: '2.13'
          scala-version: '2.13.16'
          java-compat-version: '17'
        - spark-compat-version: '4.1'
          spark-version: '4.1.2-SNAPSHOT'
          scala-compat-version: '2.13'
          scala-version: '2.13.17'
          java-compat-version: '17'
        - spark-compat-version: '4.2'
          spark-version: '4.2.0-SNAPSHOT'
          scala-compat-version: '2.13'
          scala-version: '2.13.18'
          java-compat-version: '17'
    steps:
    - name: Checkout
      uses: actions/checkout@v4
    - name: Build
      uses: ./.github/actions/build
      with:
        spark-version: ${{ matrix.spark-version }}
        scala-version: ${{ matrix.scala-version }}
        spark-compat-version: ${{ matrix.spark-compat-version }}-SNAPSHOT
        scala-compat-version: ${{ matrix.scala-compat-version }}
        java-compat-version: ${{ matrix.java-compat-version }}

================================================
FILE: .github/workflows/check.yml
================================================
name: Check

on:
  workflow_call:

jobs:
  lint:
    name: Scala lint
    runs-on: ubuntu-latest
    steps:
    - name: Checkout
      uses: actions/checkout@v4
      with:
        fetch-depth: 0
    - name: Setup JDK 11
      uses: actions/setup-java@v4
      with:
        java-version: '11'
        distribution: 'zulu'
    - name: Check
      id: check
      run: |
        mvn --batch-mode --update-snapshots spotless:check
      shell: bash
    - name: Changes
      if: failure() && steps.check.outcome == 'failure'
      run: |
        mvn --batch-mode --update-snapshots spotless:apply
        git diff
      shell: bash

  config:
    name: Configure compat
    runs-on: ubuntu-latest
    outputs:
      major-version: ${{ steps.versions.outputs.major-version }}
      release-version: ${{ steps.versions.outputs.release-version }}
      release-major-version: ${{ steps.versions.outputs.release-major-version }}
    steps:
    - name: Checkout
      uses: actions/checkout@v4
      with:
        fetch-depth: 0
    - name: Get versions
      id: versions
      run: |
        version=$(grep -m1 "<version>" pom.xml | sed -e "s/<[^>]*>//g" -e "s/ //g")
        echo "version: $version"
        echo "major-version: ${version/.*/}"
        echo "version=$version" >> "$GITHUB_OUTPUT"
        echo "major-version=${version/.*/}" >> "$GITHUB_OUTPUT"
        release_version=$(git tag | grep "^v" | sort --version-sort | tail -n1 | sed "s/^v//")
        echo "release-version: $release_version"
        echo "release-major-version: ${release_version/.*/}"
        echo "release-version=$release_version" >> "$GITHUB_OUTPUT"
        echo "release-major-version=${release_version/.*/}" >> "$GITHUB_OUTPUT"
      shell: bash
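    # A gloss on the expansions above: ${version/.*/} deletes everything from the
    # first "." onward (the pattern is a glob, not a regex), so version=2.14.0
    # yields major-version=2; the same applies to release_version.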
  compat:
    name: Compat (Spark ${{ matrix.spark-compat-version }} Scala ${{ matrix.scala-compat-version }})
    needs: config
    runs-on: ubuntu-latest
    if: needs.config.outputs.major-version == needs.config.outputs.release-major-version
    strategy:
      fail-fast: false
      matrix:
        include:
        - spark-compat-version: '3.2'
          spark-version: '3.2.4'
          scala-compat-version: '2.12'
          scala-version: '2.12.15'
        - spark-compat-version: '3.3'
          spark-version: '3.3.4'
          scala-compat-version: '2.12'
          scala-version: '2.12.15'
        - spark-compat-version: '3.4'
          scala-compat-version: '2.12'
          scala-version: '2.12.17'
          spark-version: '3.4.4'
        - spark-compat-version: '3.5'
          scala-compat-version: '2.12'
          scala-version: '2.12.18'
          spark-version: '3.5.8'
        - spark-compat-version: '4.0'
          scala-compat-version: '2.13'
          scala-version: '2.13.16'
          spark-version: '4.0.2'
        - spark-compat-version: '4.1'
          scala-compat-version: '2.13'
          scala-version: '2.13.17'
          spark-version: '4.1.1'
    steps:
    - name: Checkout
      uses: actions/checkout@v4
    - name: Check
      uses: ./.github/actions/check-compat
      with:
        spark-version: ${{ matrix.spark-version }}
        scala-version: ${{ matrix.scala-version }}
        spark-compat-version: ${{ matrix.spark-compat-version }}
        scala-compat-version: ${{ matrix.scala-compat-version }}
        package-version: ${{ needs.config.outputs.release-version }}

================================================
FILE: .github/workflows/ci.yml
================================================
name: CI

on:
  schedule:
    - cron: '0 8 */10 * *'
  push:
    branches:
      - 'master'
    tags:
      - '*'
  merge_group:
  pull_request:
  workflow_dispatch:

jobs:
  event_file:
    name: "Event File"
    runs-on: ubuntu-latest
    steps:
    - name: Upload
      uses: actions/upload-artifact@v4
      with:
        name: Event File
        path: ${{ github.event_path }}

  build-jvm:
    name: "Build JVM"
    uses: "./.github/workflows/build-jvm.yml"

  build-snapshots:
    name: "Build Snapshots"
    uses: "./.github/workflows/build-snapshots.yml"

  build-python:
    name: "Build Python"
    needs: build-jvm
    uses: "./.github/workflows/build-python.yml"

  test-jvm:
    name: "Test JVM"
    needs: build-jvm
    uses: "./.github/workflows/test-jvm.yml"

  test-python:
    name: "Test Python"
    needs: build-jvm
    uses: "./.github/workflows/test-python.yml"

  test-snapshots-jvm:
    name: "Test Snapshots"
    needs: build-snapshots
    uses: "./.github/workflows/test-snapshots.yml"

  test-release:
    name: "Test Release"
    needs: build-jvm
    uses: "./.github/workflows/test-release.yml"

  check:
    name: "Check"
    needs: build-jvm
    uses: "./.github/workflows/check.yml"

  # A single job that succeeds if all jobs listed under 'needs' succeed.
  # This allows configuring a single job as a required check.
  # The needed jobs can then be changed through pull requests.
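  # For example, if only test-python fails, RESULTS evaluates to
  # "success,success,success,failure,success": the "Success" step is skipped and
  # the "Failure" step runs `false`, failing this job and the required check with it.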
  test_success:
    name: "Test success"
    if: always()
    runs-on: ubuntu-latest
    # the if clauses below have to reflect the number of jobs listed here
    needs: [build-jvm, build-python, test-jvm, test-python, test-release]
    env:
      RESULTS: ${{ join(needs.*.result, ',') }}
    steps:
    - name: "Success"
      # we expect all required jobs to have success result
      if: env.RESULTS == 'success,success,success,success,success'
      run: true
      shell: bash
    - name: "Failure"
      # we expect all required jobs to have success result, fail otherwise
      if: env.RESULTS != 'success,success,success,success,success'
      run: false
      shell: bash

================================================
FILE: .github/workflows/clear-caches.yaml
================================================
name: Clear caches

on:
  workflow_dispatch:

permissions:
  actions: write

jobs:
  clear-cache:
    runs-on: ubuntu-latest
    steps:
    - name: Clear caches
      uses: actions/github-script@v7
      with:
        script: |
          const caches = await github.paginate(
            github.rest.actions.getActionsCacheList.endpoint.merge({
              owner: context.repo.owner,
              repo: context.repo.repo,
            })
          )
          for (const cache of caches) {
            console.log(cache)
            github.rest.actions.deleteActionsCacheById({
              owner: context.repo.owner,
              repo: context.repo.repo,
              cache_id: cache.id,
            })
          }

================================================
FILE: .github/workflows/prepare-release.yml
================================================
name: Prepare release

on:
  workflow_dispatch:
    inputs:
      github_release_latest:
        description: 'Make the created GitHub release the latest'
        required: false
        default: true
        type: boolean

jobs:
  get-version:
    name: Get version
    runs-on: ubuntu-latest
    outputs:
      release-tag: ${{ steps.versions.outputs.release-tag }}
      is-snapshot: ${{ steps.versions.outputs.is-snapshot }}
    steps:
    - name: Checkout code
      uses: actions/checkout@v4
      with:
        fetch-depth: 0
    - name: Get versions
      id: versions
      run: |
        # get release version
        version=$(grep --max-count=1 "<version>.*</version>" pom.xml | sed -E -e "s/\s*<[^>]+>//g" -e "s/-SNAPSHOT//" -e "s/-[0-9.]+//g")
        is_snapshot=$(if grep -q "<version>.*-SNAPSHOT</version>" pom.xml; then echo "true"; else echo "false"; fi)
        # share versions
        echo "release-tag=v${version}" >> "$GITHUB_OUTPUT"
        echo "is-snapshot=$is_snapshot" >> "$GITHUB_OUTPUT"

  prepare-release:
    name: Prepare release
    runs-on: ubuntu-latest
    if: ( ! github.event.repository.fork )
    needs: get-version
    # secrets are provided by environment
    environment:
      name: tagged
      url: 'https://github.com/G-Research/spark-extension?version=${{ needs.get-version.outputs.release-tag }}'
    steps:
    - name: Create GitHub App token
      uses: actions/create-github-app-token@v2
      id: app-token
      with:
        app-id: ${{ vars.APP_ID }}
        private-key: ${{ secrets.PRIVATE_KEY }}
        # required to push to a branch
        permission-contents: write
    - name: Get GitHub App User ID
      id: get-user-id
      run: echo "user-id=$(gh api "/users/${{ steps.app-token.outputs.app-slug }}[bot]" --jq .id)" >> "$GITHUB_OUTPUT"
      env:
        GH_TOKEN: ${{ steps.app-token.outputs.token }}
    - name: Checkout code
      uses: actions/checkout@v4
      with:
        token: ${{ steps.app-token.outputs.token }}
        fetch-depth: 0
    - name: Check branch setup
      run: |
        # Check branch setup
        if [[ "$GITHUB_REF" != "refs/heads/master" ]] && [[ "$GITHUB_REF" != "refs/heads/master-"* ]]
        then
          echo "This workflow must be run on master or master-* branch, not $GITHUB_REF"
          exit 1
        fi
    - name: Tag and bump version
      if: needs.get-version.outputs.is-snapshot
      run: |
        # check for unreleased entry in CHANGELOG.md
        readarray -t changes < <(grep -A 100 "^## \[UNRELEASED\] - YYYY-MM-DD" CHANGELOG.md | grep -B 100 --max-count=1 -E "^## \[[0-9.]+\]" | grep "^-")
        if [ ${#changes[@]} -eq 0 ]
        then
          echo "Did not find any changes in CHANGELOG.md under '## [UNRELEASED] - YYYY-MM-DD'"
          exit 1
        fi

        # get latest and release version
        latest=$(grep --max-count=1 "<version>.*</version>" README.md | sed -E -e "s/\s*<[^>]+>//g" -e "s/-[0-9.]+//g")
        version=$(grep --max-count=1 "<version>.*</version>" pom.xml | sed -E -e "s/\s*<[^>]+>//g" -e "s/-SNAPSHOT//" -e "s/-[0-9.]+//g")

        # update changelog
        echo "Releasing ${#changes[@]} changes as version $version:"
        for (( i=0; i<${#changes[@]}; i++ )); do echo "${changes[$i]}" ; done
        sed -i "s/## \[UNRELEASED\] - YYYY-MM-DD/## [$version] - $(date +%Y-%m-%d)/" CHANGELOG.md
        sed -i -e "s/$latest-/$version-/g" -e "s/$latest\./$version./g" README.md PYSPARK-DEPS.md python/README.md
        ./set-version.sh $version

        # configure git so we can commit changes
        git config --global user.name '${{ steps.app-token.outputs.app-slug }}[bot]'
        git config --global user.email '${{ steps.get-user-id.outputs.user-id }}+${{ steps.app-token.outputs.app-slug }}[bot]@users.noreply.github.com'

        # commit changes to local repo
        echo "Committing release to local git"
        git add pom.xml python/setup.py CHANGELOG.md README.md PYSPARK-DEPS.md python/README.md
        git commit -m "Releasing $version"
        git tag -a "v${version}" -m "Release v${version}"

        # bump version
        # define function to bump version
        function next_version {
          local version=$1
          local branch=$2

          patch=${version/*./}
          majmin=${version%.${patch}}
          if [[ $branch == "master" ]]
          then
            # minor version bump
            if [[ $version != *".0" ]]
            then
              echo "version is patch version, should be M.m.0: $version" >&2
              exit 1
            fi
            maj=${version/.*/}
            min=${majmin#${maj}.}
            next=${maj}.$((min+1)).0
            echo "$next"
          else
            # patch version bump
            next=${majmin}.$((patch+1))
            echo "$next"
          fi
        }
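        # worked examples: next_version 2.13.0 master -> 2.14.0 (minor bump);
        # next_version 2.13.1 master -> error, since master expects a M.m.0 version;
        # on any master-* maintenance branch, next_version 2.13.1 -> 2.13.2 (patch bump)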
        # get next version
        pkg_version="${version/-*/}"
        branch=$(git rev-parse --abbrev-ref HEAD)
        next_pkg_version="$(next_version "$pkg_version" "$branch")"

        # bump the version
        echo "Bump version to $next_pkg_version"
        ./set-version.sh $next_pkg_version-SNAPSHOT

        # commit changes to local repo
        echo "Committing release to local git"
        git commit -a -m "Post-release version bump to $next_pkg_version"

        # push all commits and tag to origin
        echo "Pushing release commit and tag to origin"
        git push origin "$GITHUB_REF_NAME" "v${version}" --tags
        # NOTE: This push will not trigger a CI as we are using GITHUB_TOKEN to push
        # More info on: https://docs.github.com/en/actions/using-workflows/triggering-a-workflow#triggering-a-workflow-from-a-workflow

  github-release:
    name: Create GitHub release
    runs-on: ubuntu-latest
    needs:
      - get-version
      - prepare-release
    permissions:
      contents: write # required to create release
    steps:
    - name: Checkout release tag
      uses: actions/checkout@v4
      with:
        ref: ${{ needs.get-version.outputs.release-tag }}
    - name: Extract release notes
      id: release-notes
      run: |
        awk '/^## /{if(seen==1)exit; seen++} seen' CHANGELOG.md > ./release-notes.txt
        # Grab release name
        name=$(grep -m 1 "^## " CHANGELOG.md | sed "s/^## //")
        echo "release_name=$name" >> $GITHUB_OUTPUT
        # provide release notes file path as output
        echo "release_notes_path=release-notes.txt" >> $GITHUB_OUTPUT
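    # The awk one-liner above prints everything from the first "## " heading in
    # CHANGELOG.md up to, but excluding, the second one: `seen` becomes 1 at the
    # first heading, lines are printed while `seen` is truthy, and processing exits
    # at the next heading, leaving exactly the latest release's notes in the file.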
    - name: Publish GitHub release
      uses: ncipollo/release-action@2c591bcc8ecdcd2db72b97d6147f871fcd833ba5
      id: github-release
      with:
        name: ${{ steps.release-notes.outputs.release_name }}
        bodyFile: ${{ steps.release-notes.outputs.release_notes_path }}
        makeLatest: ${{ inputs.github_release_latest }}
        tag: ${{ needs.get-version.outputs.release-tag }}
        token: ${{ github.token }}

================================================
FILE: .github/workflows/prime-caches.yml
================================================
name: Prime caches

on:
  workflow_dispatch:

jobs:
  prime:
    name: Spark ${{ matrix.spark-compat-version }}.${{ matrix.spark-patch-version }}${{ matrix.spark-snapshot-version && '-SNAPSHOT' }} Scala ${{ matrix.scala-version }}
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      # keep in-sync with .github/workflows/test-jvm.yml
      matrix:
        include:
        - spark-compat-version: '3.2'
          scala-compat-version: '2.12'
          scala-version: '2.12.15'
          spark-patch-version: '4'
          hadoop-version: '2.7'
        - spark-compat-version: '3.3'
          scala-compat-version: '2.12'
          scala-version: '2.12.15'
          spark-patch-version: '4'
          hadoop-version: '3'
        - spark-compat-version: '3.4'
          scala-compat-version: '2.12'
          scala-version: '2.12.17'
          spark-patch-version: '4'
          hadoop-version: '3'
        - spark-compat-version: '3.5'
          scala-compat-version: '2.12'
          scala-version: '2.12.18'
          spark-patch-version: '8'
          hadoop-version: '3'
        - spark-compat-version: '3.2'
          scala-compat-version: '2.13'
          scala-version: '2.13.5'
          spark-patch-version: '4'
          hadoop-version: '3.2'
        - spark-compat-version: '3.3'
          scala-compat-version: '2.13'
          scala-version: '2.13.8'
          spark-patch-version: '4'
          hadoop-version: '3'
        - spark-compat-version: '3.4'
          scala-compat-version: '2.13'
          scala-version: '2.13.8'
          spark-patch-version: '4'
          hadoop-version: '3'
        - spark-compat-version: '3.5'
          scala-compat-version: '2.13'
          scala-version: '2.13.8'
          spark-patch-version: '8'
          hadoop-version: '3'
        - spark-compat-version: '4.0'
          scala-compat-version: '2.13'
          scala-version: '2.13.16'
          spark-patch-version: '2'
          java-compat-version: '17'
          hadoop-version: '3'
        - spark-compat-version: '4.1'
          scala-compat-version: '2.13'
          scala-version: '2.13.17'
          spark-patch-version: '1'
          java-compat-version: '17'
          hadoop-version: '3'
        - spark-compat-version: '4.2'
          scala-compat-version: '2.13'
          scala-version: '2.13.18'
          spark-patch-version: '0-preview3'
          java-compat-version: '17'
          hadoop-version: '3'
        - spark-compat-version: '3.2'
          scala-compat-version: '2.12'
          scala-version: '2.12.15'
          spark-patch-version: '5'
          spark-snapshot-version: true
          hadoop-version: '2.7'
        - spark-compat-version: '3.3'
          scala-compat-version: '2.12'
          scala-version: '2.12.15'
          spark-patch-version: '5'
          spark-snapshot-version: true
          hadoop-version: '3'
        - spark-compat-version: '3.4'
          scala-compat-version: '2.12'
          scala-version: '2.12.17'
          spark-patch-version: '5'
          spark-snapshot-version: true
          hadoop-version: '3'
        - spark-compat-version: '3.5'
          scala-compat-version: '2.12'
          scala-version: '2.12.18'
          spark-patch-version: '9'
          spark-snapshot-version: true
          hadoop-version: '3'
        - spark-compat-version: '3.2'
          scala-compat-version: '2.13'
          scala-version: '2.13.5'
          spark-patch-version: '5'
          spark-snapshot-version: true
          hadoop-version: '3.2'
        - spark-compat-version: '3.3'
          scala-compat-version: '2.13'
          scala-version: '2.13.8'
          spark-patch-version: '5'
          spark-snapshot-version: true
          hadoop-version: '3'
        - spark-compat-version: '3.4'
          scala-compat-version: '2.13'
          scala-version: '2.13.8'
          spark-patch-version: '5'
          spark-snapshot-version: true
          hadoop-version: '3'
        - spark-compat-version: '3.5'
          scala-compat-version: '2.13'
          scala-version: '2.13.8'
          spark-patch-version: '9'
          spark-snapshot-version: true
          hadoop-version: '3'
        - spark-compat-version: '4.0'
          scala-compat-version: '2.13'
          scala-version: '2.13.16'
          spark-patch-version: '3'
          spark-snapshot-version: true
          hadoop-version: '3'
        - spark-compat-version: '4.1'
          scala-compat-version: '2.13'
          scala-version: '2.13.17'
          spark-patch-version: '2'
          spark-snapshot-version: true
          hadoop-version: '3'
        - spark-compat-version: '4.2'
          scala-compat-version: '2.13'
          scala-version: '2.13.18'
          spark-patch-version: '0'
          spark-snapshot-version: true
          hadoop-version: '3'
    steps:
    - name: Checkout
      uses: actions/checkout@v4
    - name: Prime caches
      uses: ./.github/actions/prime-caches
      with:
        spark-version: ${{ matrix.spark-compat-version }}.${{ matrix.spark-patch-version }}${{ matrix.spark-snapshot-version && '-SNAPSHOT' }}
        scala-version: ${{ matrix.scala-version }}
        spark-compat-version: ${{ matrix.spark-compat-version }}
        scala-compat-version: ${{ matrix.scala-compat-version }}
        hadoop-version: ${{ matrix.hadoop-version }}
        java-compat-version: '8'

================================================
FILE: .github/workflows/publish-release.yml
================================================
name: Publish release

on:
  workflow_dispatch:
    inputs:
      versions:
        required: true
        type: string
        description: 'Example: {"include": [{"params": {"spark-version": "4.0.0","scala-version": "2.13.16"}}]}'
        default: |
          {
            "include": [
              {"params": {"spark-version": "3.2.4", "scala-version": "2.12.15", "java-compat-version": "8"}},
              {"params": {"spark-version": "3.3.4", "scala-version": "2.12.15", "java-compat-version": "8"}},
              {"params": {"spark-version": "3.4.4", "scala-version": "2.12.17", "java-compat-version": "8"}},
              {"params": {"spark-version": "3.5.8", "scala-version": "2.12.18", "java-compat-version": "8"}},
              {"params": {"spark-version": "3.2.4", "scala-version": "2.13.5", "java-compat-version": "8"}},
              {"params": {"spark-version": "3.3.4", "scala-version": "2.13.8", "java-compat-version": "8"}},
              {"params": {"spark-version": "3.4.4", "scala-version": "2.13.8", "java-compat-version": "8"}},
              {"params": {"spark-version": "3.5.8", "scala-version": "2.13.8", "java-compat-version": "8"}},
              {"params": {"spark-version": "4.0.2", "scala-version": "2.13.16", "java-compat-version": "17"}},
              {"params": {"spark-version": "4.1.1", "scala-version": "2.13.17", "java-compat-version": "17"}}
            ]
          }

env:
  # PySpark 3 versions only work with Python 3.9
  PYTHON_VERSION: "3.9"

jobs:
  get-version:
    name: Get version
    runs-on: ubuntu-latest
    outputs:
      release-tag: ${{ steps.versions.outputs.release-tag }}
      is-snapshot: ${{ steps.versions.outputs.is-snapshot }}
    steps:
    - name: Checkout release tag
      uses: actions/checkout@v4
    - name: Get versions
      id: versions
--max-count=1 ".*" pom.xml | sed -E -e "s/\s*<[^>]+>//g" -e "s/-SNAPSHOT//" -e "s/-[0-9.]+//g") is_snapshot=$(if grep -q ".*-SNAPSHOT" pom.xml; then echo "true"; else echo "false"; fi) # share versions echo "release-tag=v${version}" >> "$GITHUB_OUTPUT" echo "is-snapshot=$is_snapshot" >> "$GITHUB_OUTPUT" - name: Check tag setup run: | # Check tag setup if [[ "$GITHUB_REF" != "refs/tags/v"* ]] then echo "This workflow must be run on a tag, not $GITHUB_REF" exit 1 fi if [ "${{ steps.versions.outputs.is-snapshot }}" == "true" ] then echo "This is a tagged SNAPSHOT version. This is not allowed for release!" exit 1 fi if [ "${{ github.ref_name }}" != "${{ steps.versions.outputs.release-tag }}" ] then echo "The version in the pom.xml is ${{ steps.versions.outputs.release-tag }}" echo "This tag is ${{ github.ref_name }}, which is different!" exit 1 fi - name: Show matrix run: | echo '${{ github.event.inputs.versions }}' | jq . maven-release: name: Publish maven release (Spark ${{ matrix.params.spark-version }}, Scala ${{ matrix.params.scala-version }}) runs-on: ubuntu-latest needs: get-version if: ( ! github.event.repository.fork ) # secrets are provided by environment environment: name: release # a different URL for each point in the matrix, but the same URLs accross commits url: 'https://github.com/G-Research/spark-extension?version=${{ needs.get-version.outputs.release-tag }}&spark=${{ matrix.params.spark-version }}&scala=${{ matrix.params.scala-version }}&package=maven' permissions: {} strategy: fail-fast: false matrix: ${{ fromJson(github.event.inputs.versions) }} steps: - name: Checkout release tag uses: actions/checkout@v4 - name: Set up JDK and publish to Maven Central uses: actions/setup-java@3a4f6e1af504cf6a31855fa899c6aa5355ba6c12 # v4.7.0 with: java-version: ${{ matrix.params.java-compat-version }} distribution: 'corretto' server-id: central server-username: MAVEN_USERNAME server-password: MAVEN_PASSWORD gpg-private-key: ${{ secrets.MAVEN_GPG_PRIVATE_KEY }} gpg-passphrase: MAVEN_GPG_PASSPHRASE - name: Inspect GPG run: gpg -k - name: Restore Maven packages cache id: cache-maven uses: actions/cache/restore@v4 with: path: ~/.m2/repository key: ${{ runner.os }}-mvn-build-${{ matrix.params.spark-version }}-${{ matrix.params.scala-version }}-${{ hashFiles('pom.xml') }} restore-keys: | ${{ runner.os }}-mvn-build-${{ matrix.params.spark-version }}-${{ matrix.params.scala-version }}-${{ hashFiles('pom.xml') }} ${{ runner.os }}-mvn-build-${{ matrix.params.spark-version }}-${{ matrix.params.scala-version }}- - name: Publish maven artifacts id: publish-maven run: | ./set-version.sh ${{ matrix.params.spark-version }} ${{ matrix.params.scala-version }} mvn clean deploy -Dsign -Dspotless.check.skip -DskipTests -Dmaven.test.skip=true env: MAVEN_USERNAME: ${{ secrets.MAVEN_USERNAME }} MAVEN_PASSWORD: ${{ secrets.MAVEN_PASSWORD }} MAVEN_GPG_PASSPHRASE: ${{ secrets.MAVEN_GPG_PASSPHRASE}} pypi-release: name: Publish PyPi release (Spark ${{ matrix.params.spark-version }}, Scala ${{ matrix.params.scala-version }}) runs-on: ubuntu-latest needs: get-version if: ( ! 
github.event.repository.fork ) # secrets are provided by environment environment: name: release # a different URL for each point in the matrix, but the same URLs across commits url: 'https://github.com/G-Research/spark-extension?version=${{ needs.get-version.outputs.release-tag }}&spark=${{ matrix.params.spark-version }}&scala=${{ matrix.params.scala-version }}&package=pypi' permissions: id-token: write # required for PyPI publish strategy: fail-fast: false matrix: ${{ fromJson(github.event.inputs.versions) }} steps: - name: Checkout release tag uses: actions/checkout@v4 - name: Set up JDK uses: actions/setup-java@3a4f6e1af504cf6a31855fa899c6aa5355ba6c12 # v4.7.0 with: java-version: ${{ matrix.params.java-compat-version }} distribution: 'corretto' - uses: actions/setup-python@v5 with: python-version: ${{ env.PYTHON_VERSION }} - name: Restore Maven packages cache id: cache-maven uses: actions/cache/restore@v4 with: path: ~/.m2/repository key: ${{ runner.os }}-mvn-build-${{ matrix.params.spark-version }}-${{ matrix.params.scala-version }}-${{ hashFiles('pom.xml') }} restore-keys: | ${{ runner.os }}-mvn-build-${{ matrix.params.spark-version }}-${{ matrix.params.scala-version }}-${{ hashFiles('pom.xml') }} ${{ runner.os }}-mvn-build-${{ matrix.params.spark-version }}-${{ matrix.params.scala-version }}- - name: Build maven artifacts id: maven if: startsWith(matrix.params.spark-version, '3.') && startsWith(matrix.params.scala-version, '2.12.') || startsWith(matrix.params.spark-version, '4.') && startsWith(matrix.params.scala-version, '2.13.') run: | ./set-version.sh ${{ matrix.params.spark-version }} ${{ matrix.params.scala-version }} mvn clean package -Dspotless.check.skip -DskipTests -Dmaven.test.skip=true - name: Prepare PyPI package id: prepare-pypi-package if: steps.maven.outcome == 'success' run: | ./build-whl.sh - name: Publish package distributions to PyPI uses: pypa/gh-action-pypi-publish@release/v1 if: steps.prepare-pypi-package.outcome == 'success' with: packages-dir: python/dist skip-existing: true verbose: true ================================================ FILE: .github/workflows/publish-snapshot.yml ================================================ name: Publish snapshot on: workflow_dispatch: push: branches: ["master"] env: PYTHON_VERSION: "3.10" jobs: check-version: name: Check SNAPSHOT version if: ( !
github.event.repository.fork ) runs-on: ubuntu-latest permissions: {} outputs: is-snapshot: ${{ steps.check.outputs.is-snapshot }} steps: - name: Checkout code uses: actions/checkout@v4 - name: Check if this is a SNAPSHOT version id: check run: | # check is snapshot version if grep -q "<version>.*-SNAPSHOT</version>" pom.xml then echo "Version in pom IS a SNAPSHOT version" echo "is-snapshot=true" >> "$GITHUB_OUTPUT" else echo "Version in pom is NOT a SNAPSHOT version" echo "is-snapshot=false" >> "$GITHUB_OUTPUT" fi snapshot: name: Snapshot Spark ${{ matrix.params.spark-version }} Scala ${{ matrix.params.scala-version }} needs: check-version # when we release from master, this workflow will see a commit that does not have a SNAPSHOT version # we want this workflow to skip over that commit if: needs.check-version.outputs.is-snapshot == 'true' runs-on: ubuntu-latest # secrets are provided by environment environment: name: snapshot # a different URL for each point in the matrix, but the same URLs across commits url: 'https://github.com/G-Research/spark-extension?spark=${{ matrix.params.spark-version }}&scala=${{ matrix.params.scala-version }}&snapshot' permissions: {} strategy: fail-fast: false matrix: include: - params: {"spark-version": "3.2.4", "scala-version": "2.12.15", "scala-compat-version": "2.12", "java-compat-version": "8"} - params: {"spark-version": "3.3.4", "scala-version": "2.12.15", "scala-compat-version": "2.12", "java-compat-version": "8"} - params: {"spark-version": "3.4.4", "scala-version": "2.12.17", "scala-compat-version": "2.12", "java-compat-version": "8"} - params: {"spark-version": "3.5.8", "scala-version": "2.12.18", "scala-compat-version": "2.12", "java-compat-version": "8"} - params: {"spark-version": "3.2.4", "scala-version": "2.13.5", "scala-compat-version": "2.13", "java-compat-version": "8"} - params: {"spark-version": "3.3.4", "scala-version": "2.13.8", "scala-compat-version": "2.13", "java-compat-version": "8"} - params: {"spark-version": "3.4.4", "scala-version": "2.13.8", "scala-compat-version": "2.13", "java-compat-version": "8"} - params: {"spark-version": "3.5.8", "scala-version": "2.13.8", "scala-compat-version": "2.13", "java-compat-version": "8"} - params: {"spark-version": "4.0.2", "scala-version": "2.13.16", "scala-compat-version": "2.13", "java-compat-version": "17"} - params: {"spark-version": "4.1.1", "scala-version": "2.13.17", "scala-compat-version": "2.13", "java-compat-version": "17"} steps: - name: Checkout code uses: actions/checkout@v4 - name: Set up JDK and publish to Maven Central uses: actions/setup-java@3a4f6e1af504cf6a31855fa899c6aa5355ba6c12 # v4.7.0 with: java-version: ${{ matrix.params.java-compat-version }} distribution: 'corretto' server-id: central server-username: MAVEN_USERNAME server-password: MAVEN_PASSWORD gpg-private-key: ${{ secrets.MAVEN_GPG_PRIVATE_KEY }} gpg-passphrase: MAVEN_GPG_PASSPHRASE - name: Inspect GPG run: gpg -k - uses: actions/setup-python@v5 with: python-version: ${{ env.PYTHON_VERSION }} - name: Restore Maven packages cache id: cache-maven uses: actions/cache/restore@v4 with: path: ~/.m2/repository key: ${{ runner.os }}-mvn-build-${{ matrix.params.spark-version }}-${{ matrix.params.scala-version }}-${{ hashFiles('pom.xml') }} restore-keys: | ${{ runner.os }}-mvn-build-${{ matrix.params.spark-version }}-${{ matrix.params.scala-version }}-${{ hashFiles('pom.xml') }} ${{ runner.os }}-mvn-build-${{ matrix.params.spark-version }}-${{ matrix.params.scala-version }}- - name: Publish snapshot run: | ./set-version.sh ${{
matrix.params.spark-version }} ${{ matrix.params.scala-version }} mvn clean deploy -Dsign -Dspotless.check.skip -DskipTests -Dmaven.test.skip=true env: MAVEN_USERNAME: ${{ secrets.MAVEN_USERNAME }} MAVEN_PASSWORD: ${{ secrets.MAVEN_PASSWORD }} MAVEN_GPG_PASSPHRASE: ${{ secrets.MAVEN_GPG_PASSPHRASE }} - name: Prepare PyPI package to test snapshot if: startsWith(matrix.params.scala-version, '2.12') run: | # Build whl ./build-whl.sh - name: Restore Spark Binaries cache uses: actions/cache/restore@v4 with: path: ~/spark key: ${{ runner.os }}-spark-binaries-${{ matrix.params.spark-version }}-${{ matrix.params.scala-compat-version }} restore-keys: | ${{ runner.os }}-spark-binaries-${{ matrix.params.spark-version }}-${{ matrix.params.scala-compat-version }} - name: Rename Spark Binaries cache run: | mv ~/spark ./spark-${{ matrix.params.spark-version }}-${{ matrix.params.scala-compat-version }} - name: Test snapshot id: test-package run: | # Test the snapshot (needs whl) ./test-release.sh ================================================ FILE: .github/workflows/test-jvm.yml ================================================ name: Test JVM on: workflow_call: jobs: test: name: Test (Spark ${{ matrix.spark-compat-version }}.${{ matrix.spark-patch-version }} Scala ${{ matrix.scala-version }}) runs-on: ubuntu-latest strategy: fail-fast: false # keep in-sync with .github/workflows/prime-caches.yml matrix: include: - spark-compat-version: '3.2' scala-compat-version: '2.12' scala-version: '2.12.15' spark-patch-version: '4' java-compat-version: '8' hadoop-version: '2.7' - spark-compat-version: '3.3' scala-compat-version: '2.12' scala-version: '2.12.15' spark-patch-version: '4' java-compat-version: '8' hadoop-version: '3' - spark-compat-version: '3.4' scala-compat-version: '2.12' scala-version: '2.12.17' spark-patch-version: '4' java-compat-version: '8' hadoop-version: '3' - spark-compat-version: '3.5' scala-compat-version: '2.12' scala-version: '2.12.18' spark-patch-version: '7' java-compat-version: '8' hadoop-version: '3' - spark-compat-version: '3.2' scala-compat-version: '2.13' scala-version: '2.13.5' spark-patch-version: '4' java-compat-version: '8' hadoop-version: '3.2' - spark-compat-version: '3.3' scala-compat-version: '2.13' scala-version: '2.13.8' spark-patch-version: '4' java-compat-version: '8' hadoop-version: '3' - spark-compat-version: '3.4' scala-compat-version: '2.13' scala-version: '2.13.8' spark-patch-version: '4' java-compat-version: '8' hadoop-version: '3' - spark-compat-version: '3.5' scala-compat-version: '2.13' scala-version: '2.13.8' spark-patch-version: '7' java-compat-version: '8' hadoop-version: '3' - spark-compat-version: '4.0' scala-compat-version: '2.13' scala-version: '2.13.16' spark-patch-version: '2' java-compat-version: '17' hadoop-version: '3' - spark-compat-version: '4.1' scala-compat-version: '2.13' scala-version: '2.13.17' spark-patch-version: '1' java-compat-version: '17' hadoop-version: '3' - spark-compat-version: '4.2' scala-compat-version: '2.13' scala-version: '2.13.18' spark-patch-version: '0-preview3' java-compat-version: '17' hadoop-version: '3' steps: - name: Checkout uses: actions/checkout@v4 - name: Test uses: ./.github/actions/test-jvm env: CI_SLOW_TESTS: 1 with: spark-version: ${{ matrix.spark-compat-version }}.${{ matrix.spark-patch-version }} scala-version: ${{ matrix.scala-version }} spark-compat-version: ${{ matrix.spark-compat-version }} spark-archive-url: ${{ matrix.spark-archive-url }} scala-compat-version: ${{ matrix.scala-compat-version }}
java-compat-version: ${{ matrix.java-compat-version }} hadoop-version: ${{ matrix.hadoop-version }} ================================================ FILE: .github/workflows/test-python.yml ================================================ name: Test Python on: workflow_call: jobs: # pyspark is not available for snapshots or scala other than 2.12 # we would have to compile spark from sources for this, not worth it test: name: Test (Spark ${{ matrix.spark-version }} Scala ${{ matrix.scala-version }} Python ${{ matrix.python-version }}) runs-on: ubuntu-latest strategy: fail-fast: false matrix: spark-compat-version: ['3.2', '3.3', '3.4', '3.5', '4.0'] python-version: ['3.9', '3.10', '3.11', '3.12', '3.13'] include: - spark-compat-version: '3.2' spark-version: '3.2.4' scala-compat-version: '2.12' scala-version: '2.12.15' java-compat-version: '8' hadoop-version: '2.7' - spark-compat-version: '3.3' spark-version: '3.3.4' scala-compat-version: '2.12' scala-version: '2.12.15' java-compat-version: '8' hadoop-version: '3' - spark-compat-version: '3.4' spark-version: '3.4.4' scala-compat-version: '2.12' scala-version: '2.12.17' java-compat-version: '8' hadoop-version: '3' - spark-compat-version: '3.5' spark-version: '3.5.8' scala-compat-version: '2.12' scala-version: '2.12.18' java-compat-version: '8' hadoop-version: '3' - spark-compat-version: '4.0' spark-version: '4.0.2' scala-compat-version: '2.13' scala-version: '2.13.16' java-compat-version: '17' hadoop-version: '3' - spark-compat-version: '4.1' spark-version: '4.1.1' scala-compat-version: '2.13' scala-version: '2.13.17' java-compat-version: '17' hadoop-version: '3' python-version: '3.10' - spark-compat-version: '4.2' spark-version: '4.2.0-preview3' scala-compat-version: '2.13' scala-version: '2.13.18' java-compat-version: '17' hadoop-version: '3' python-version: '3.10' exclude: - spark-compat-version: '3.2' python-version: '3.10' - spark-compat-version: '3.2' python-version: '3.11' - spark-compat-version: '3.2' python-version: '3.12' - spark-compat-version: '3.2' python-version: '3.13' - spark-compat-version: '3.3' python-version: '3.11' - spark-compat-version: '3.3' python-version: '3.12' - spark-compat-version: '3.3' python-version: '3.13' - spark-compat-version: '3.4' python-version: '3.12' - spark-compat-version: '3.4' python-version: '3.13' - spark-compat-version: '3.5' python-version: '3.12' - spark-compat-version: '3.5' python-version: '3.13' steps: - name: Checkout uses: actions/checkout@v4 - name: Test uses: ./.github/actions/test-python with: spark-version: ${{ matrix.spark-version }} scala-version: ${{ matrix.scala-version }} spark-compat-version: ${{ matrix.spark-compat-version }} spark-archive-url: ${{ matrix.spark-archive-url }} spark-package-repo: ${{ matrix.spark-package-repo }} scala-compat-version: ${{ matrix.scala-compat-version }} java-compat-version: ${{ matrix.java-compat-version }} hadoop-version: ${{ matrix.hadoop-version }} python-version: ${{ matrix.python-version }} ================================================ FILE: .github/workflows/test-release.yml ================================================ name: Test release on: workflow_call: jobs: test: name: Test Release Spark ${{ matrix.spark-version }} Scala ${{ matrix.scala-version }} runs-on: ubuntu-latest strategy: fail-fast: false matrix: include: - spark-compat-version: '3.2' spark-version: '3.2.4' scala-compat-version: '2.12' scala-version: '2.12.15' java-compat-version: '8' hadoop-version: '2.7' python-version: '3.9' - spark-compat-version: '3.3' spark-version: 
'3.3.4' scala-compat-version: '2.12' scala-version: '2.12.15' java-compat-version: '8' hadoop-version: '3' python-version: '3.10' - spark-compat-version: '3.4' spark-version: '3.4.4' scala-compat-version: '2.12' scala-version: '2.12.17' java-compat-version: '8' hadoop-version: '3' python-version: '3.11' - spark-compat-version: '3.5' spark-version: '3.5.8' scala-compat-version: '2.12' scala-version: '2.12.18' java-compat-version: '8' hadoop-version: '3' python-version: '3.11' - spark-compat-version: '3.2' spark-version: '3.2.4' scala-compat-version: '2.13' scala-version: '2.13.5' java-compat-version: '8' hadoop-version: '3.2' - spark-compat-version: '3.3' spark-version: '3.3.4' scala-compat-version: '2.13' scala-version: '2.13.8' java-compat-version: '8' hadoop-version: '3' - spark-compat-version: '3.4' spark-version: '3.4.4' scala-compat-version: '2.13' scala-version: '2.13.8' java-compat-version: '8' hadoop-version: '3' - spark-compat-version: '3.5' spark-version: '3.5.8' scala-compat-version: '2.13' scala-version: '2.13.8' java-compat-version: '8' hadoop-version: '3' - spark-compat-version: '4.0' spark-version: '4.0.2' scala-compat-version: '2.13' scala-version: '2.13.16' java-compat-version: '17' hadoop-version: '3' python-version: '3.13' - spark-compat-version: '4.1' spark-version: '4.1.1' scala-compat-version: '2.13' scala-version: '2.13.17' java-compat-version: '17' hadoop-version: '3' python-version: '3.13' - spark-compat-version: '4.2' spark-version: '4.2.0-preview3' scala-compat-version: '2.13' scala-version: '2.13.18' java-compat-version: '17' hadoop-version: '3' python-version: '3.13' steps: - name: Checkout uses: actions/checkout@v4 - name: Test uses: ./.github/actions/test-release with: spark-version: ${{ matrix.spark-version }} scala-version: ${{ matrix.scala-version }} spark-compat-version: ${{ matrix.spark-compat-version }} spark-archive-url: ${{ matrix.spark-archive-url }} scala-compat-version: ${{ matrix.scala-compat-version }} java-compat-version: ${{ matrix.java-compat-version }} hadoop-version: ${{ matrix.hadoop-version }} python-version: ${{ matrix.python-version }} ================================================ FILE: .github/workflows/test-results.yml ================================================ name: Test Results on: workflow_run: workflows: ["CI"] types: - completed permissions: {} jobs: publish-test-results: name: Publish Test Results runs-on: ubuntu-latest if: github.event.workflow_run.conclusion != 'skipped' permissions: checks: write pull-requests: write steps: - name: Download and Extract Artifacts uses: dawidd6/action-download-artifact@09f2f74827fd3a8607589e5ad7f9398816f540fe with: run_id: ${{ github.event.workflow_run.id }} name: "^Event File$| Test Results " name_is_regexp: true path: artifacts - name: Publish Test Results uses: EnricoMi/publish-unit-test-result-action@v2 with: commit: ${{ github.event.workflow_run.head_sha }} event_file: artifacts/Event File/event.json event_name: ${{ github.event.workflow_run.event }} files: "artifacts/* Test Results*/**/*.xml" ================================================ FILE: .github/workflows/test-snapshots.yml ================================================ name: Test Snapshots on: workflow_call: jobs: test: name: Test (Spark ${{ matrix.spark-version }} Scala ${{ matrix.scala-version }}) runs-on: ubuntu-latest strategy: fail-fast: false matrix: include: - spark-compat-version: '3.2' spark-version: '3.2.5-SNAPSHOT' scala-compat-version: '2.12' scala-version: '2.12.15' java-compat-version: '8' - 
spark-compat-version: '3.3' spark-version: '3.3.5-SNAPSHOT' scala-compat-version: '2.12' scala-version: '2.12.15' java-compat-version: '8' - spark-compat-version: '3.4' spark-version: '3.4.5-SNAPSHOT' scala-compat-version: '2.12' scala-version: '2.12.17' java-compat-version: '8' - spark-compat-version: '3.5' spark-version: '3.5.9-SNAPSHOT' scala-compat-version: '2.12' scala-version: '2.12.18' java-compat-version: '8' - spark-compat-version: '3.2' spark-version: '3.2.5-SNAPSHOT' scala-compat-version: '2.13' scala-version: '2.13.5' java-compat-version: '8' - spark-compat-version: '3.3' spark-version: '3.3.5-SNAPSHOT' scala-compat-version: '2.13' scala-version: '2.13.8' java-compat-version: '8' - spark-compat-version: '3.4' spark-version: '3.4.5-SNAPSHOT' scala-compat-version: '2.13' scala-version: '2.13.8' java-compat-version: '8' - spark-compat-version: '3.5' spark-version: '3.5.9-SNAPSHOT' scala-compat-version: '2.13' scala-version: '2.13.8' java-compat-version: '8' - spark-compat-version: '4.0' spark-version: '4.0.3-SNAPSHOT' scala-compat-version: '2.13' scala-version: '2.13.16' java-compat-version: '17' - spark-compat-version: '4.1' spark-version: '4.1.2-SNAPSHOT' scala-compat-version: '2.13' scala-version: '2.13.17' java-compat-version: '17' - spark-compat-version: '4.2' spark-version: '4.2.0-SNAPSHOT' scala-compat-version: '2.13' scala-version: '2.13.18' java-compat-version: '17' steps: - name: Checkout uses: actions/checkout@v4 - name: Test uses: ./.github/actions/test-jvm env: CI_SLOW_TESTS: 1 with: spark-version: ${{ matrix.spark-version }} scala-version: ${{ matrix.scala-version }} spark-compat-version: ${{ matrix.spark-compat-version }}-SNAPSHOT scala-compat-version: ${{ matrix.scala-compat-version }} java-compat-version: ${{ matrix.java-compat-version }} ================================================ FILE: .gitignore ================================================ # use glob syntax. syntax: glob *.ser *.class *~ *.bak #*.off *.old # eclipse conf file .settings .classpath .project .manager .scala_dependencies # idea .idea *.iml # building target build null tmp* temp* dist test-output build.log # other scm .svn .CVS .hg* # switch to regexp syntax. # syntax: regexp # ^\.pc/ #SHITTY output not in target directory build.log # project specific python/**/__pycache__ spark-* .cache ================================================ FILE: .scalafmt.conf ================================================ version = 3.7.17 runner.dialect = scala213 rewrite.trailingCommas.style = keep docstrings.style = Asterisk maxColumn = 120 ================================================ FILE: CHANGELOG.md ================================================ # Changelog All notable changes to this project will be documented in this file. The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/). ## [2.15.0] - 2025-12-13 ### Added - Support encrypted parquet files (#324) ### Changed - Remove support for Spark 3.0 and Spark 3.1 (#332) - Make all undocumented unintended public API parts private (#331) - Reading Parquet metadata can use a Parquet Hadoop version different from the version that comes with Spark (#330) ## [2.14.2] - 2025-07-21 ### Changed - Fixed release process (#320) ## [2.14.1] - 2025-07-17 ### Changed - Fixed release process (#319) ## [2.14.0] - 2025-07-17 ### Added - Support for Spark 4.0 (#269, #272, #293) ### Changed - Improve backticks (#265) New: This escapes backticks that already exist in column names.
Change: This does not quote columns that only contain letters, numbers and underscores, which were quoted before. - Move Python dependencies into `setup.py`, build jar from `setup.py` (#301) ## [2.13.0] - 2024-11-04 ### Fixed - Support diff for Spark Connect implemented via PySpark Dataset API (#251) ### Added - Add ignore columns to diff in Python API (#252) - Check that the Java / Scala package is installed when needed by Python (#250) ## [2.12.0] - 2024-04-26 ### Fixed - Diff change column should respect comparators (#238) ### Changed - Make create_temporary_dir work with pyspark-extension only (#222). This allows [installing PIP packages and Poetry projects](PYSPARK-DEPS.md) via the pure Python spark-extension package (Maven package not required any more). - Add map diff comparator to Python API (#226) ## [2.11.0] - 2024-01-04 ### Added - Add count_null aggregate function (#206) - Support reading parquet schema (#208) - Add more columns to reading parquet metadata (#209, #211) - Provide groupByKey shortcuts for groupBy.as (#213) - Allow to install PIP packages into PySpark job (#215) - Allow to install Poetry projects into PySpark job (#216) ## [2.10.0] - 2023-09-27 ### Fixed - Update setup.py to include parquet methods in python package (#191) ### Added - Add --statistics option to diff app (#189) - Add --filter option to diff app (#190) ## [2.9.0] - 2023-08-23 ### Added - Add key order sensitive map comparator (#187) ### Changed - Use dataset encoder rather than implicit value encoder for implicit dataset extension class (#183) ### Fixed - Fix key-sensitivity in map comparator (#186) ## [2.8.0] - 2023-05-24 ### Added - Add method to set and automatically unset Spark job description. (#172) - Add column function that converts between .Net (C#, F#, Visual Basic) `DateTime.Ticks` and Spark timestamp / Unix epoch timestamps. (#153) ## [2.7.0] - 2023-05-05 ### Added - Spark app to diff files or tables and write result back to file or table. (#160) - Add null value count to `parquetBlockColumns` and `parquet_block_columns`. (#162) - Add `parallelism` argument to Parquet metadata methods. (#164) ### Changed - Change data type of column name in `parquetBlockColumns` and `parquet_block_columns` to array of strings. Cast to string to get earlier behaviour (string column name). (#162) ## [2.6.0] - 2023-04-11 ### Added - Add reader for parquet metadata. (#154) ## [2.5.0] - 2023-03-23 ### Added - Add whitespace agnostic diff comparator. (#137) - Add Python whl package build. (#151) ## [2.4.0] - 2022-12-08 ### Added - Allow for custom diff equality. (#127) ### Fixed - Fix Python API calling into Scala code. (#132) ## [2.3.0] - 2022-10-26 ### Added - Add diffWith to Scala, Java and Python Diff API. (#109) ### Changed - Diff similar Datasets with ignoreColumns. Before, only similar DataFrames could be diffed with ignoreColumns. (#111) ### Fixed - Cache before writing via partitionedBy to work around SPARK-40588. Unpersist via UnpersistHandle. (#124) ## [2.2.0] - 2022-07-21 ### Added - Add (global) row numbers transformation to Scala, Java and Python API. (#97) ### Removed - Removed support for Python 3.6 ## [2.1.0] - 2022-04-07 ### Added - Add sorted group methods to Dataset. (#76) ## [2.0.0] - 2021-10-29 ### Added - Add support for Spark 3.2 and Scala 2.13. - Support to ignore columns in diff API. (#63) ### Removed - Removed support for Spark 2.4. ## [1.3.3] - 2020-12-17 ### Added - Add support for Spark 3.1. ## [1.3.2] - 2020-12-17 ### Changed - Refine conditional transformation helper methods.
## [1.3.1] - 2020-12-10 ### Changed - Refine conditional transformation helper methods. ## [1.3.0] - 2020-12-07 ### Added - Add transformation to compute histogram. (#26) - Add conditional transformation helper methods. (#27) - Add partitioned writing helpers that simplify writing optimally ordered partitioned data. (#29) ## [1.2.0] - 2020-10-06 ### Added - Add diff modes (#22): column-by-column, side-by-side, left and right side diff modes. - Add sparse mode (#23): diff DataFrame contains only changed values. ## [1.1.0] - 2020-08-24 ### Added - Add Python API for Diff transformation. - Add change column to Diff transformation providing column names of all changed columns in a row. - Add fluent methods to change immutable diff options. - Add `backticks` method to handle column names that contain dots (`.`). ## [1.0.0] - 2020-03-12 ### Added - Add Diff transformation for Datasets. ================================================ FILE: CONDITIONAL.md ================================================ # DataFrame Transformations The Spark `Dataset` API allows for chaining transformations as in the following example: ```scala ds.where($"id" === 1) .withColumn("state", lit("new")) .orderBy($"timestamp") ``` When you define additional transformation functions, the `Dataset` API allows you to also fluently call into those: ```scala def transformation(df: DataFrame): DataFrame = df.distinct ds.transform(transformation) ``` Here are some methods that extend this principle to conditional calls. ## Conditional Transformations You can run a transformation after checking a condition with a chain of fluent transformation calls: ```scala import uk.co.gresearch._ val condition = true val result = ds.where($"id" === 1) .withColumn("state", lit("new")) .when(condition).call(transformation) .orderBy($"timestamp") ``` rather than ```scala val condition = true val filteredDf = ds.where($"id" === 1) .withColumn("state", lit("new")) val condDf = if (condition) filteredDf.transform(transformation) else filteredDf val result = condDf.orderBy($"timestamp") ``` In case you need an else transformation as well, try: ```scala import uk.co.gresearch._ val condition = true val result = ds.where($"id" === 1) .withColumn("state", lit("new")) .on(condition).either(transformation).or(other) .orderBy($"timestamp") ``` ## Fluent and conditional functions elsewhere The same fluent notation works for instances other than `Dataset` or `DataFrame`, e.g. for the `DataFrameWriter`: ```scala def writeData[T](writer: DataFrameWriter[T]): Unit = { ... } ds.write .when(compress).call(_.option("compression", "gzip")) .call(writeData) ``` ================================================ FILE: DIFF.md ================================================ # Spark Diff Add the following `import` to your Scala code: ```scala import uk.co.gresearch.spark.diff._ ``` or this `import` to your Python code: ```python # noinspection PyUnresolvedReferences from gresearch.spark.diff import * ``` This adds a `diff` transformation to `Dataset` and `DataFrame` that computes the differences between two datasets / dataframes, i.e. which rows of one dataset / dataframe to _add_, _delete_ or _change_ to get to the other dataset / dataframe.
For example, in Scala ```scala val left = Seq((1, "one"), (2, "two"), (3, "three")).toDF("id", "value") val right = Seq((1, "one"), (2, "Two"), (4, "four")).toDF("id", "value") ``` or in Python: ```python left = spark.createDataFrame([(1, "one"), (2, "two"), (3, "three")], ["id", "value"]) right = spark.createDataFrame([(1, "one"), (2, "Two"), (4, "four")], ["id", "value"]) ``` diffing becomes as easy as: ```scala left.diff(right).show() ``` |diff |id |value | |:---:|:---:|:-----:| | N| 1| one| | D| 2| two| | I| 2| Two| | D| 3| three| | I| 4| four| With columns that provide unique identifiers per row (here `id`), the diff looks like: ```scala left.diff(right, "id").show() ``` |diff |id |left_value|right_value| |:---:|:---:|:--------:|:---------:| | N| 1| one| one| | C| 2| two| Two| | D| 3| three| *null*| | I| 4| *null*| four| An equivalent alternative is this hand-crafted transformation (Scala): ```scala left.withColumn("exists", lit(1)).as("l") .join(right.withColumn("exists", lit(1)).as("r"), $"l.id" <=> $"r.id", "fullouter") .withColumn("diff", when($"l.exists".isNull, "I"). when($"r.exists".isNull, "D"). when(!($"l.value" <=> $"r.value"), "C"). otherwise("N")) .show() ``` Statistics on the differences can be obtained by: ```scala left.diff(right, "id").groupBy("diff").count().show() ``` |diff |count | |:----:|:-----:| | N| 1| | I| 1| | D| 1| | C| 1| The `diff` transformation can optionally provide a *change column* that lists all non-id column names that have changed. This column is an array of strings and only set for `"N"` and `"C"` action rows; it is *null* for `"I"` and `"D"` action rows. |diff |changes|id |left_value|right_value| |:---:|:-----:|:---:|:--------:|:---------:| | N| []| 1| one| one| | C|[value]| 2| two| Two| | D| *null*| 3| three| *null*| | I| *null*| 4| *null*| four|
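The change column is requested through the diff options described [below](#configuring-diff). A minimal sketch in Scala, using the `withChangeColumn` option shown there (the variable names are illustrative):

```scala
import uk.co.gresearch.spark.diff._

// request a "changes" column that lists the names of all changed non-id columns
val optionsWithChanges = DiffOptions.default.withChangeColumn("changes")
left.diff(right, optionsWithChanges, "id").show()
```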
## Features This `diff` transformation provides the following features: * id columns are optional * provides typed `diffAs` and `diffWith` transformations * supports *null* values in id and non-id columns * detects *null* value insertion / deletion * [configurable](#configuring-diff) via `DiffOptions`: * diff column name (default: `"diff"`), e.g. in case the default name already exists in the diff result schema * diff action labels (defaults: `"N"`, `"I"`, `"D"`, `"C"`), allows custom diff notation, e.g. Unix diff left-right notation (<, >) or git before-after format (+, -, -+) * [custom equality operators](#comparators-equality) (e.g. double comparison with epsilon threshold) * [different diff result formats](#diffing-modes) * [sparse diffing mode](#sparse-mode) * optionally provides a *change column* that lists all non-id column names that have changed (only for `"C"` action rows) * guarantees that no duplicate columns exist in the result, throws a readable exception otherwise ## Configuring Diff Diffing can be configured via an optional `DiffOptions` instance (see [Methods](#methods) below). |option |default |description| |--------------------|:-------:|-----------| |`diffColumn` |`"diff"` |The 'diff column' provides the action or diff value encoding whether the respective row has been inserted, changed, deleted or has not been changed at all.| |`leftColumnPrefix` |`"left"` |Non-id columns of the 'left' dataset are prefixed with this prefix.| |`rightColumnPrefix` |`"right"`|Non-id columns of the 'right' dataset are prefixed with this prefix.| |`insertDiffValue` |`"I"` |Inserted rows are marked with this string in the 'diff column'.| |`changeDiffValue` |`"C"` |Changed rows are marked with this string in the 'diff column'.| |`deleteDiffValue` |`"D"` |Deleted rows are marked with this string in the 'diff column'.| |`nochangeDiffValue` |`"N"` |Unchanged rows are marked with this string in the 'diff column'.| |`changeColumn` |*none* |An array with the names of all columns that have changed values is provided in this column (only for unchanged and changed rows, *null* otherwise).| |`diffMode` |`DiffModes.Default`|Configures the diff output format. For details see the [Diffing Modes](#diffing-modes) section below.| |`sparseMode` |`false` |When `true`, only values that have changed are provided on left and right side, `null` is used for un-changed values.| |`defaultComparator` |`DiffComparators.default()`|The default equality for all value columns.| |`dataTypeComparators`|_empty_ |Map from data types to comparators.| |`columnNameComparators`|_empty_|Map from column names to comparators.| Either construct an instance via the constructor … ```scala // Scala import uk.co.gresearch.spark.diff.{DiffOptions, DiffMode} val options = DiffOptions("d", "l", "r", "i", "c", "d", "n", Some("changes"), DiffMode.Default, false) ``` ```python # Python from gresearch.spark.diff import DiffOptions, DiffMode options = DiffOptions("d", "l", "r", "i", "c", "d", "n", "changes", DiffMode.Default, False) ``` … or via the `.with*` methods. The former requires most options to be specified, whereas the latter only requires the ones that deviate from the default, and it is more readable.
Start from the default options `DiffOptions.default` and customize as follows: ```scala // Scala import org.apache.spark.sql.types.DoubleType import uk.co.gresearch.spark.diff.{DiffOptions, DiffMode, DiffComparators} val options = DiffOptions.default .withDiffColumn("d") .withLeftColumnPrefix("l") .withRightColumnPrefix("r") .withInsertDiffValue("i") .withChangeDiffValue("c") .withDeleteDiffValue("d") .withNochangeDiffValue("n") .withChangeColumn("changes") .withDiffMode(DiffMode.Default) .withSparseMode(true) .withDefaultComparator(DiffComparators.epsilon(0.001)) .withComparator(DiffComparators.epsilon(0.001), DoubleType) .withComparator(DiffComparators.epsilon(0.001), "float_column") ``` ```python # Python from pyspark.sql.types import DoubleType from gresearch.spark.diff import DiffOptions, DiffMode, DiffComparators options = DiffOptions() \ .with_diff_column("d") \ .with_left_column_prefix("l") \ .with_right_column_prefix("r") \ .with_insert_diff_value("i") \ .with_change_diff_value("c") \ .with_delete_diff_value("d") \ .with_nochange_diff_value("n") \ .with_change_column("changes") \ .with_diff_mode(DiffMode.Default) \ .with_sparse_mode(True) \ .with_default_comparator(DiffComparators.epsilon(0.01)) \ .with_data_type_comparator(DiffComparators.epsilon(0.001), DoubleType()) \ .with_column_name_comparator(DiffComparators.epsilon(0.001), "float_column") ``` ### Diffing Modes The result of the diff transformation can have the following formats: - *column by column*: The non-id columns are arranged column by column, i.e. for each non-id column there are two columns next to each other in the diff result, one from the left and one from the right dataset. This is useful to easily compare the values for each column. - *side by side*: The non-id columns from the left and right dataset are arranged side by side, i.e. first there are all columns from the left dataset, then from the right one. This is useful to visually compare the datasets as a whole, especially in conjunction with the sparse mode. - *left side*: Only the columns of the left dataset are present in the diff output. This mode provides the left dataset as is, annotated with diff action and optional changed column names. - *right side*: Only the columns of the right dataset are present in the diff output. This mode provides the right dataset as given, as well as the diff action that has been applied to it. This serves as a patch that, applied to the left dataset, results in the right dataset. A mode is selected via `DiffOptions`, as sketched below.
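A minimal sketch, assuming the `DiffMode` values correspond to the mode names above:

```scala
import uk.co.gresearch.spark.diff.{DiffMode, DiffOptions}

// produce the side-by-side format; ColumnByColumn, LeftSide and RightSide
// select the other formats accordingly
val sideBySide = DiffOptions.default.withDiffMode(DiffMode.SideBySide)
left.diff(right, sideBySide, "id").show()
```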
With the following two datasets `left` and `right`: ```scala case class Value(id: Int, value: Option[String], label: Option[String]) val left = Seq( Value(1, Some("one"), None), Value(2, Some("two"), Some("number two")), Value(3, Some("three"), Some("number three")), Value(4, Some("four"), Some("number four")), Value(5, Some("five"), Some("number five")), ).toDS val right = Seq( Value(1, Some("one"), Some("one")), Value(2, Some("Two"), Some("number two")), Value(3, Some("Three"), Some("number Three")), Value(4, Some("four"), Some("number four")), Value(6, Some("six"), Some("number six")), ).toDS ``` the diff modes produce the following outputs: #### Column by Column |diff |id |left_value|right_value|left_label |right_label | |:---:|:---:|:--------:|:---------:|:----------:|:----------:| |C |1 |one |one |*null* |one | |C |2 |two |Two |number two |number two | |C |3 |three |Three |number three|number Three| |N |4 |four |four |number four |number four | |D |5 |five |*null* |number five |*null* | |I |6 |*null* |six |*null* |number six | #### Side by Side |diff |id |left_value|left_label |right_value|right_label | |:---:|:---:|:--------:|:----------:|:---------:|:----------:| |C |1 |one |*null* |one |one | |C |2 |two |number two |Two |number two | |C |3 |three |number three|Three |number Three| |N |4 |four |number four |four |number four | |D |5 |five |number five |*null* |*null* | |I |6 |*null* |*null* |six |number six | #### Left Side |diff |id |value|label | |:---:|:---:|:---:|:----------:| |C |1 |one |null | |C |2 |two |number two | |C |3 |three|number three| |N |4 |four |number four | |D |5 |five |number five | |I |6 |null |null | #### Right Side |diff |id |value|label | |:---:|:---:|:---:|:----------:| |C |1 |one |one | |C |2 |Two |number two | |C |3 |Three|number Three| |N |4 |four |number four | |D |5 |null |null | |I |6 |six |number six | ### Sparse Mode The diff modes above can be combined with sparse mode. In sparse mode, only values that differ between the two datasets are contained in the diff result; all other values are `null`. The above [Column by Column](#column-by-column) example looks as follows in sparse mode: |diff |id |left_value|right_value|left_label |right_label | |:---:|:---:|:--------:|:---------:|:----------:|:----------:| |C |1 |null |null |null |one | |C |2 |two |Two |null |null | |C |3 |three |Three |number three|number Three| |N |4 |null |null |null |null | |D |5 |five |null |number five |null | |I |6 |null |six |null |number six |
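Sparse mode is again a diff option; a minimal sketch, combining it with the column-by-column format (assuming the `DiffMode` values named above):

```scala
import uk.co.gresearch.spark.diff.{DiffMode, DiffOptions}

// column-by-column output where unchanged values are nulled out
val sparse = DiffOptions.default
  .withDiffMode(DiffMode.ColumnByColumn)
  .withSparseMode(true)
left.diff(right, sparse, "id").show()
```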
### Comparators (Equality) Values are compared for equality with the default `<=>` operator, which considers values equal when both sides are `null`, or both sides are not `null` and equal. The following alternative comparators are provided: |Comparator|Description| |:---------|:----------| |`DiffComparators.epsilon(epsilon)`|Two values are equal when they are at most `epsilon` apart. The comparator can be configured to use `epsilon` as an absolute (`.asAbsolute()`) threshold, or as relative (`.asRelative()`) to the larger value. Further, the threshold itself can be considered equal (`.asInclusive()`) or not equal (`.asExclusive()`).| |`DiffComparators.string()`|Two `StringType` values are compared while ignoring whitespace differences. For this comparison, sequences of whitespaces are collapsed into single whitespaces, and leading and trailing whitespaces are removed. With `DiffComparators.string(false)`, string values are compared with the default comparator.| |`DiffComparators.duration(duration)`|Two `DateType` or `TimestampType` values are equal when they are at most `duration` apart. That duration is an instance of `java.time.Duration`. The comparator can be configured to consider `duration` as equal (`.asInclusive()`) or not equal (`.asExclusive()`):